Variants produced by next-generation sequencing are often recorded in variant call format (VCF) files. A VCF file stores the details of each variation, including chromosome location, base sequence, base quality, read depth, and genotype. However, extracting relevant biological insights from a VCF file can be challenging for researchers without a bioinformatics or programming background. Here, we describe an R/Shiny application package named VCFshiny for interpreting and visualizing the variants stored in VCF files in an interactive and user-friendly way. VCFshiny summarizes variant information, including total variant numbers, variant overlap across samples, base alterations of single-nucleotide variants, length distribution of insertions and deletions, variant-related genes, variant distribution in the genome, and local variants in cancer driver genes. For each analysis, we provide multiple visualization methods to help produce an intuitive graph for publishing and sharing.
(1). R (>= 4.2.0).
(2). Shiny (>= 1.6.0).
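The version requirements above can be checked from an R session before installing. The following is a minimal sketch, not part of VCFshiny; `meets_requirement` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper (not part of VCFshiny): compare two version strings.
# package_version() compares component by component, so "4.10.0" is
# correctly treated as newer than "4.2.0" (a plain string comparison is not).
meets_requirement <- function(installed, required) {
  package_version(installed) >= package_version(required)
}

# Check the running R version against the stated minimum.
r_ok <- meets_requirement(as.character(getRversion()), "4.2.0")

# Check shiny only if it is installed; otherwise report FALSE.
shiny_ok <- requireNamespace("shiny", quietly = TRUE) &&
  meets_requirement(as.character(utils::packageVersion("shiny")), "1.6.0")

message("R >= 4.2.0: ", r_ok)
message("shiny >= 1.6.0: ", shiny_ok)
```

If either check reports `FALSE`, update R or install shiny before continuing with the installation steps.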
## Open R first, then install shiny:
install.packages("shiny")
## install.packages("devtools") ## you may need to install devtools first
devtools::install_github("123xiaochen/VCFshiny")
## Load and run the package.
library(VCFshiny)
VCFshiny::startVCFshiny()
In this section, we introduce how to prepare the two supported types of input data:
The Variant Call Format (VCF) is used to record sequence variations. It is also the first file format to understand for population-level genome analysis. First, the whole-genome sequencing reads are mapped to the reference genome; the resulting BAM file is then analyzed with variant calling software such as GATK, together with the reference genome data, to produce the VCF result.
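To make the layout of a VCF file concrete, here is a minimal sketch that reads VCF text into a data frame using base R only. The two records are invented for illustration, and `read_vcf_lines` is a hypothetical helper; this is not how VCFshiny itself parses its input.

```r
# Invented example records in VCF layout (tab-separated).
vcf_text <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
  "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30",
  "chr2\t67890\t.\tCT\tC\t99\tPASS\tDP=42\tGT:DP\t1/1:42"
)

# Hypothetical helper: drop "##" meta lines, keep the "#CHROM" header row.
read_vcf_lines <- function(lines) {
  body <- lines[!startsWith(lines, "##")]
  body[1] <- sub("^#", "", body[1])  # "#CHROM" becomes "CHROM"
  variants <- utils::read.delim(
    text = paste(body, collapse = "\n"),
    header = TRUE,
    colClasses = "character",  # stops alleles such as "T" being read as TRUE
    check.names = FALSE
  )
  variants$POS <- as.integer(variants$POS)
  variants
}

variants <- read_vcf_lines(vcf_text)
print(variants[, c("CHROM", "POS", "REF", "ALT")])
```

For real files, which can be large or compressed, a dedicated VCF reader is usually a better choice than this line-based approach.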
TXT files are one of several output formats produced by Annovar (Wang K, Li M, Hakonarson H. 2010), which annotates genetic variants in various genomes using up-to-date data. Since a VCF file only contains the start position of each mutation, the input data must be adjusted before running Annovar by adding the end position of the mutation after its start position. Gene-based annotations reveal a variant's direct relationship with known genes and its functional impact, while region-based annotations reveal a variant's relationship with specific segments of different genomes.
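The end-position adjustment described above can be sketched as follows. This assumes the end coordinate is the start plus the length of the reference allele minus one; it does not trim bases shared between REF and ALT and does not split multi-allelic records, both of which a real conversion would need to handle. `add_end_position` is a hypothetical helper, and Annovar ships its own conversion script (`convert2annovar.pl`) that should be preferred in practice.

```r
# Hypothetical helper: derive an END column from POS and the REF allele.
# Assumption: END = POS + nchar(REF) - 1 (no allele trimming is performed).
add_end_position <- function(variants) {
  ref <- as.character(variants$REF)
  variants$END <- as.integer(variants$POS) + nchar(ref) - 1L
  variants[, c("CHROM", "POS", "END", "REF", "ALT")]
}

# Invented example: one SNV and one single-base deletion.
example_variants <- data.frame(
  CHROM = c("chr1", "chr2"),
  POS   = c(12345L, 67890L),
  REF   = c("A", "CT"),
  ALT   = c("G", "C"),
  stringsAsFactors = FALSE
)

with_end <- add_end_position(example_variants)
print(with_end)
```

For the SNV, the end equals the start; for the deletion, the end extends over the second reference base.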
All input data must be stored in a single folder, which is then compressed into one file; the files inside the folder must follow the required file name format.
(1). The compressed file name must be the same as the name of the compressed folder.
(2). The compressed file can be in *.tar.gz or *.zip format.
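The two rules above can be followed from R itself. The sketch below packs a folder into an archive that shares the folder's name; `pack_input_folder` is a hypothetical helper, not a VCFshiny function. The `.zip` branch relies on `utils::zip`, which needs an external `zip` program on the system.

```r
# Hypothetical helper: compress `folder` into <folder>.tar.gz or <folder>.zip
# placed next to it, so the archive name matches the folder name.
pack_input_folder <- function(folder, format = c("tar.gz", "zip")) {
  format <- match.arg(format)
  folder <- normalizePath(folder, mustWork = TRUE)
  parent <- dirname(folder)
  name <- basename(folder)
  archive <- file.path(parent, paste0(name, ".", format))

  # Work from the parent directory so the archive stores "<name>/..." paths.
  old_wd <- setwd(parent)
  on.exit(setwd(old_wd), add = TRUE)

  if (format == "tar.gz") {
    utils::tar(archive, files = name, compression = "gzip", tar = "internal")
  } else {
    utils::zip(archive, files = name)  # requires an external zip program
  }
  archive
}

# Usage with a throwaway folder containing one invented file.
demo_root <- tempfile("vcfshiny_demo_")
dir.create(file.path(demo_root, "demo_data"), recursive = TRUE)
writeLines("##fileformat=VCFv4.2",
           file.path(demo_root, "demo_data", "WT-1.snp.vcf"))
archive <- pack_input_folder(file.path(demo_root, "demo_data"))
print(basename(archive))
```

Running the usage lines produces `demo_data.tar.gz` alongside the `demo_data` folder.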
(1). The first box represents the sample name, which can consist of the experimental group and the replicate number, connected by the character "-" or "_".
(2). The second box represents the data type, which can be snp or indel data. When snp and indel are not classified in the data, this box can be absent (I).
(3). The third box represents the data format, which can be vcf files, vcf.gz compressed files, or Annovar annotated TXT files.
(4). The contents of the three boxes are connected by ".".
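The naming rules above can be checked programmatically before uploading. The following is a minimal sketch under stated assumptions: the accepted extensions are taken to be `vcf`, `vcf.gz`, and `txt`, matching is case-insensitive, and sample names contain no "." character. `parse_input_name` is a hypothetical helper, not part of VCFshiny.

```r
# Hypothetical helper: split "<sample>.<type>.<format>" into its parts.
# Returns NULL when the name does not follow the assumed rules.
parse_input_name <- function(file_name) {
  # Locate the data format at the end of the name.
  pos <- as.integer(regexpr("\\.(vcf\\.gz|vcf|txt)$", file_name,
                            ignore.case = TRUE))
  if (pos == -1L) return(NULL)
  data_format <- tolower(substring(file_name, pos + 1L))
  stem <- substring(file_name, 1L, pos - 1L)

  # What remains is either "<sample>" or "<sample>.<type>".
  parts <- strsplit(stem, ".", fixed = TRUE)[[1]]
  if (length(parts) == 1L && nzchar(parts[1])) {
    return(list(sample = parts[1], type = NA_character_, format = data_format))
  }
  if (length(parts) == 2L && nzchar(parts[1]) &&
      tolower(parts[2]) %in% c("snp", "indel")) {
    return(list(sample = parts[1], type = tolower(parts[2]),
                format = data_format))
  }
  NULL
}

# Invented file names for illustration.
str(parse_input_name("WT-1.snp.vcf"))
str(parse_input_name("KO_2.indel.vcf.gz"))
str(parse_input_name("WT-1.txt"))
```

A name that returns `NULL`, such as `WT-1.bam`, does not match the assumed pattern and would need renaming.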
The documentation is available here; it includes a tutorial and an example gallery.
VCFshiny development takes place on GitHub: https://github.com/123xiaochen/VCFshiny
Please submit any reproducible bugs you encounter to the issue tracker.
We will also list the most commonly encountered issues on the FAQ page.