123xiaochen/VCFshiny

Abstract

Variants produced by next-generation sequencing are often recorded in variant call format (VCF) files. A VCF file stores the details of each variant, including chromosome location, base sequence, base quality, read depth, and genotype. However, extracting biological insights from a VCF file can be challenging for researchers without a bioinformatics or programming background. Here, we describe an R/Shiny application package named VCFshiny for interpreting and visualizing the variants stored in VCF files in an interactive and user-friendly way. VCFshiny summarizes variant information, including total variant numbers, variant overlap across samples, base alterations of single-nucleotide variants, length distribution of insertions and deletions, variant-related genes, variant distribution in the genome, and local variants in cancer driver genes. For each analysis, we provide multiple visualization methods to help produce an intuitive graph for publishing and sharing.

Getting Started

Requirements

(1). R (>= 4.2.0).
(2). Shiny (>= 1.6.0)
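
Before installing, the two requirements above can be checked from an R session. The following is a minimal sketch using only base R; the helper name `pkg_ok` is hypothetical and is not part of VCFshiny.

```r
# Minimum versions copied from the Requirements list above.
min_r <- "4.2.0"
min_shiny <- "1.6.0"

# TRUE when the running R session is new enough.
r_ok <- getRversion() >= min_r

# Hypothetical helper: TRUE when `pkg` is installed and at least `min_version`.
pkg_ok <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Report anything that still needs attention.
if (!r_ok) {
  message("R >= ", min_r, " is required; found ", as.character(getRversion()))
}
if (!pkg_ok("shiny", min_shiny)) {
  message("shiny >= ", min_shiny, " is missing or too old")
}
```

If neither message appears, the session already meets both requirements.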

How to install the shiny package:

## Open R first, then run:
install.packages("shiny")

How to install the VCFshiny package:

## You may need to install devtools first:
## install.packages("devtools")
devtools::install_github("123xiaochen/VCFshiny")

Running VCFshiny

## Load and run the package.
library(VCFshiny)
VCFshiny::startVCFshiny()

Prepare Data

This section describes how to prepare the two supported types of input data:

Source of VCF input data

The Variant Call Format (VCF) records sequence variants. It is usually the first file format to understand for population-level genome analysis. To produce a VCF file, whole-genome sequencing reads are first mapped to the reference genome; the resulting BAM file is then analyzed, together with the reference genome, by variant-calling software such as GATK.
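
To make the layout of a VCF file concrete, the sketch below parses a tiny, made-up VCF snippet with base R and labels each record as a SNP or an indel by allele length. The records are invented for illustration, and this is not how VCFshiny reads its input.

```r
# A tiny, made-up VCF: one "##" meta line, the "#CHROM" header, three records.
vcf_text <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
  "chr1\t10177\t.\tA\tAC\t50\tPASS\tDP=20\tGT:DP\t0/1:20",
  "chr1\t10352\t.\tT\tTA\t42\tPASS\tDP=18\tGT:DP\t0/1:18",
  "chr2\t20511\t.\tG\tA\t99\tPASS\tDP=35\tGT:DP\t1/1:35"
)

# Drop the "##" meta lines, then strip the leading "#" from the header line.
body <- vcf_text[!startsWith(vcf_text, "##")]
vcf <- read.table(
  text = sub("^#", "", body), header = TRUE, sep = "\t",
  colClasses = "character", quote = "", comment.char = ""
)
vcf$POS <- as.integer(vcf$POS)

# Single-base REF and ALT means a SNP; any other length combination is an indel.
variant_type <- ifelse(nchar(vcf$REF) == 1 & nchar(vcf$ALT) == 1, "snp", "indel")
print(table(variant_type))
```

Reading every column as character avoids a base R pitfall: a column containing only `T` values would otherwise be converted to logical.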

Source of TXT input data

TXT files are one of several output formats produced by ANNOVAR (Wang K, Li M, Hakonarson H. 2010), a tool that annotates genetic variants in a variety of genomes using up-to-date data. Because a VCF file gives only the start position of each variant, the data must be adjusted before it is passed to ANNOVAR: the end position of the variant is added after the start position. Gene-based annotations reveal a variant's direct relationship with known genes and its functional impact, while region-based annotations reveal a variant's relationship with specific segments of different genomes.
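
The end-position adjustment can be illustrated with a simplified sketch. It assumes the end coordinate is the start plus the length of the reference allele minus one, and it uses made-up variants with illustrative column names. It does not normalize indel alleles the way ANNOVAR's own conversion script does, so treat it as an explanation of the idea rather than a replacement for that step.

```r
# Made-up variants with VCF-style start positions only.
variants <- data.frame(
  Chr   = c("chr1", "chr2"),
  Start = c(10177L, 20511L),
  Ref   = c("ACT", "G"),
  Alt   = c("A", "A"),
  stringsAsFactors = FALSE
)

# Assumption: the variant spans the reference allele, so
# End = Start + length(Ref) - 1.
variants$End <- variants$Start + nchar(variants$Ref) - 1L

# Place End directly after Start, as described in the text above.
avinput <- variants[, c("Chr", "Start", "End", "Ref", "Alt")]
print(avinput)
```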

Input data requirements

All input data must be placed in a single folder, which is then compressed. The files inside the folder must follow the file name format described below.

Compressed input file requirements

(1). The compressed file name must be the same as the name of the compressed folder.
(2). The compressed file can be in *.tar.gz or *.zip format.
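
As one way to meet these two requirements, the sketch below builds a folder in a temporary directory and compresses it to a *.tar.gz archive with the same name using base R. The folder name `example_data` and the file `WT_1.snp.vcf` are placeholders.

```r
# Create a placeholder input folder inside a temporary working directory.
work <- tempfile("vcfshiny_")
dir.create(file.path(work, "example_data"), recursive = TRUE)
writeLines("##fileformat=VCFv4.2",
           file.path(work, "example_data", "WT_1.snp.vcf"))

# Archive from the parent directory so the archive holds "example_data/..."
# and shares its name with the folder.
old <- setwd(work)
tar("example_data.tar.gz", files = "example_data", compression = "gzip")
setwd(old)

# List the archive contents to confirm the folder is at the top level.
listed <- untar(file.path(work, "example_data.tar.gz"), list = TRUE)
print(listed)
```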

Input File Name Requirements

(1). The first field is the sample name, which can combine the experimental group and the replicate number, joined by "-" or "_".
(2). The second field is the data type, which can be snp or indel. When SNPs and indels are not separated in the data, this field can be omitted.
(3). The third field is the data format, which can be a vcf file, a vcf.gz compressed file, or an Annovar-annotated TXT file.
(4). The three fields are joined by ".".
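
The naming rule above can be checked before compressing the folder. The sketch below splits a file name into its three parts with base R regular expressions. The function `parse_input_name` is hypothetical and not part of VCFshiny, and it assumes lowercase extensions (`vcf`, `vcf.gz`, `txt`).

```r
# Hypothetical helper: split "<sample>[.<snp|indel>].<format>" into its parts.
parse_input_name <- function(file_name) {
  format_pattern <- "\\.(vcf\\.gz|vcf|txt)$"
  if (!grepl(format_pattern, file_name)) {
    stop("Unsupported data format: ", file_name)
  }
  # Data format: the matched suffix without its leading dot.
  suffix <- regmatches(file_name, regexpr(format_pattern, file_name))
  data_format <- sub("^\\.", "", suffix)

  # What remains is the sample name, optionally followed by ".snp" or ".indel".
  stem <- sub(format_pattern, "", file_name)
  type_pattern <- "\\.(snp|indel)$"
  if (grepl(type_pattern, stem)) {
    data_type <- sub("^.*\\.", "", stem)
    sample_name <- sub(type_pattern, "", stem)
  } else {
    data_type <- NA_character_  # data type part omitted
    sample_name <- stem
  }
  list(sample = sample_name, data_type = data_type, data_format = data_format)
}

str(parse_input_name("WT_1.snp.vcf"))
str(parse_input_name("KO-2.indel.vcf.gz"))
str(parse_input_name("WT_1.txt"))
```

Matching `vcf.gz` before `vcf` in the pattern keeps a compressed file from being read as a bare `gz` format.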

Documentation

The documentation is available here; it includes a tutorial and an example gallery.

Development

VCFshiny development takes place on GitHub: https://github.com/123xiaochen/VCFshiny

Please submit any reproducible bugs you encounter to the issue tracker.

We will also add the most commonly encountered issues to the FAQ page.