Variant Ensemble Filter, an ensemble based filter for VCF files.
VEF is designed for filtering variants of single non-cancerous samples.
bioRxiv: https://doi.org/10.1101/540286
All VCF files used for analysis of VEF can be accessed here: https://doi.org/10.13012/B2IDB-9401259_V1
Installation in a virtual environment is recommended, e.g., conda.
If you're not using a virtual environment and encounter permission issues when installing packages, please try installing with --user
option, e.g., pip install --user -r requirement.txt
.
-
Download
git clone https://github.com/ChuanyiZ/vef.git
-
install requirements
cd vef pip install -r requirements.txt
-
install scikit-allel
pip install scikit-allel
-
install VEF using pip
pip install ./
Example python scripts are located in the example/
directory. There's a test.sh
script to run them. VCF files for testing are located in example/data
, including chromosome 11 and 20 of Human sample NA12878 (HG001) (SNPs and INDELs are already separated). You can also download the corresponding BAM files from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/10XGenomics/ and generate the VCF files on your own by running GATK's best practice pipeline.
cd example
./test.sh
This bash script calls example script vef_clf.py
and vef_apl.py
located in the example/
directory to train and apply the filter, respectively. After running test.sh
, there will be 2 filter models in example/data
with the .clf
extension, along with 2 filtered VCF files with the .vef.vcf
extension.
The generated *.vef.vcf
files include a VEF
score in the INFO field for each variant. This score represents the probability of the variant being correct, calculated by the model. In particular, if VEF
score. VEF also specifies VEF_FILTERED
in the FILTER field of the variants that VEF estimates are incorrectly called.
Use GATK SelectVariants tool to split the target VCF file into two files containing SNPs and INDELs separately. For example, if we have a target VCF file specimen.vcf
:
java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V specimen.vcf -selectType SNP -o specimen.snp.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V specimen.vcf -selectType INDEL -o specimen.indel.vcf
For detailed documentation of SelectVariants
, please go to GATK's site
Input files:
gold_standard.vcf
,gold_standard.bed
,reference.fasta
: for example, GIAB's NA12878 Gold Standard's VCF and BED files, and the corresponding reference file.specimen.snp.vcf
orspecimen.indel.vcf
: for example, a VCF file from sample NA12878 produced by the sequencing technology and analysis pipeline of interest.
Steps:
-
Use hap.py to compare
specimen.snp.vcf
(orspecimen.indel.vcf
) against thegold_standard.vcf
, and outputcompare.snp.vcf
(orcompare.indel.vcf
). Example:python2 hap.py gold_standard.vcf specimen.snp.vcf -f gold_standard.bed -o compare.snp.vcf -r reference.fasta --no-roc
-
Use
vef_clf.py
in theexample/
folder (recommended), where you can specify number of trees (default 150) and the ensemble method (default RF).python vef_clf.py --happy compare.snp.vcf --target specimen.snp.vcf --mode SNP --kind <MODEL_NAME> --n <NUM_TREES> required named arguments: --happy HAPPY annotated training VCF file (from hap.py) --target TARGET training VCF file --mode {SNP,INDEL} mode: SNP or INDEL optional arguments: -n NUM_TREES, --num_trees NUM_TREES number of trees, default = 150 --kind {RF,RandomForest,AB,AdaBoost,GB,GradientBoost} kind of ensemble method, available options: RandomForest (RF), AdaBoost (AB), GradientBoost(GB); default = RF -h, --help show this help message and exit
MODE must be either SNP or INDEL, model is RF (Random Forest), AB (AdaBoost), or GB (GradientBoost), and NUM_TREES should be an integer. By default, RF with 150 base learners is used. See help message for details.
Otherwise you can directly write your own Python script, example snippet as follows.
from vef import VCFDataset, Classifier
dataset = VCFDataset(path_to_compare_vcf, path_to_specimen_vcf, 'SNP')
X, y = dataset.get_dataset('*')
clf = Classifier(dataset.features, num_trees, "RF")
clf.fit(X, y)
clf.save(path_to_specimen_vcf + ".vef_snp.clf")
Input files:
classifier.clf
: pre-trained classifier.target.snp.vcf
ortarget.indel.vcf
: target VCF file.
Use vef_apl.py
in example/
folder (recommended).
python vef_apl.py --clf_file classifier.clf --subject target.snp.vcf --suffix vef --mode SNP
optional arguments:
-h, --help show this help message and exit
required named arguments:
--clf_file CLF_FILE VEF filter pre-trained model
--subject SUBJECT target VCF file
--suffix SUFFIX suffix of filtered VCF file
--mode {SNP,INDEL} mode: SNP or INDEL
MODE must be either SNP or INDEL.
Alternatively, use the VEF package to apply the pre-trained model on the target VCF file. Example snippet as follows.
from vef import VCFApply, Classifier
clf = Classifier.load(clf_file)
apply = VCFApply(target_vcf, clf, 'SNP')
apply.apply()
apply.write_filtered(target_vcf + ".vef_snp.vcf")
See example script vef_grid_search.py
in the example/
folder.
This script will run K-fold cross-validation on the given VCF files and determine the best number of trees and learning rate.
Input files: (same as training)
gold_standard.vcf
,gold_standard.bed
,reference.fasta
: for example, GIAB's NA12878 Gold Standard's VCF and BED files, and the corresponding reference file.specimen.snp.vcf
orspecimen.indel.vcf
: for example, a VCF file from sample NA12878 produced by the sequencing technology and analysis pipeline of interest.
Steps: (similar to training)
-
Use hap.py to compare
specimen.snp.vcf
(orspecimen.indel.vcf
) againstgold_standard.vcf
, outputcompare.snp.vcf
(orcompare.indel.vcf
).python2 hap.py gold_standard.vcf specimen.vcf -f gold_standard.bed -o compare.vcf -r reference.fasta --no-roc
-
Use
vef_grid_search.py
inexample/
folder, where you can specify the ensemble method (default RF).python vef_grid_search.py --happy compare.snp.vcf --target specimen.snp.vcf --mode SNP --fold 5 --jobs 2 --kind <MODEL_NAME> optional arguments: -h, --help show this help message and exit required named arguments: --happy HAPPY annotated training VCF file (by hap.py) --target TARGET training VCF file --mode MODE mode: SNP or INDEL optional arguments: --fold FOLD number of k-fold cross validation, default = 5 --jobs JOBS number of parallel processes, default = 1 --kind {RF,RandomForest,AB,AdaBoost,GB,GradientBoost} kind of ensemble method, available options: RandomForest (RF), AdaBoost (AB), GradientBoost(GB); default = RF
MODE must be either SNP or INDEL; FOLD and JOBS are integers, and the number of parallel jobs (processes) should be less than or equal to the number of virtual cores on your computer.