For any question about this repo, please contact Joe Glessner ([email protected]).
We propose a deep learning approach to remove false positive CNV calls from SNP array and sequencing CNV detection programs. This repo contains the model code and an executable script with five sample inputs. Since the pre-trained model file exceeds the upload size limit of GitHub, it can be accessed by this external link. The dataset of this project is not publicly available. blended_learning.py is the training script; you can use it to train the model on your own dataset.
perl visualize_cnv.pl -format plot -signal 200477520001_R06C01.baflrr 200477520001_R06C01.rawcnv
Typically the baflrr signal file has the header: Name Chr Position sample.B Allele Freq sample.Log R Ratio.
The option
--snpposfile NameChrPosition.txt
can be added if the baflrr signal files provide only the "Name sample.B Allele Freq sample.Log R Ratio" columns.
PennCNV-Seq can be run on sequencing BAM/CRAM files to generate baflrr files.
The rawcnv input file has lines of the form: chr:start-stop numsnp=1 length=1 state2,cn=1 200477520001_R06C01.baflrr startsnp=a endsnp=b
chr:start-stop and the signal file name (here 200477520001_R06C01.baflrr) are the only critical fields, which makes the format easy to adapt to the output of most CNV callers.
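Because only the region and the signal file name are critical, calls from another CNV caller can be converted to rawcnv lines with a small helper. The following is a minimal sketch, not part of this repo: the function names are hypothetical, the placeholder values for the remaining fields are copied from the example line above, and the literal `chr` prefix and the position of the signal file as the fifth field are assumptions based on that example.

```python
import re

# Leading "chr:start-stop" field of a rawcnv line (the "chr" prefix is optional here).
REGION_RE = re.compile(r"^(?:chr)?(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<stop>\d+)$")


def format_rawcnv_line(chrom, start, stop, signal_file):
    """Build a rawcnv line; every field except region and signal file is a placeholder."""
    chrom = str(chrom)
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # avoid writing "chrchr1"
    return ("chr{0}:{1}-{2} numsnp=1 length=1 state2,cn=1 {3} startsnp=a endsnp=b"
            .format(chrom, start, stop, signal_file))


def parse_rawcnv_line(line):
    """Return (chrom, start, stop, signal_file) from a rawcnv line."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 fields, got %d" % len(fields))
    match = REGION_RE.match(fields[0])
    if match is None:
        raise ValueError("unrecognised region field: %r" % fields[0])
    # Assumption: the signal file is the fifth whitespace-separated field.
    return (match.group("chrom"), int(match.group("start")),
            int(match.group("stop")), fields[4])
```

Writing one such line per call, with the matching baflrr file name, should be enough input for the visualize_cnv.pl command shown above.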
The 3 .py files in the main directory and script directory are the first-version script implementations; visualize_cnv.pl is the same across versions.
Array (Initial Implementation): ./blended_learning.py and ./script/run.py
Array (Published in Briefings in Bioinformatics Implementation): ./DeepCNVv2/blend.py and ./DeepCNVv2/blend_pred.py
Sequencing (Initial Implementation): ./predict.py
Sequencing (Published in Briefings in Bioinformatics Implementation): ./DeepCNV_Seqv2/train.py and ./DeepCNV_Seqv2/predict.py
The command line arguments differ between scripts. The array scripts use blended learning, combining JPGs with manual PASS/FAIL labels and CNV calling quality control metric values taken from the summary lines of the PennCNV detect_cnv.pl log file. The sequencing scripts do not use blended learning: they train only on JPGs with manual PASS/FAIL labels and then predict on additional JPGs. See the column Command (Argument Descriptions) below.
DeepCNV Python Scripts | InputDataType | Command (Argument Descriptions) | Comment |
---|---|---|---|
./blended_learning.py | Array | python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output (1st argument: JPG_dir,2nd argument: metadata_dir,3rd argument: model_name, 4th argument: result_name) | Train and Test New Model and output performance statistics |
./script/run.py | Array | python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 (1st argument: image_dir, 2nd argument: metadata_dir, 3rd argument: output_dir, 4th argument: model_path) | Array Genotyping (Original) |
./predict.py | Sequencing | python predict.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied, NOTE: model_name defined in code as model/best_model_1_1.hdf5 but could be modified) | Sequencing Data (BAF/LRR signal files for each sample compiled by PennCNV-Seq convert_map2signal.pl based on the 1KG CRAMs.) |
./DeepCNVv2/blend.py | Array | python blend.py data/JPG/ data/ metadata.csv train_sample.csv val_sample.csv DeepCNV\(v.2\)/best_model_0.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: metadata_folder, 3rd argument: metadata_file, 4th argument: train_id_file, 5th argument: val_id_file, 6th argument: saved_model_name, 7th argument: results_file) | To train the model, version Cheng Zhong used for the BIB paper |
./DeepCNVv2/blend_pred.py | Array | python blend_pred.py data/JPG/ data/ metadata.csv val_sample.csv DeepCNV\(v.2\)/best_model_0.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: metadata_folder, 3rd argument: metadata_file, 4th argument: val_id_file, 5th argument: saved_model_name, 6th argument: results_file) | To make prediction with a pretrained model, version Cheng Zhong used for the BIB paper |
./DeepCNV_Seqv2/train.py | Sequencing | python train.py data/ DeepCNV_Seq(v.2)/best_model_seq.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: saved_model_name, 3rd argument: results_file) | To train the model, version Cheng Zhong used for the BIB paper |
./DeepCNV_Seqv2/predict.py | Sequencing | python predict.py data/ DeepCNV_Seq(v.2)/best_model_seq.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: saved_model_name, 3rd argument: results_file) | To make prediction with a pretrained model, version Cheng Zhong used for the BIB paper |
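Since every script takes positional arguments in a different order, it can help to assemble the command from named values. The sketch below is a hypothetical helper, not part of this repo: the argument names are copied from the table above, script paths are written relative to the repo root (an assumption; the table invokes some scripts from inside their own folder), and ./predict.py is left out because the table does not name its arguments.

```python
# Positional argument names per script, in order, as listed in the table above.
SCRIPT_ARGS = {
    "blended_learning.py": ["JPG_dir", "metadata_dir", "model_name", "result_name"],
    "script/run.py": ["image_dir", "metadata_dir", "output_dir", "model_path"],
    "DeepCNVv2/blend.py": ["img_folder", "metadata_folder", "metadata_file",
                           "train_id_file", "val_id_file",
                           "saved_model_name", "results_file"],
    "DeepCNVv2/blend_pred.py": ["img_folder", "metadata_folder", "metadata_file",
                                "val_id_file", "saved_model_name", "results_file"],
    "DeepCNV_Seqv2/train.py": ["img_folder", "saved_model_name", "results_file"],
    "DeepCNV_Seqv2/predict.py": ["img_folder", "saved_model_name", "results_file"],
}


def build_command(script, **values):
    """Return the argv list for a script, with the values placed in the documented order."""
    names = SCRIPT_ARGS[script]
    missing = [name for name in names if name not in values]
    unknown = sorted(set(values) - set(names))
    if missing or unknown:
        raise ValueError("missing: %s; unknown: %s" % (missing, unknown))
    return ["python", script] + [str(values[name]) for name in names]
```

The returned list can be passed to `subprocess.call`; a misspelled or forgotten argument raises an error before any model is loaded.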
This difference between the array pipeline with metadata (./DeepCNVv2/blend.py) and the sequencing pipeline without metadata (./DeepCNV_Seqv2/train.py) is why the CNN part of the model described in the article has two convolution blocks with 64 feature layers followed by two more blocks with 128 feature layers (array with metadata), whereas the model in ./DeepCNV_Seqv2/train.py has only one of these convolution blocks (sequencing without metadata).
Optionally, if PennCNV-Seq convert_map2signal.pl is used on the BAMs/CRAMs of your samples to make BAF/LRR signal files for each sample, PennCNV detect_cnv.pl can then be run on the sequencing data to produce CNV calling quality control metric values from the summary lines of its log file. DeepCNV blended learning could then be applied to the sequencing data in much the same way as to array data.
The JPG output from PennCNV visualize_cnv.pl is 900x900 pixels. We initially scaled it down to 300x300 to limit computational requirements and runtime.
In predict.py, DeepCNV_Seqv2/train.py, and DeepCNV_Seqv2/predict.py:
target_size = (300, 300)
After subsequent code optimization and hdf5 model file size minimization, the image downsizing was no longer needed, as shown in blended_learning.py, DeepCNVv2/blend.py, and DeepCNVv2/blend_pred.py:
dim=(900,900)
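The effect of the two image sizes on the input size can be worked out directly. This is only an illustrative calculation; the helper name is hypothetical and the 3 colour channels are an assumption (the scripts may read the JPGs differently).

```python
def input_values(side, channels=3):
    """Number of values in one square image of the given side length."""
    return side * side * channels


full = input_values(900)    # images as produced by visualize_cnv.pl
small = input_values(300)   # images after the initial downscaling
print(full, small, full // small)  # 2430000 270000 9
```

Under these assumptions, the 300x300 setting feeds the network nine times fewer values per image than the 900x900 setting.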
- Download the pre-trained model file from this link;
- Download the script folder;
- Copy the model file into the script folder;
- Enter the script folder from a terminal;
- Check the package requirements. Different package versions may generate different results;
- Create the output folder:
mkdir output
- Run:
python run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5
- Check the results in the output folder.
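The last steps above can be sketched as a small pre-flight check that fails early when an input is missing. This is a hypothetical helper, not part of the repo; it only checks that the paths exist, creates the output folder, and returns the command from the steps above without running it.

```python
import os


def prepare_run(image_dir, metadata_csv, output_dir, model_path):
    """Check the inputs, create the output folder, and return the run.py command."""
    problems = []
    if not os.path.isdir(image_dir):
        problems.append("image folder not found: " + image_dir)
    for path in (metadata_csv, model_path):
        if not os.path.isfile(path):
            problems.append("file not found: " + path)
    if problems:
        raise IOError("; ".join(problems))
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)  # same effect as "mkdir output"
    return ["python", "run.py", image_dir, metadata_csv, output_dir, model_path]
```

From inside the script folder, `subprocess.call(prepare_run("./data/JPG", "./data/samples.csv", "./output", "./DeepCNV.hdf5"))` would then start the prediction.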
python 2.7.12
pandas 0.17.1
numpy 1.11.0
tensorflow 1.12.0
keras 2.2.4
cv2 2.4.9.1
wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b -p $HOME/miniconda2
pip install pandas numpy tensorflow==1.12.0 keras==2.2.4 opencv-python
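Alternatively, install the required Python libraries with strictly pinned versions to guard against future version incompatibilities. Note that cv2 is the import name of OpenCV, while the pip package is opencv-python (as in the command above), so the cv2 pin below may need to be adapted to your environment:
pip install pandas==0.17.1 numpy==1.11.0 tensorflow==1.12.0 keras==2.2.4 cv2==2.4.9.1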
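Since different package versions may generate different results, it can be worth comparing the active environment against the versions listed above before running anything. The sketch below is a hypothetical helper, not part of the repo; it assumes each library exposes a `__version__` attribute and treats a library that cannot be imported as missing.

```python
import importlib

# Versions from the requirements list above, keyed by import name.
PINNED = {
    "pandas": "0.17.1",
    "numpy": "1.11.0",
    "tensorflow": "1.12.0",
    "keras": "2.2.4",
    "cv2": "2.4.9.1",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def version_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each module whose version differs from its pin to (expected, found)."""
    report = {}
    for name, expected in sorted(pinned.items()):
        found = lookup(name)
        if found != expected:
            report[name] = (expected, found)
    return report
```

Calling `version_mismatches()` with no arguments checks the current environment; an empty dictionary means every listed library matches its pinned version.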
DeepCNV Python Script | DeepCNV hdf5 Model File | Command (Full) | Command (Short) | Comment | Date Modified |
---|---|---|---|---|---|
run_DeepCNV_3.py | Joe_Batch1To6_model.h5 (also named DeepCNV.hdf5) | python run_DeepCNV_3.py ./input_x10 ./input_x10_output | python run_DeepCNV_3.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied) | Array Genotyping (Original) | 9/4/2019 |
predict.py | model/best_model_1_1.hdf5 (also named DeepCNVSeq.hdf5) (model_name defined in code) | python predict.py ./input_x10 ./input_x10_output (Put positive images in input_x10/1 and negative images in input_x10/0) | python predict.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied) | Sequencing Data (BAF/LRR signal files for each sample compiled by PennCNV-Seq convert_map2signal.pl based on the 1KG CRAMs.) | 8/7/2020 |
script/run.py | DeepCNV.hdf5 | python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 | python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 (1st argument: image_dir, 2nd argument: metadata_dir, 3rd argument: output_dir, 4th argument: model_path) | . | 11/22/2019 |
blended_learning.py | NA | python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output | python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output (1st argument: JPG_dir,2nd argument: metadata_dir,3rd argument: model_name, 4th argument: result_name) | . | 11/22/2019 |
DeepCNVv2/blend.py | DeepCNVv2/best_model_0.hdf5 | python blend.py data/JPG/ data/ metadata.csv train_sample.csv val_sample.csv res/bmodel.hdf5 res/res.csv | python blend.py img_folder metadata_folder metadata_file train_id_file val_id_file saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/7/2020 |
DeepCNVv2/blend_pred.py | DeepCNVv2/best_model_0.hdf5 | python blend_pred.py data/JPG/ data/ metadata.csv val_sample.csv best_model_0.hdf5 res/res.csv | python blend_pred.py img_folder metadata_folder metadata_file val_id_file saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/6/2020 |
DeepCNV_Seqv2/train.py | DeepCNV_Seqv2/best_model_seq.hdf5 | python train.py data/ res/model.hdf5 res/res.csv | python train.py img_folder saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/9/2020 |
DeepCNV_Seqv2/predict.py | DeepCNV_Seqv2/best_model_seq.hdf5 | python predict.py data/ res/model.hdf5 res/res.csv | python predict.py img_folder saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/8/2020 |
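For the scripts that copy the input JPGs into pos and neg folders, a quick tally of those folders gives a first impression of a prediction run. This is a minimal sketch with hypothetical function names; it assumes the subfolders are literally named `pos` and `neg` and that the images end in `.jpg` or `.jpeg`.

```python
import os


def count_jpgs(folder):
    """Count JPG files directly inside a folder; a missing folder counts as 0."""
    if not os.path.isdir(folder):
        return 0
    return sum(1 for name in os.listdir(folder)
               if name.lower().endswith((".jpg", ".jpeg")))


def summarize_predictions(output_dir):
    """Return the number of JPGs copied into the pos and neg subfolders."""
    return {label: count_jpgs(os.path.join(output_dir, label))
            for label in ("pos", "neg")}
```

For example, `summarize_predictions("./input_x10_output")` returns a dictionary with one count per label, which can be cross-checked against the rows of res.csv.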
Model File | Size | Month | Day | Year |
---|---|---|---|---|
Joe_Batch1To6_model.h5 | 65M | Apr | 8 | 2019 |
Batch4_train_on_all.h5 | 149M | Nov | 20 | 2018 |
Batch4_2.h5 | 149M | Nov | 20 | 2018 |
DeepCNVSeq.hdf5 | 4.8M | Aug | 7 | 2020 |
model6.h5 | 149M | Sep | 13 | 2018 |
best_model_0.hdf5 | 19M | Nov | 6 | 2020 |
best_model_seq.hdf5 | 3.2M | Mar | 26 | 2020 |