forked from CAG-CNV/DeepCNV

DeepCNV: A Deep Learning Approach for Authenticating Copy Number Variants

NTNguyen13/DeepCNV

 
 


For any questions about this repo, please contact Joe Glessner ([email protected]).

Description

We propose a deep learning approach to remove false positive CNV calls from SNP array and sequencing CNV detection programs. This repo contains the model code and an executable script with five sample inputs. Because the pre-trained model file exceeds the GitHub upload size limit, it can be accessed by this external link. The dataset of this project is not publicly available. blended_learning.py is the training script; you can use it to train the model on your own dataset.

Generate plot images for script

perl visualize_cnv.pl -format plot -signal 200477520001_R06C01.baflrr 200477520001_R06C01.rawcnv;
Typically, the baflrr signal file has the header: Name Chr Position sample.B Allele Freq sample.Log R Ratio;
--snpposfile NameChrPosition.txt can be added if the baflrr signal files provide only the "Name sample.B Allele Freq sample.Log R Ratio" columns;
PennCNV-Seq can be run on sequencing BAM/CRAM files to generate baflrr files;
The rawcnv input file has the format: chr:start-stop numsnp=1 length=1 state2,cn=1 200477520001_R06C01.baflrr startsnp=a endsnp=b;
chr:start-stop and 200477520001_R06C01.baflrr are the only critical fields to specify, which makes the format easy to adapt to most CNV call output formats.
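
Since only the region and the signal file name matter in a rawcnv line, converting calls from another CNV caller mostly means filling in placeholders for the remaining fields. The sketch below is one possible way to do that, not part of the repo: the function names `make_rawcnv_line` and `parse_rawcnv_line` are hypothetical, the coordinates in the usage note are made up, and it assumes the signal file is the only field ending in .baflrr.

```python
import re

# Matches the region field, e.g. "chr1:1000-2000" (illustrative coordinates).
REGION_RE = re.compile(r"^(?:chr)?(?P<chrom>[^:]+):(?P<start>\d+)-(?P<stop>\d+)$")


def make_rawcnv_line(chrom, start, stop, signal_file):
    """Build a minimal rawcnv line using the placeholder fields shown above."""
    return "chr%s:%d-%d numsnp=1 length=1 state2,cn=1 %s startsnp=a endsnp=b" % (
        chrom, start, stop, signal_file)


def parse_rawcnv_line(line):
    """Return (chrom, start, stop, signal_file) from one rawcnv line."""
    fields = line.split()
    match = REGION_RE.match(fields[0])
    if match is None:
        raise ValueError("first field is not chr:start-stop: %r" % fields[0])
    # Assumption: the signal file is the only field ending in ".baflrr".
    signal_files = [f for f in fields[1:] if f.endswith(".baflrr")]
    if len(signal_files) != 1:
        raise ValueError("expected exactly one .baflrr field")
    return (match.group("chrom"), int(match.group("start")),
            int(match.group("stop")), signal_files[0])
```

For example, `make_rawcnv_line("1", 1000, 2000, "200477520001_R06C01.baflrr")` produces a line that `parse_rawcnv_line` reads back into the same four values.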

Different Script Purposes and Iterations

The 3 .py files in the main directory and script directory are the first version script implementations (except for visualize_cnv.pl which remains constant).
Array (Initial Implementation): ./blended_learning.py and ./script/run.py
Array (Published in Briefings in Bioinformatics Implementation): ./DeepCNVv2/blend.py and ./DeepCNVv2/blend_pred.py
Sequencing (Initial Implementation): ./predict.py
Sequencing (Published in Briefings in Bioinformatics Implementation): ./DeepCNV_Seqv2/train.py and ./DeepCNV_Seqv2/predict.py

The command line arguments differ between scripts. The Array scripts use blended learning, combining JPGs with manual PASS/FAIL labels and CNV calling quality control metric values (metadata) taken from the summary lines of the PennCNV detect_cnv.pl log file. The Sequencing scripts do not use blended learning: they train only on JPGs with manual PASS/FAIL labels and then predict on additional JPGs. See the column Command (Argument Descriptions) below.

| DeepCNV Python Script | Input Data Type | Command (Argument Descriptions) | Comment |
| --- | --- | --- | --- |
| ./blended_learning.py | Array | python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output (1st argument: JPG_dir, 2nd argument: metadata_dir, 3rd argument: model_name, 4th argument: result_name) | Train and test a new model and output performance statistics |
| ./script/run.py | Array | python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 (1st argument: image_dir, 2nd argument: metadata_dir, 3rd argument: output_dir, 4th argument: model_path) | Array Genotyping (Original) |
| ./predict.py | Sequencing | python predict.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied. NOTE: model_name is defined in code as model/best_model_1_1.hdf5 but could be modified) | Sequencing Data (BAF/LRR signal files for each sample compiled by PennCNV-Seq convert_map2signal.pl based on the 1KG CRAMs.) |
| ./DeepCNVv2/blend.py | Array | python blend.py data/JPG/ data/ metadata.csv train_sample.csv val_sample.csv DeepCNV\(v.2\)/best_model_0.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: metadata_folder, 3rd argument: metadata_file, 4th argument: train_id_file, 5th argument: val_id_file, 6th argument: saved_model_name, 7th argument: results_file) | To train the model, version Cheng Zhong used for the BIB paper |
| ./DeepCNVv2/blend_pred.py | Array | python blend_pred.py data/JPG/ data/ metadata.csv val_sample.csv DeepCNV\(v.2\)/best_model_0.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: metadata_folder, 3rd argument: metadata_file, 4th argument: val_id_file, 5th argument: saved_model_name, 6th argument: results_file) | To make predictions with a pretrained model, version Cheng Zhong used for the BIB paper |
| ./DeepCNV_Seqv2/train.py | Sequencing | python train.py data/ DeepCNV_Seq(v.2)/best_model_seq.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: saved_model_name, 3rd argument: results_file) | To train the model, version Cheng Zhong used for the BIB paper |
| ./DeepCNV_Seqv2/predict.py | Sequencing | python predict.py data/ DeepCNV_Seq(v.2)/best_model_seq.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: saved_model_name, 3rd argument: results_file) | To make predictions with a pretrained model, version Cheng Zhong used for the BIB paper |

Array vs. Sequencing based input impact on algorithm convolution blocks

The difference between Array with metadata (./DeepCNVv2/blend.py) and Sequencing without metadata (./DeepCNV_Seqv2/train.py) is why the CNN part of the model described in the article has two convolution blocks with 64 feature layers followed by two more blocks with 128 feature layers (Array with metadata), whereas the model in ./DeepCNV_Seqv2/train.py has only one of these convolution blocks (Sequencing without metadata).
Optionally, if you run PennCNV-Seq convert_map2signal.pl on the BAMs/CRAMs of your samples to make BAF/LRR signal files for each sample, PennCNV detect_cnv.pl can then be run on the sequencing data to produce the CNV calling quality control metric values (metadata) from the summary lines of its log file. DeepCNV blended learning could then be applied to the sequencing data in much the same way as to array data.
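
To turn log file summary lines into metadata columns, the metric values first have to be pulled out of the log text. The following is a minimal sketch, not the repo's own preprocessing: it assumes each summary line carries its metrics as KEY=VALUE tokens, and the function name `parse_qc_summary`, the sample line, and its values are hypothetical illustrations.

```python
import re

# Assumption: metrics appear as KEY=VALUE tokens with numeric values.
PAIR_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_qc_summary(line):
    """Collect numeric KEY=VALUE tokens from one log line into a dict."""
    return {key: float(value) for key, value in PAIR_RE.findall(line)}


# Hypothetical summary line; the metric values are made up for illustration.
example = ("NOTICE: quality summary for sample1.baflrr: "
           "LRR_SD=0.1338 BAF_drift=0.000037 WF=-0.0156")
metrics = parse_qc_summary(example)
```

Each resulting dict can then be written as one row of a metadata CSV, keyed by the sample's signal file name.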

Image Dimensions

The JPG output from PennCNV visualize_cnv.pl is 900x900 pixels. We initially scaled that down to 300x300 pixels to limit computational requirements and runtime.
In predict.py, DeepCNV_Seqv2/train.py, and DeepCNV_Seqv2/predict.py:
target_size = (300, 300)
After subsequent code optimization and hdf5 model file size minimization, the image downsizing was no longer needed, as shown in blended_learning.py, DeepCNVv2/blend.py, and DeepCNVv2/blend_pred.py:
dim=(900,900)
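
A quick calculation shows how much the downscaling reduces the input per image. This is only illustrative arithmetic: the helper name `input_values` is hypothetical, and it assumes the JPGs are read as 3-channel color images.

```python
def input_values(dim, channels=3):
    """Number of values in one image tensor of shape (height, width, channels)."""
    height, width = dim
    return height * width * channels


full = input_values((900, 900))    # dim used by the array scripts
small = input_values((300, 300))   # target_size used by the sequencing scripts
ratio = full // small              # how many times smaller the 300x300 input is
```

Under that assumption, a 900x900 image holds 2,430,000 values and a 300x300 image holds 270,000, a 9-fold reduction per image.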

Run script

  1. Download the pre-trained model file from this link;
  2. Download the script folder;
  3. Copy the model file into the script folder;
  4. Enter the script folder from Terminal;
  5. Check the package requirements. Different package versions may produce different results;
  6. Create the output folder with mkdir output;
  7. Run python run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5;
  8. Check the results in the output folder.
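
Steps 6 and 7 can be wrapped in a small helper that checks the inputs before launching the script. This is a sketch under stated assumptions rather than part of the repo: the function name `build_run_command` is hypothetical, and the argument order is taken from the run.py command in step 7.

```python
import os
import sys


def build_run_command(image_dir, metadata_csv, output_dir, model_path,
                      script="run.py"):
    """Check inputs, create the output folder, and return the argv list."""
    missing = [p for p in (image_dir, metadata_csv, model_path)
               if not os.path.exists(p)]
    if missing:
        raise IOError("missing input(s): %s" % ", ".join(missing))
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)  # same effect as step 6 (mkdir output)
    # Argument order follows step 7: images, metadata, output, model.
    return [sys.executable, script, image_dir, metadata_csv, output_dir,
            model_path]
```

The returned list can be passed to `subprocess.check_call` from inside the script folder, for example `subprocess.check_call(build_run_command("./data/JPG", "./data/samples.csv", "./output", "./DeepCNV.hdf5"))`.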

Package Requirements

python 2.7.12
pandas 0.17.1
numpy 1.11.0
tensorflow 1.12.0
keras 2.2.4
cv2 2.4.9.1

Download Miniconda2 installer

wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh 

Install in Silent Mode 

bash Miniconda2-latest-Linux-x86_64.sh -b -p $HOME/miniconda2 

Install required python libraries

pip install pandas numpy tensorflow==1.12.0 keras==2.2.4 opencv-python

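Alternative: install required python libraries with pinned versions (to prevent future version incompatibility)

pip install pandas==0.17.1 numpy==1.11.0 tensorflow==1.12.0 keras==2.2.4 cv2==2.4.9.1

Note: cv2 is the import name of OpenCV, which the previous command installs as opencv-python, so the cv2==2.4.9.1 pin may need to be adapted to an available opencv-python version.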
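
Because different package versions may give different results, it can help to compare the active environment against the versions listed under Package Requirements before running anything. The sketch below is one way to do that; the names `PINNED`, `installed_versions`, and `find_mismatches` are hypothetical, and it assumes each module exposes a `__version__` attribute.

```python
import importlib

# Pinned versions copied from the Package Requirements list.
PINNED = {
    "pandas": "0.17.1",
    "numpy": "1.11.0",
    "tensorflow": "1.12.0",
    "keras": "2.2.4",
    "cv2": "2.4.9.1",
}


def installed_versions(names):
    """Import each module and read its __version__ (None if unavailable)."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            found[name] = None
    return found


def find_mismatches(installed, pinned=PINNED):
    """Return {name: (installed_version, pinned_version)} for each difference."""
    return {name: (installed.get(name), wanted)
            for name, wanted in pinned.items()
            if installed.get(name) != wanted}
```

Calling `find_mismatches(installed_versions(PINNED))` returns an empty dict when the environment matches the list exactly, and otherwise names each package whose version differs or is missing.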

| DeepCNV Python Script | DeepCNV hdf5 Model File | Command (Full) | Command (Short) | Comment | Date Modified |
| --- | --- | --- | --- | --- | --- |
| run_DeepCNV_3.py | Joe_Batch1To6_model.h5 (also named DeepCNV.hdf5) | python run_DeepCNV_3.py ./input_x10 ./input_x10_output | python run_DeepCNV_3.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied) | Array Genotyping (Original) | 9/4/2019 |
| predict.py | model/best_model_1_1.hdf5 (also named DeepCNVSeq.hdf5) (model_name defined in code) | python predict.py ./input_x10 ./input_x10_output (Put positive images in input_x10/1 and negative images in input_x10/0) | python predict.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied) | Sequencing Data (BAF/LRR signal files for each sample compiled by PennCNV-Seq convert_map2signal.pl based on the 1KG CRAMs.) | 8/7/2020 |
| script/run.py | DeepCNV.hdf5 | python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 | python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 (1st argument: image_dir, 2nd argument: metadata_dir, 3rd argument: output_dir, 4th argument: model_path) | . | 11/22/2019 |
| blended_learning.py | NA | python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output | python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output (1st argument: JPG_dir, 2nd argument: metadata_dir, 3rd argument: model_name, 4th argument: result_name) | . | 11/22/2019 |
| DeepCNVv2/blend.py | DeepCNVv2/best_model_0.hdf5 | python blend.py data/JPG/ data/ metadata.csv train_sample.csv val_sample.csv res/bmodel.hdf5 res/res.csv | python blend.py img_folder metadata_folder metadata_file train_id_file val_id_file saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/7/2020 |
| DeepCNVv2/blend_pred.py | DeepCNVv2/best_model_0.hdf5 | python blend_pred.py data/JPG/ data/ metadata.csv val_sample.csv best_model_0.hdf5 res/res.csv | python blend_pred.py img_folder metadata_folder metadata_file val_id_file saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/6/2020 |
| DeepCNV_Seqv2/train.py | DeepCNV_Seqv2/best_model_seq.hdf5 | python train.py data/ res/model.hdf5 res/res.csv | python train.py img_folder saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/9/2020 |
| DeepCNV_Seqv2/predict.py | DeepCNV_Seqv2/best_model_seq.hdf5 | python predict.py data/ res/model.hdf5 res/res.csv | python predict.py img_folder saved_model_name results_file | version Cheng Zhong used for the BIB paper | 11/8/2020 |

| Model File | Size | Date (Month Day Year) |
| --- | --- | --- |
| Joe_Batch1To6_model.h5 | 65M | Apr 8 2019 |
| Batch4_train_on_all.h5 | 149M | Nov 20 2018 |
| Batch4_2.h5 | 149M | Nov 20 2018 |
| DeepCNVSeq.hdf5 | 4.8M | Aug 7 2020 |
| model6.h5 | 149M | Sep 13 2018 |
| best_model_0.hdf5 | 19M | Nov 6 2020 |
| best_model_seq.hdf5 | 3.2M | Mar 26 2020 |
