DeepCNV: A Deep Learning Approach for Authenticating Copy Number Variants

For any question about this repo, please contact Joe Glessner ([email protected]).

Description

We propose a deep learning approach to remove false positive CNV calls from SNP array and sequencing CNV detection programs. This repo contains the model code and an executable script with five sample inputs. Since the pre-trained model file exceeds the upload size limit of GitHub, it can be accessed by this external link. The dataset of this project is not publicly available. blended_learning.py is the training script; you can use it to train the model on your own dataset.

Generate plot images for script

perl visualize_cnv.pl -format plot -signal 200477520001_R06C01.baflrr 200477520001_R06C01.rawcnv;
Typically the baflrr signal file has the header: Name Chr Position sample.B Allele Freq sample.Log R Ratio;
--snpposfile NameChrPosition.txt can be added if the baflrr signal files provide only the "Name sample.B Allele Freq sample.Log R Ratio" columns;
PennCNV-Seq can be run on sequencing BAM/CRAM files to generate baflrr files;
Each line of the rawcnv input file has the form: chr:start-stop numsnp=1 length=1 state2,cn=1 200477520001_R06C01.baflrr startsnp=a endsnp=b;
chr:start-stop and the signal file name (here 200477520001_R06C01.baflrr) are the only critical fields to specify, which makes the format easy to adapt to most CNV call output formats;
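
Since only the region and the signal file name are critical, calls from other CNV detection programs can be rewritten as rawcnv lines. The following is a minimal sketch, not part of this repo: the function name to_rawcnv_line is hypothetical, and it assumes that the remaining fields can reuse the placeholder values from the example line above and that fields are separated by single spaces.

```python
def to_rawcnv_line(chrom, start, stop, signal_file, state="state2,cn=1"):
    """Format one CNV call as a rawcnv line.

    Only the region and signal_file carry real information. The values
    numsnp=1, length=1, startsnp=a and endsnp=b are the placeholders
    from the README example line (an assumption, not a PennCNV rule).
    """
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    start, stop = int(start), int(stop)
    if start > stop:
        raise ValueError("start must not exceed stop")
    fields = [
        "{}:{}-{}".format(chrom, start, stop),
        "numsnp=1",
        "length=1",
        state,
        signal_file,
        "startsnp=a",
        "endsnp=b",
    ]
    return " ".join(fields)


if __name__ == "__main__":
    print(to_rawcnv_line(1, 1000, 2000, "200477520001_R06C01.baflrr"))
```

Writing one such line per call into a .rawcnv file should give visualize_cnv.pl the input it expects, provided the signal file name matches the file passed with -signal.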

Different Script Purposes and Iterations

The three .py files in the main directory and the script directory are the first-version implementations. visualize_cnv.pl is the exception: it is the same across all versions.
Array (Initial Implementation): ./blended_learning.py and ./script/run.py
Array (Published in Briefings in Bioinformatics Implementation): ./DeepCNVv2/blend.py and ./DeepCNVv2/blend_pred.py
Sequencing (Initial Implementation): ./predict.py
Sequencing (Published in Briefings in Bioinformatics Implementation): ./DeepCNV_Seqv2/train.py and ./DeepCNV_Seqv2/predict.py

The command line arguments differ between scripts. The Array scripts use blended learning, which combines JPGs carrying manual PASS/FAIL labels with CNV calling quality control metric values (metadata) taken from the summary lines of the PennCNV detect_cnv.pl log file. The Sequencing scripts do not use blended learning: they train only on JPGs with manual PASS/FAIL labels and then predict on additional JPGs. See the column Command (Argument Descriptions) below.

DeepCNV Python Scripts	InputDataType	Command (Argument Descriptions)	Comment
./blended_learning.py	Array	python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output (1st argument: JPG_dir,2nd argument: metadata_dir,3rd argument: model_name, 4th argument: result_name)	Train and Test New Model and output performance statistics
./script/run.py	Array	python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 (1st argument: image_dir, 2nd argument: metadata_dir, 3rd argument: output_dir, 4th argument: model_path)	Array Genotyping (Original)
./predict.py	Sequencing	python predict.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied, NOTE: model_name defined in code as model/best_model_1_1.hdf5 but could be modified)	Sequencing Data (BAF/LRR signal files for each sample compiled by PennCNV-Seq convert_map2signal.pl based on the 1KG CRAMs.)
./DeepCNVv2/blend.py	Array	python blend.py data/JPG/ data/ metadata.csv train_sample.csv val_sample.csv DeepCNVv2/best_model_0.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: metadata_folder, 3rd argument: metadata_file, 4th argument: train_id_file, 5th argument: val_id_file, 6th argument: saved_model_name, 7th argument: results_file)	Trains the model (version Cheng Zhong used for the BIB paper)
./DeepCNVv2/blend_pred.py	Array	python blend_pred.py data/JPG/ data/ metadata.csv val_sample.csv DeepCNVv2/best_model_0.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: metadata_folder, 3rd argument: metadata_file, 4th argument: val_id_file, 5th argument: saved_model_name, 6th argument: results_file)	Makes predictions with a pretrained model (version Cheng Zhong used for the BIB paper)
./DeepCNV_Seqv2/train.py	Sequencing	python train.py data/ DeepCNV_Seqv2/best_model_seq.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: saved_model_name, 3rd argument: results_file)	Trains the model (version Cheng Zhong used for the BIB paper)
./DeepCNV_Seqv2/predict.py	Sequencing	python predict.py data/ DeepCNV_Seqv2/best_model_seq.hdf5 res/res.csv (1st argument: img_folder, 2nd argument: saved_model_name, 3rd argument: results_file)	Makes predictions with a pretrained model (version Cheng Zhong used for the BIB paper)
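
Because the number and order of positional arguments change from script to script, it can help to assemble commands by argument name. The sketch below is illustrative only: SCRIPT_ARGS and build_command are hypothetical names, the argument names are copied from the table above, and the scripts are addressed by their repo-relative paths, which assumes they are launched from the repo root.

```python
# Positional argument names per script, in order, as listed in the table above.
SCRIPT_ARGS = {
    "blended_learning.py": ["JPG_dir", "metadata_dir", "model_name", "result_name"],
    "script/run.py": ["image_dir", "metadata_dir", "output_dir", "model_path"],
    "DeepCNVv2/blend.py": [
        "img_folder", "metadata_folder", "metadata_file", "train_id_file",
        "val_id_file", "saved_model_name", "results_file",
    ],
    "DeepCNVv2/blend_pred.py": [
        "img_folder", "metadata_folder", "metadata_file", "val_id_file",
        "saved_model_name", "results_file",
    ],
    "DeepCNV_Seqv2/train.py": ["img_folder", "saved_model_name", "results_file"],
    "DeepCNV_Seqv2/predict.py": ["img_folder", "saved_model_name", "results_file"],
}


def build_command(script, **values):
    """Return the argv list for a script, ordering values by argument name."""
    names = SCRIPT_ARGS[script]
    missing = [name for name in names if name not in values]
    extra = sorted(set(values) - set(names))
    if missing or extra:
        raise ValueError("missing: {}; unexpected: {}".format(missing, extra))
    return ["python", script] + [str(values[name]) for name in names]


if __name__ == "__main__":
    print(" ".join(build_command(
        "DeepCNV_Seqv2/predict.py",
        img_folder="data/",
        saved_model_name="DeepCNV_Seqv2/best_model_seq.hdf5",
        results_file="res/res.csv",
    )))
```

The returned list can be passed to subprocess.call unchanged, which avoids shell quoting problems with paths that contain spaces.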

Array vs. Sequencing based input impact on algorithm convolution blocks

This difference between Array with metadata (./DeepCNVv2/blend.py) and Sequencing without metadata (./DeepCNV_Seqv2/train.py) is why the two CNN architectures differ. The CNN part of the model described in the article (Array with metadata) has two convolution blocks with 64 feature layers followed by two more blocks with 128 feature layers, whereas the model in ./DeepCNV_Seqv2/train.py (Sequencing without metadata) has only one of these convolution blocks.
Optionally, if PennCNV-Seq convert_map2signal.pl is run on the BAMs/CRAMs of your samples to make BAF/LRR signal files for each sample, PennCNV detect_cnv.pl can then be run on the sequencing data to produce CNV calling quality control metric values (metadata) from the summary lines of its log file. DeepCNV blended learning could then be applied to the sequencing data in a way more similar to array data.

Image Dimensions

The JPG output from PennCNV visualize_cnv.pl is 900x900 pixels. We initially scaled it down to 300x300 to limit computational requirements and runtime.
In predict.py, DeepCNV_Seqv2/train.py, and DeepCNV_Seqv2/predict.py:
target_size = (300, 300)
After subsequent code optimization and minimization of the hdf5 model file size, downsizing was no longer needed, as shown in blended_learning.py, DeepCNVv2/blend.py, and DeepCNVv2/blend_pred.py:
dim=(900,900)
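
To give a rough sense of what the downscaling saves, the arithmetic below compares the number of input values per image at the two sizes. It is a back-of-the-envelope sketch that assumes 3-channel RGB input (the channel count is not stated in this README) and says nothing about the memory used by the network layers themselves.

```python
def input_values(width, height, channels=3):
    """Number of values fed to the network for one image (channels assumed RGB)."""
    return width * height * channels


full = input_values(900, 900)   # dim=(900,900)
small = input_values(300, 300)  # target_size = (300, 300)

print("900x900 input values:", full)
print("300x300 input values:", small)
print("reduction factor:", full // small)
```

Under that assumption, each 300x300 image carries one ninth of the input values of a 900x900 image.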

Run script

Download the pre-trained model file from this link;
Download the script folder;
Copy the model file into the script folder;
Enter the script folder from a terminal;
Check the package requirements. Different package versions may generate different results;
Create the output folder with mkdir output;
Run python run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5;
Check the results in the output folder.
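
The steps above can be partly automated with a small pre-flight check. This is a hedged sketch rather than a script shipped with the repo: prepare_run is a hypothetical helper that verifies the inputs exist, creates the output folder, and returns the run.py command from the list above without executing it.

```python
import os


def prepare_run(image_dir, metadata_csv, output_dir, model_path):
    """Check inputs, create the output folder, and return the run.py command."""
    for path in (image_dir, metadata_csv, model_path):
        if not os.path.exists(path):
            raise IOError("missing input: " + path)
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)  # same effect as `mkdir output`
    return ["python", "run.py", image_dir, metadata_csv, output_dir, model_path]


if __name__ == "__main__":
    try:
        command = prepare_run(
            "./data/JPG", "./data/samples.csv", "./output", "./DeepCNV.hdf5"
        )
        print(" ".join(command))
    except IOError as error:
        print(error)
```

Run from inside the script folder, it prints the command to execute, or names the first missing input if the model file or data have not been copied yet.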

Package Requirements

python 2.7.12
pandas 0.17.1
numpy 1.11.0
tensorflow 1.12.0
keras 2.2.4
cv2 2.4.9.1
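
Since different package versions may produce different results, it can be useful to compare an environment against the list above before running the scripts. The sketch below is one possible way to do that: REQUIRED copies the versions listed above, while installed_versions and version_mismatches are hypothetical helper names, and reading the __version__ attribute is an assumption that holds for these packages but not for every Python module.

```python
import importlib

# Versions copied from the list above, keyed by import name.
REQUIRED = {
    "pandas": "0.17.1",
    "numpy": "1.11.0",
    "tensorflow": "1.12.0",
    "keras": "2.2.4",
    "cv2": "2.4.9.1",
}


def installed_versions(names):
    """Map each import name to its __version__, or None if unavailable."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", None)
    return found


def version_mismatches(installed, required=REQUIRED):
    """Return {name: (installed, required)} for packages that differ or are missing."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in required.items()
        if installed.get(name) != wanted
    }
```

Calling version_mismatches(installed_versions(REQUIRED)) inside the target environment returns an empty dict when every package matches, and otherwise lists each differing package with its installed and required version.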

Download Miniconda2 installer

wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh

Install in Silent Mode

bash Miniconda2-latest-Linux-x86_64.sh -b -p $HOME/miniconda2

Install required python libraries

pip install pandas numpy tensorflow==1.12.0 keras==2.2.4 opencv-python

Alternative: install required python libraries with pinned versions (to prevent future version incompatibility)

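pip install pandas==0.17.1 numpy==1.11.0 tensorflow==1.12.0 keras==2.2.4 cv2==2.4.9.1

Note: cv2 is the import name of OpenCV. The pip package used in the command above is opencv-python, so the cv2==2.4.9.1 pin may need to be replaced with a matching opencv-python pin if pip cannot resolve it.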

DeepCNV Python Script	DeepCNV hdf5 Model File	Command (Full)	Command (Short)	Comment	Date Modified
run_DeepCNV_3.py	Joe_Batch1To6_model.h5 (also named DeepCNV.hdf5)	python run_DeepCNV_3.py ./input_x10 ./input_x10_output	python run_DeepCNV_3.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied)	Array Genotyping (Original)	9/4/2019
predict.py	model/best_model_1_1.hdf5 (also named DeepCNVSeq.hdf5) (model_name defined in code)	python predict.py ./input_x10 ./input_x10_output (Put positive images in input_x10/1 and negative images in input_x10/0)	python predict.py ./input_x10 ./input_x10_output (1st argument: folder with input JPGs generated by visualize_cnv.pl, 2nd argument: output folder where res.csv is generated with pos and neg folders where corresponding JPGs are copied)	Sequencing Data (BAF/LRR signal files for each sample compiled by PennCNV-Seq convert_map2signal.pl based on the 1KG CRAMs.)	8/7/2020
script/run.py	DeepCNV.hdf5	python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5	python script/run.py ./data/JPG ./data/samples.csv ./output ./DeepCNV.hdf5 (1st argument: image_dir, 2nd argument: metadata_dir, 3rd argument: output_dir, 4th argument: model_path)	.	11/22/2019
blended_learning.py	NA	python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output	python blended_learning.py ./data/JPG ./data/samples.csv ./DeepCNV.hdf5 ./output (1st argument: JPG_dir,2nd argument: metadata_dir,3rd argument: model_name, 4th argument: result_name)	.	11/22/2019
DeepCNVv2/blend.py	DeepCNVv2/best_model_0.hdf5	python blend.py data/JPG/ data/ metadata.csv train_sample.csv val_sample.csv res/bmodel.hdf5 res/res.csv	python blend.py img_folder metadata_folder metadata_file train_id_file val_id_file saved_model_name results_file	version Cheng Zhong used for the BIB paper	11/7/2020
DeepCNVv2/blend_pred.py	DeepCNVv2/best_model_0.hdf5	python blend_pred.py data/JPG/ data/ metadata.csv val_sample.csv best_model_0.hdf5 res/res.csv	python blend_pred.py img_folder metadata_folder metadata_file val_id_file saved_model_name results_file	version Cheng Zhong used for the BIB paper	11/6/2020
DeepCNV_Seqv2/train.py	DeepCNV_Seqv2/best_model_seq.hdf5	python train.py data/ res/model.hdf5 res/res.csv	python train.py img_folder saved_model_name results_file	version Cheng Zhong used for the BIB paper	11/9/2020
DeepCNV_Seqv2/predict.py	DeepCNV_Seqv2/best_model_seq.hdf5	python predict.py data/ res/model.hdf5 res/res.csv	python predict.py img_folder saved_model_name results_file	version Cheng Zhong used for the BIB paper	11/8/2020

Model File	Size	Month	Day	Year
Joe_Batch1To6_model.h5	65M	Apr	8	2019
Batch4_train_on_all.h5	149M	Nov	20	2018
Batch4_2.h5	149M	Nov	20	2018
DeepCNVSeq.hdf5	4.8M	Aug	7	2020
model6.h5	149M	Sep	13	2018
best_model_0.hdf5	19M	Nov	6	2020
best_model_seq.hdf5	3.2M	Mar	26	2020

NTNguyen13/DeepCNV
