
samplot-ml

Training Data Workflow

1000 Genomes high-coverage CRAMs

Download the CRAM indices (.crai) listed in the high-coverage alignment index (see the sketch after the note below): ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.high_coverage.GRCh38DH.alignment.index

  • Aligned to GRCh38 reference genome
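
One way to fetch the indices, sketched below: download the alignment index, then pull each .crai it references. The grep pattern and the FTP URL prefix are assumptions about the index layout, not taken from the repo.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.high_coverage.GRCh38DH.alignment.index
# assumes the index lists .crai paths relative to the FTP root
grep -o '[^[:space:]]*\.crai' 1000genomes.high_coverage.GRCh38DH.alignment.index \
    | while read -r crai; do
        wget "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/$crai"
      done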

Get the SV callset VCF (using GRCh38 ref)

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.integrated_sv_map_v1_GRCh38.20130502.svs.genotypes.vcf.gz

Extract DEL/Non-DEL regions for training set

# sample_del.py is from this repo
bcftools view -i 'SVTYPE="DEL"' $VCF \
    | bcftools query -f '%CHROM\t%POS\t%INFO/END[\t%SAMPLE,%GT]\n' \
    | python sample_del.py > $OUT/del.sample.bed
  • Outputs a BED file annotated with sample and genotype (REF, HET, ALT); an illustrative line is shown after this list
  • TODO make the variables command line args with flags
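
For illustration, each line of del.sample.bed carries a region plus the sampled genotype (the values below are made up):

# chrom   start     end       sample    genotype
chr2      10000     12000     HG00096   HET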

Generate training images

gen_img.sh

cat $BED_FILE | gargs -p $PROCESSES \
 "bash gen_img.sh \\
     --chrom {0} --start {1} --end {2} --sample {3} --genotype {4} \\
     --fasta $FASTA \\
     --bam-list $CRAM_LIST \\
     --bam-dir $CRAM_INDEX_DIR \\
     --out-dir $OUT_DIR/imgs"
  • Use gargs to parse the training-regions BED and feed each line to gen_img.sh
  • $CRAM_LIST is a list of CRAMs (despite the --bam-list flag name); entries can be absolute file paths or URLs (the indices must be downloaded locally either way)
  • $CRAM_INDEX_DIR (passed to --bam-dir) is the directory containing the CRAM indices
  • TODO: upload the CRAM list of URLs to GitHub
  • TODO: check whether a different GRCh38 than "full_analysis_set_plus_decoy_hla" was used

Crop images to remove the surrounding text/axes

bash crop.sh \
    --processes $NUM_PROCESSES \
    --data-dir $DATA_DIR
  • Where $DATA_DIR is the parent directory containing the imgs/ directory from the previous step
  • Cropped images will be placed in $DATA_DIR/crop

Split into train/val sets (file listings)

  • TODO (a possible approach is sketched below)
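
A minimal sketch of one way to do the split, assuming the cropped images from the previous step live in $DATA_DIR/crop and a 90/10 train/val split is wanted:

# shuffle the cropped image list and hold out ~10% for validation
find $DATA_DIR/crop -name '*.png' | shuf > $DATA_DIR/all.txt
N_VAL=$(( $(wc -l < $DATA_DIR/all.txt) / 10 ))
head -n $N_VAL $DATA_DIR/all.txt > $DATA_DIR/val.txt
tail -n +$(( N_VAL + 1 )) $DATA_DIR/all.txt > $DATA_DIR/train.txt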

Training Procedure

We use the run.py script to train a new model:

usage: run.py train [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
                    [--model-type MODEL_TYPE] --data-dir DATA_DIR
                    [--learning-rate LR] [--momentum MOMENTUM]
                    [--label-smoothing LABEL_SMOOTHING] [--save-to SAVE_TO]

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE, -b BATCH_SIZE
                        Number of images to feed to model at a time. (default:
                        80)
  --epochs EPOCHS, -e EPOCHS
                        Max number of epochs to train model. (default: 100)
  --model-type MODEL_TYPE, -mt MODEL_TYPE
                        Type of model to train. (default: CNN)
  --data-dir DATA_DIR, -d DATA_DIR
                        Root directory of the training data. (default: None)
  --learning-rate LR, -lr LR
                        Learning rate for optimizer. (default: 0.0001)
  --momentum MOMENTUM, -mom MOMENTUM
                        Momentum term in SGD optimizer. (default: 0.9)
  --label-smoothing LABEL_SMOOTHING, -ls LABEL_SMOOTHING
                        Strength of label smoothing (0-1). (default: 0.0)
  --save-to SAVE_TO, -s SAVE_TO
                        filename if you want to save your trained model.
                        (default: None)
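
For example, to train the default CNN on the prepared data directory with the defaults shown above (the output filename is illustrative):

python run.py train \
    --data-dir $DATA_DIR \
    --batch-size 80 \
    --epochs 100 \
    --learning-rate 0.0001 \
    --save-to samplot-ml.model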

Test Data Workflow

HG002 data processing

Get CRAM/index

  • TODO

Genotype with smoove

  • TODO (presumably the same smoove call used for the other samples below applies here, with --name HG002)

Get VCF and tier 1 regions BED

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed

Filter DELs with bcftools view
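
Following the same pattern used for the other truth sets below (the output filename is illustrative):

bcftools view -i 'SVTYPE="DEL"' HG002_SVs_Tier1_v0.6.vcf.gz \
    | bgzip -c > HG002_SVs_Tier1_v0.6.del.vcf.gz
tabix HG002_SVs_Tier1_v0.6.del.vcf.gz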

HG00514, HG00733, NA19240 data processing

Get CRAMs/indices

HG00514

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/CHS/HG00514/high_cov_alignment/

HG00733

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/PUR/HG00733/high_cov_alignment/

NA19240

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/YRI/NA19240/high_cov_alignment/

Get truth set VCFs/indices

ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/genotype/nstd152

Filter DELs with bcftools view and Fix VCFs

  • Remove length-0 contigs (they otherwise cause problems with truvari)
  • Run the fix_vcf.py script to correct the END field
    • For some reason the %INFO/END field is just start + 1, so the true end has to be computed from SVLEN.
bcftools view -i 'SVTYPE="DEL"' $TRUTH_SET_VCF \
    | grep -v "length=0>" \
    | python fix_vcf.py \
    | bgzip -c > $FIXED_TRUTH_SET
tabix $FIXED_TRUTH_SET
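
Conceptually, the END fix amounts to the following awk sketch (fix_vcf.py in the repo is the actual implementation; this only illustrates the idea and assumes END is not the first INFO field):

# sketch only: recompute END as POS + |SVLEN| in each record
zcat $TRUTH_SET_VCF | awk 'BEGIN { OFS="\t" }
    /^#/ { print; next }
    {
        if (match($8, /SVLEN=-?[0-9]+/)) {
            svlen = substr($8, RSTART + 6, RLENGTH - 6)
            if (svlen < 0) svlen = -svlen
            sub(/;END=[0-9]+/, ";END=" ($2 + svlen), $8)
        }
        print
    }'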

Genotype with smoove (annotated with duphold) to get baseline VCF

Use the following command, where $SAMPLE_NAME is the sample being genotyped (e.g. HG00514):

smoove call \
    --outdir $OUT_DIR \
    --processes $PROCESSES \
    --name $SAMPLE_NAME \
    --exclude $BED_DIR/exclude.cnvnator_100bp.GRCh38.20170403.bed \
    --fasta $FASTA \
    --removepr \
    --support $SUPPORT \
    --genotype \
    --duphold \
    $CRAM_PATH

You can get the exclude-regions BED for GRCh38 (exclude.cnvnator_100bp.GRCh38.20170403.bed) from the smoove documentation

Use the GRCh38_full_analysis_set_plus_decoy_hla.fa reference genome

FASTA and index:

s3://1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
s3://1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai
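
Both can be fetched from the public 1000 Genomes bucket with the AWS CLI (public data, so no credentials are needed):

aws s3 cp --no-sign-request s3://1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa .
aws s3 cp --no-sign-request s3://1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai .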

Generate images

From the smoove-generated VCF, extract just the DELs

bcftools view -i 'SVTYPE="DEL"' $SAMPLE-smoove.genotyped.vcf.gz \
    | bgzip -c > $SAMPLE-smoove.genotyped.del.vcf.gz

Convert the VCF to BED fields with bcftools query, pipe to gargs, and call gen_img.sh

bcftools query -f '%CHROM\t%POS\t%INFO/END[\t%SAMPLE\t%GT]\n' \
    $SAMPLE-smoove.genotyped.del.vcf.gz  | gargs -p $PROCESSES \
    "bash gen_img.sh \\
        --chrom {0} --start {1} --end {2} --sample $SAMPLE --genotype DEL \\
        --fasta $FASTA \\
        --bam-dir $PATH_TO_CRAM \\
        --out-dir $OUT_DIR/imgs"

Crop images

bash crop.sh \
    --processes $NUM_PROCESSES \
    --data-dir $DATA_DIR
  • Where $DATA_DIR is the parent directory containing the imgs/ directory from the previous step
  • Cropped images will be placed in $DATA_DIR/crop

Create file listing for images

cd $SAMPLE_DIR # parent directory of cropped images
find $(pwd)/crop/*.png > $IMAGE_LIST

Filter using duphold annotations

DHFFC is duphold's depth fold-change relative to the flanking regions; a true deletion should show a drop in read depth, so only calls with DHFFC < 0.7 are kept.

bcftools view -i 'DHFFC<0.7' $BASELINE_VCF | bgzip -c > dhffc.lt.0.7.vcf.gz
tabix dhffc.lt.0.7.vcf.gz

Filter with CNN model

# $BASELINE_VCF is the smoove-genotyped VCF
bash create_test_vcfs.sh \
    --model-path $MODEL_PATH \
    --data-list $IMAGE_LIST \
    --vcf $BASELINE_VCF \
    --out-dir $OUT_DIR

Run truvari on the baseline, duphold, and CNN VCFs

bash truvari.sh \
    --comp-vcf $COMP_VCF \
    --base-vcf $TRUTH_SET_VCF \
    --reference $REF \
    --out-dir $OUT_DIR

Results/Analysis

Truvari statistics

SVLEN >= 300

HG002 (Ashkenazim) with tier 1 | Smoove | DHFFC | CNN   | manta | DHFFC | CNN
------------------------------ | ------ | ----- | ----- | ----- | ----- | -----
TP                             | 1496   | 1488  | 1489  |       |       |
FP                             | 83     | 33    | 62    |       |       |
FN                             | 276    | 284   | 283   |       |       |
Precision                      | 0.947  | 0.978 | 0.96  |       |       |
Recall                         | 0.844  | 0.840 | 0.840 |       |       |
F1                             | 0.893  | 0.904 | 0.896 |       |       |
FP Intersection                |        |       |       |       |       |

HG002 (Ashkenazim) without tier 1 | Smoove | DHFFC | CNN   | manta | DHFFC  | CNN
--------------------------------- | ------ | ----- | ----- | ----- | ------ | -----
TP                                | 1787   | 1764  | 1758  | 1708  | 1687   | 1706
FP                                | 452    | 276   | 273   | 265   | 175    | 187
FN                                | 893    | 916   | 922   | 972   | 993    | 981
Precision                         | 0.798  | 0.865 | 0.866 | 0.866 | 0.906  | 0.901
Recall                            | 0.667  | 0.658 | 0.656 | 0.637 | 0.629  | 0.634
F1                                | 0.727  | 0.747 | 0.746 | 0.734 | 0.7428 | 0.744
FP Intersection                   |        |       |       |       |        |

HG00514 (Han Chinese) | Lumpy | DHFFC | CNN   | manta | DHFFC | CNN
--------------------- | ----- | ----- | ----- | ----- | ----- | -----
TP                    | 1860  | 1837  | 1803  | 1779  | 1759  | 1751
FP                    | 860   | 596   | 372   | 502   | 328   | 221
FN                    | 858   | 881   | 915   | 939   | 959   | 967
Precision             | 0.684 | 0.755 | 0.829 | 0.780 | 0.843 | 0.888
Recall                | 0.684 | 0.676 | 0.663 | 0.654 | 0.647 | 0.644
F1                    | 0.684 | 0.713 | 0.737 | 0.712 | 0.731 | 0.747

HG00514 (Sensitive) | Lumpy | DHFFC | CNN
------------------- | ----- | ----- | -----
TP                  | 1875  | 1851  | 1814
FP                  | 1157  | 761   | 477
FN                  | 843   | 867   | 904
Precision           | 0.618 | 0.709 | 0.792
Recall              | 0.690 | 0.681 | 0.667
F1                  | 0.652 | 0.695 | 0.724

HG00733 (Puerto Rican) | Smoove | DHFFC | CNN   | manta | DHFFC | CNN
---------------------- | ------ | ----- | ----- | ----- | ----- | -----
TP                     | 1236   | 1216  | 1181  | 1774  | 1753  | 1736
FP                     | 1066   | 760   | 517   | 455   | 306   | 204
FN                     | 1505   | 1525  | 1560  | 967   | 988   | 1005
Precision              | 0.537  | 0.615 | 0.696 | 0.796 | 0.851 | 0.895
Recall                 | 0.451  | 0.443 | 0.431 | 0.647 | 0.640 | 0.633
F1                     | 0.490  | 0.520 | 0.532 | 0.714 | 0.730 | 0.742

HG00733 (Sensitive) | Lumpy | DHFFC | CNN
------------------- | ----- | ----- | -----
TP                  | 1277  | 1255  | 1219
FP                  | 1422  | 968   | 676
FN                  | 1464  | 1486  | 1522
Precision           | 0.473 | 0.564 | 0.643
Recall              | 0.466 | 0.460 | 0.445
F1                  | 0.469 | 0.506 | 0.526

NA19240 (Yoruban) | Lumpy | DHFFC | CNN   | manta | DHFFC | CNN
----------------- | ----- | ----- | ----- | ----- | ----- | -----
TP                | 1494  | 1470  | 1414  | 2067  | 2054  | 2019
FP                | 1070  | 801   | 628   | 520   | 359   | 272
FN                | 1711  | 1735  | 1791  | 1138  | 1151  | 1186
Precision         | 0.583 | 0.647 | 0.692 | 0.799 | 0.851 | 0.881
Recall            | 0.466 | 0.459 | 0.441 | 0.645 | 0.641 | 0.630
F1                | 0.518 | 0.537 | 0.549 | 0.714 | 0.731 | 0.735

NA19240 (Sensitive) | Lumpy | DHFFC | CNN
------------------- | ----- | ----- | -----
TP                  | 1542  | 1518  | 1466
FP                  | 1427  | 1023  | 804
FN                  | 1663  | 1687  | 1739
Precision           | 0.519 | 0.597 | 0.646
Recall              | 0.481 | 0.474 | 0.457
F1                  | 0.500 | 0.528 | 0.536

PR Curves

HG002

./figures/HG002-notier1-pr.png

HG00514

./figures/HG00514-pr.png

HG00733

./figures/HG00733-pr.png

NA19240

./figures/NA19240-pr.png

Duphold False Positive DHFFC score histograms

HG002

./figures/HG002-duphold-fp-DHFFC-dist.png

HG00514

./figures/HG00514-duphold-fp-DHFFC-dist.png

HG00733

./figures/HG00733-duphold-fp-DHFFC-dist.png

NA19240

./figures/NA19240-duphold-fp-DHFFC-dist.png

CNN False Positive 0/0 prediction score histograms

HG002

./figures/HG002-notier1-pred-dist.png

HG00514

./figures/HG00514-pred-dist.png

HG00733

./figures/HG00733-pred-dist.png

NA19240

./figures/NA19240-pred-dist.png

Agreement between SV-plaudit and the CNN using NA12878 from 1000 Genomes

  • TODO write this up in more detail
    • i.e., data sources, how to generate the images, etc.

Agreement with SV-plaudit

bedtools intersect -wb -f 1.0 -r -a $svplaudit_bed -b $pred_bed \
| python3 svplaudit_agreement.py
  • For regions with SV-plaudit scores < 0.2 or > 0.8 we have ~97% agreement
  • If we don't filter out the ambiguous regions, we have ~93% agreement.

Agreement with the original 1000 Genomes calls

bcftools view -i 'SVTYPE="DEL"' $NA12878_callset \
| bcftools query -f '%CHROM\t%POS\t%INFO/END\n' \
| bedtools intersect -wb -f 1.0 -r -a stdin -b $pred_bed \
| python3 vcf_agreement.py
  • For the original callset we have ~93% agreement.

TODO list

Run lumpy/manta with increased sensitivity

Get the baseline performances with truvari

Create images and run model on new dataset

Get model performance on the increased sensitivity models

Run model on sensitive dataset without SVTYPER

Analyze the false positives/false negatives

Visualize the FP/FN along with their predictions.

Does anything like that appear in the training set?

If not, then how can we modify the training set to include such data?

Fix Chaisson VCFs so that they work with newer truvari

Make sure results haven’t changed

Take a closer look at the types of FPs made by model/duphold

# of definite FP

# of ambiguous FP

# of FP that look like TP

Of these, how many are because they are missing from the truth sets?

And how many are due to imprecise breakpoints?

  • Can this be rectified?

Old TODO

Size distribution of duphold/CNN FP/FN

  • Just analyze the FNs made by duphold/CNN but not smoove

Size distributions over the truth sets

Intersection/difference stats of duphold/CNN/smoove TP/FP/FN

  • i.e., does duphold/CNN make largely the same or different mistakes?

Grad-CAM visualizations of the set of TP/FP/FN

  • Just make a representative sample of true positives
  • Again, for FNs just do the ones not made by smoove

Distribution of score values for TP/FP (CNN and duphold)

Duphold FP DHFFC score distribution

CNN FP DHFFC score distribution

CNN FP prediction score distribution

HG002 with/without tier 1 regions

Figure out whether tier 1 regions can be applied to other genomes

  • If yes, then do it. (It can't.)

PR-curves

Label the ancestry of the genomes

Train another low-coverage model later and do some testing
