Commit

update main(version issue)
Chong Chu committed May 4, 2022
1 parent 1239488 commit dfc245b
Showing 1 changed file with 5 additions and 201 deletions.
206 changes: 5 additions & 201 deletions xtea/x_TEA_main.py
@@ -2,130 +2,20 @@
##@@author: Simon (Chong) Chu, DBMI, Harvard Medical School
##@@contact: [email protected]
##
'''
Upgrade 11/04/2018
1. Put all the public shared variables into a single file, global_variables.py
2. Clip and disc now have two different mapping-quality cutoffs
3. Keep the short poly-A clipped reads in the "clip" step
4. Change the alignment of clipped reads to repeat copies to a "two-stage" alignment
5. Popen.communicate() has a potential bug when the "stdout" output is large. The output is now redirected to a file
instead of stdout (see the sketch after this list)
6. Add the gene annotation information to each candidate
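A minimal sketch of the redirection in item 5 (the command line and file names below are placeholders, not
taken from this module):
    import subprocess
    # write the large alignment output straight to a file on disk and parse the file afterwards,
    # instead of buffering everything through Popen.communicate()
    with open("realigned.sam", "w") as fout:
        subprocess.check_call(["bwa", "mem", "rep_copies.fa", "clipped_reads.fq"], stdout=fout)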
####11/12/2018 --solved on 11/14/2018
Bug to fix: when there are several bams for one individual, the pub_folder links may have the same file name,
which is not correct.
####
####@@CC 01/03/2018
Todo List: 1. The clip step requires lots of memory: 150G for 10 cores. This should be improved.
Update 01/09/2018: the issue is caused in the "ClipReadInfo::cnt_clip_part_aligned_to_rep()" function.
Solution:
1) Skip the load-everything-into-memory step; instead, only load the reads that have "left (right) realigned"
parts. This is the version currently used.
2) For the re-aligned sam file, parse out the chrm and positions, write them to a file, sort the file, and
then count the number of supporting reads (see the sketch below). This version should be more memory
efficient, but hasn't been fully tested.
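A rough sketch of option 2), assuming the realigned reads sit in a plain sam file (all file names below are
illustrative, not the actual intermediate files of this pipeline):
    import subprocess
    from itertools import groupby
    import pysam

    # 1) dump (chrm, position) for every aligned read to a small text file
    with pysam.AlignmentFile("realigned.sam") as sam, open("pos.txt", "w") as fout:
        for rd in sam.fetch(until_eof=True):
            if not rd.is_unmapped:
                fout.write("{0}\t{1}\n".format(rd.reference_name, rd.reference_start))
    # 2) sort on disk instead of holding everything in memory
    subprocess.check_call(["sort", "-k1,1", "-k2,2n", "-o", "pos_sorted.txt", "pos.txt"])
    # 3) count the supporting reads per (chrm, position)
    with open("pos_sorted.txt") as fin:
        for (chrm, pos), grp in groupby((ln.rstrip("\n").split("\t") for ln in fin),
                                        key=lambda flds: (flds[0], flds[1])):
            n_support = sum(1 for _ in grp)  # number of realigned reads supporting this position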
####
09/10/2018
Todo List:
1. For orphan (or partial) transduction events:
1) In the "clip" step, change "mate-in-rep" to "mate-is-discordant"
3) In the "disc" step, change "mate-in-rep" to "mate-is-discordant", then split into two outputs:
3. Output to a vcf file
10/10/2018
Trio checking module:
Need to check whether the clipped parts are the same or not!
11/06/2018:
Add a coverage estimation module (need to sum up the coverage of the different bams). --To do list
Add an automatic cutoff-setting module that is associated with read depth (solved 01/05/2019; a sketch of
the mapping is shown after this block), e.g.:
30-40X: clip:3, disc:5
50-60X: clip:4, disc:6
70-100X: clip:5, disc:8
Also need to set a maximum coverage cutoff (the current one is in the filtering step, default set as 8*30=240X).
Similarly, check_disc_sample_consistency() should use an array of cutoff values.
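A possible shape for that depth-dependent cutoff table (the function name and the boundaries between the
listed ranges are assumptions for illustration, not part of this module):
    def gnrt_cutoffs_by_depth(f_cov):
        # map an estimated read depth to (clip, disc) cutoffs, following the example values above
        if f_cov < 45:
            return 3, 5   # ~30-40X
        elif f_cov < 65:
            return 4, 6   # ~50-60X
        else:
            return 5, 8   # ~70-100X and above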
11/20/2018 Two major modules now affect the speed a lot, especially when there are lots of clipped candidate sites (for 10X):
1. Aligning the clipped reads to repeat copies. (Already improved, but still lots of room.)
2. Checking discordant reads (second step); the current module is split by chrm and still has room to improve!
To do list:
12/11/2018 (solved)
Check the phred scores of the clipped parts, and if the score is low, discard the clipped part (a sketch follows below).
#Q: how to set the cutoff for this value? By default set to 15
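A minimal sketch of that check with pysam (the helper name and arguments are illustrative; the default
cutoff of 15 follows the note above):
    def is_low_qual_clip(read, n_clip_len, b_left=True, min_phred=15):
        # mean phred score of the clipped part of a pysam AlignedSegment
        quals = read.query_qualities
        if quals is None or n_clip_len <= 0:
            return False
        clip_quals = quals[:n_clip_len] if b_left else quals[-n_clip_len:]
        return (float(sum(clip_quals)) / len(clip_quals)) < min_phred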
12/12/2018 (solved on 12/31/18)
Collect all the background information
#nearby un-related clipped reads
#nearby un-related or low mapping quality discordant reads
#fully mapped reads at the breakpoint
#concordant reads near the breakpoint
##reads containing small indels or lots of mismatches... (how to calculate this?)
##
12/26/2018 (solved on 12/31/18)
Revise: the polyA check is now changed to only consider sub-clipped reads
####
12/31/18
Add one option for cleaning tmp files.
01/31/2019
Potential issue:
1. When reads are long, say 250bp, but the insertion is short, like 200bp, then there will be no (or only a small number of) discordant reads.
2. This needs to be solved when adding the long-read module.
02/28/2019
Improvement:
1. Define high-confidence regions (discordant reads aligned to these regions will not be counted)
2. Add repeatmasker annotation in the final output
3. If one site is called as both Alu and SVA (high confidence), then select SVA
4. When checking whether the neighboring region is a coverage island or not, count all reads, not only those with high mapq.
5. Check the TSD assembly? To do further filtering.
For sites that have clipped-read support on both sides, the left/right breakpoint should be supported by left/right clipped reads only
6. Check the standard deviation for breakpoint clipped reads (maybe also the discordant reads?), especially for 10X data
03/24/2019
Improvement:
For cram format, if we don't need to parse the quality and seq fields, then skipping those fields can improve the speed:
See here: https://brentp.github.io/post/cram-speed/
    bam_list.append(pysam.AlignmentFile(b, mode='rc', reference_filename=ref_fasta,
                                        format_options=["required_fields=7167"]))
https://gist.github.com/brentp/213f214dd5677dd447150e62e2e360f5
04/17/19 (done, need to collect further information)
Collect the left-disc-rc/non-rc and right-disc-rc/non-rc features and save them in the cns file;
Note: these are from the aligned reads; also need to check the unaligned ones
####
Also need to collect the direction of the background ones!
####

####
04/17/19 Add the post filtering module (done)
####
04/15/21 Add the de novo mode
####
'''
####
import os
from shutil import copyfile
import global_values
from x_TEI_locator import *
#from x_TEI_source_tracer import *
from x_local_assembly import *
from x_intermediate_sites import *
from x_reference import *
from x_clip_disc_filter import *
from x_somatic_calling import *
#from x_analysis import *
from optparse import OptionParser
from x_reads_collection import *
from x_mutation import *
from x_TE_associated_sv import *
from x_gene_annotation import *
from x_genotype_feature import *
from x_basic_info import *
@@ -134,13 +24,10 @@
from x_mosaic_calling import *
from x_joint_calling import *
from x_igv import *
from x_BamSnap import *
from x_gvcf import *
from x_genotype_classify_sklearn import *
from x_genotype_classify import *
from x_orphan_transduction import *
from x_denovo_calling import *
from x_short_read_MEI_assembly import *
#
####
##parse the options
@@ -247,21 +134,12 @@ def parse_option():
parser.add_option("--igv",
action="store_true", dest="igv", default=False,
help="Prepare screenshot command for given sites")
parser.add_option("--bamsnap",
action="store_true", dest="bamsnap", default=False,
help="Prepare screenshot command for given sites")
parser.add_option("--force",
action="store_true", dest="force", default=False,
help="Force to start from the very beginning")
parser.add_option("--case_control",
action="store_true", dest="case_control", default=False,
help="case control mode")
parser.add_option("--denovo",
action="store_true", dest="denovo", default=False,
help="de novo insertion calling from trio")#
parser.add_option("--hard",
action="store_true", dest="hard", default=False,
help="This is hard-cut for fitering out coverage abnormal candidates")
parser.add_option("--tumor",
action="store_true", dest="tumor", default=False,
help="Working on tumor samples")
@@ -302,9 +180,7 @@ def parse_option():
parser.add_option("--single_sample",
action="store_true", dest="single_sample", default=False,
help="For single sample (like igv screenshot)")
parser.add_option("--sr_asm",
action="store_true", dest="short_read_assembly", default=False,
help="TE insertion assembly from short reads")

#
####
parser.add_option("-i", "--input", dest="input", default="",
@@ -914,46 +790,6 @@ def adjust_cutoff_tumor(ncutoff=-1, i_adjust=1):
ccm = CaseControlMode(sf_ref, s_working_folder, n_jobs)
ccm.parse_high_confident_somatic(sf_candidate_list, sf_raw_somatic, sf_output)
####
####from trios or quads, call de novo insertions
#####this assumes we have already run on the proband
elif options.denovo: # de novo mode: call de novo insertions from a trio (or quad)
sf_candidate_cns = options.input # this is the list called from case
sf_candidate_cns2 = options.input2 # this is the list called from case
sf_output = options.output
s_working_folder = options.wfolder

sf_bam_list = options.bam # this is the control bam file list (each bam in one line)
sf_ref = options.ref # reference genome
n_jobs = options.cores
nclip_cutoff = options.cliprep # this is the sum of left and right clipped reads
ndisc_cutoff = options.ndisc # each sample should have at least this number of discordant reads
sf_rep_cns = options.reference ####repeat copies/cns here
sf_flank = options.fflank # this is the flanking region
i_flk_len = options.flklen

b_force=True
rcd=None
basic_rcd=None
n_polyA_cutoff=0
if b_automatic==True:
rcd, basic_rcd=automatic_gnrt_parameters_case_control(sf_bam_list, sf_ref, s_working_folder, n_jobs, b_force)
nclip_cutoff=rcd[0]
ndisc_cutoff=rcd[1]
n_polyA_cutoff=rcd[2]

ccm=DenovoMode(sf_ref, s_working_folder, n_jobs)
rlth = basic_rcd[1] # read length
mean_is = basic_rcd[2] # mean insert size
global_values.set_insert_size(int(mean_is))
std_var = basic_rcd[3] # standard deviation
max_is = int(mean_is + 3 * std_var) + int(rlth)
extnd = max_is
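# e.g., with hypothetical values mean_is=450, std_var=50 and rlth=150 (not taken from this data),
# max_is = 450 + 3*50 + 150 = 750, so reads are considered within an extended window of ~750bp around each site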
bin_size = 50000000 # block size for parallelization
print("clip,disc,polyA-cutoff is ({0}, {1}, {2})".format(nclip_cutoff, ndisc_cutoff, n_polyA_cutoff))
n_polyA_cutoff=ndisc_cutoff #if both sides have more polyA reads than the cutoff, then filter out
ccm.call_denovo_TE_insertions(sf_bam_list, sf_candidate_cns, sf_candidate_cns2, extnd, nclip_cutoff, ndisc_cutoff,
n_polyA_cutoff, sf_rep_cns, sf_flank, i_flk_len, bin_size, sf_output)
####

####
elif options.mosaic: # this is only for normal illumina data
@@ -1148,14 +984,10 @@ def adjust_cutoff_tumor(ncutoff=-1, i_adjust=1):
i_extnd = options.extend #"-e", "--extend"
b_single_sample=options.single_sample

if options.bamsnap==True:
x_bamsnap=XBamSnap()
x_bamsnap.prepare_bamsnap_scripts_multi_bams(sf_sites, sf_bam_list, s_screenshot_folder, i_extnd, sf_gnm, sf_out)
else:
x_igv = XIGV()#
if b_single_sample==True:
sf_sites=x_igv.gnrt_sites_single_sample(sf_sites, sf_bam_list)
x_igv.prepare_igv_scripts_multi_bams(sf_sites, sf_bam_list, s_screenshot_folder, i_extnd, sf_gnm, sf_out)
x_igv = XIGV()#
if b_single_sample==True:
sf_sites=x_igv.gnrt_sites_single_sample(sf_sites, sf_bam_list)
x_igv.prepare_igv_scripts_multi_bams(sf_sites, sf_bam_list, s_screenshot_folder, i_extnd, sf_gnm, sf_out)

####
elif options.collect: # collect the reads for each candidate site
@@ -1222,19 +1054,6 @@ def adjust_cutoff_tumor(ncutoff=-1, i_adjust=1):
# xlasm.assemble_all_TEIs_slurm_cluster(sf_candidate_list)
xlasm.assemble_all_phased_TEIs_slurm_cluster(sf_candidate_list)

elif options.short_read_assembly:#this is for insertion assembly from short reads --sr_asm
sf_candidate_list = options.input
sf_bam_list = options.bam
sf_ref = options.ref ###reference genome; some cram files require this file in order to be opened
s_working_folder = options.wfolder

n_jobs = options.cores
sr_mei_asm=ShortRead_MEI_ASM(sf_candidate_list, sf_bam_list, s_working_folder, n_jobs)
iextnd = 400 ###for each site, re-collect reads in range [-iextnd, iextnd]; this is around ins +/- 3*std deviation
bin_size = 50000000 # block size for parallelization
sf_ins_asm = options.output
sr_mei_asm.asm_MEI_PE_reads(sf_ref, iextnd, bin_size, sf_ins_asm)##

####
####
elif options.map:####align the asm to reference genome
@@ -1276,12 +1095,6 @@ def adjust_cutoff_tumor(ncutoff=-1, i_adjust=1):
xctg = XTEContig(s_working_folder, n_jobs)
xctg.validate_TEI_from_phased_asm_algnmt(sf_sites, flank_length, f_map_cutoff, i_slack, sf_repeat_copies,
sf_keep_sites)
elif options.sv:
sf_sites = options.input
sf_dels=options.output
te_del=TEDeletion()
m_del=te_del.call_del_matched_brkpnts(sf_sites)
te_del.output_del(m_del, sf_dels)

elif options.collect_clip:#collect the clipped reads for the sample
sf_bam_list = options.input
Expand All @@ -1296,12 +1109,3 @@ def adjust_cutoff_tumor(ncutoff=-1, i_adjust=1):
wfolder_pub_clip = options.cwfolder # public clip folder
# collect the clipped reads only
tem_locator.collect_all_clipped_from_multiple_alignmts(sf_annotation, b_se, s_clip_wfolder, wfolder_pub_clip)

elif options.visualization: ##show the shared barcode heatmap between the TEI site and source region
sf_matrix = options.input
####classify the TE insertions to normal ones and complex ones:
# for an insertion with a deletion: the other clipped part will not be aligned to a repeat region, but to the other breakpoint,
# while for normal ones, the other clipped part will be aligned to a repeat region
####2. python x_TEA_main.py -T -i ${VCF} -r ${ANNOTATION} -b ${BAM} -d ${BARCODE_BAM} -p ${TMP} -n 1
####4.
####
