Specific Insertions Detector (SID) is a program for detecting non-reference human transposon insertions. It is written in Perl and runs in two steps: discordant read detection and read clustering. The first step collects informative reads and generates other necessary files; the second step identifies the specific insertion sites and writes the final results to a plain-text file.
In the first step, SID usually uses less than 1 GB of memory. In the second step, peak memory usage can reach 30 GB with one thread on 30-70X human whole-genome sequencing data.
- Samtools v1.0 or later (earlier versions may cause errors in the discordant read detection step)
- Perl modules: Bio::DB::Sam, Statistics::Descriptive, threads::shared, IO::File
- BLAST (v2.2.25 or later)
- BAM file: paired-end sequencing data aligned with BWA aln. The alignments must contain at least the XT, X1, and MD tags.
- FASTA files: the sequences of the non-reference TEs and the human reference genome. Both need BLAST indices.
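The tag requirement on the BAM file can be sanity-checked before a long run. The following is a minimal sketch, not part of SID: `has_bwa_tags` is a hypothetical helper name, and it only checks that each of XT, X1, and MD appears at least once in the SAM records it reads from standard input (unmapped reads normally lack these tags, so it does not require them on every record).

```shell
# Hypothetical helper (not part of SID): read SAM records on stdin and
# succeed only if the XT, X1 and MD tags each appear at least once.
has_bwa_tags() {
  awk -F '\t' '
    !/^@/ {
      # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
      for (i = 12; i <= NF; i++) {
        key = substr($i, 1, 3)
        if (key == "XT:" || key == "X1:" || key == "MD:") seen[substr(key, 1, 2)] = 1
      }
    }
    END {
      status = 0
      n = split("XT X1 MD", want, " ")
      for (j = 1; j <= n; j++) {
        if (!(want[j] in seen)) { print "tag not seen: " want[j]; status = 1 }
      }
      exit status
    }'
}

# Example use on the first records of a BAM file (file name is a placeholder):
#   samtools view sample.bam | head -n 100000 | has_bwa_tags
```

Sampling only the first records keeps the check fast; a missing tag in the sample is a hint to inspect the alignment settings, not proof that the whole file lacks it.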
We are preparing a paper that uses this program.
- 01discordant_v2.pl: v2.0
- 02cluster.pl: v1.0
- The program accepts one or more BAM files, listed in a BAM_list file (plain text), as input.
- You must build a BLAST index for the TE sequences and put it in the same directory as the TE FASTA file.
- Do not remove duplicates from the input BAM file before running this program; otherwise it may stop unexpectedly.
- The '-run' parameter cannot be used at present; we will fix it soon.
- Please add the BLAST and Samtools directories to your PATH (for example, with an export line in .bashrc) before running SID.
- Users need to install the dependencies and Perl modules listed above before running SID. SID itself does not need to be installed.
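The notes above can be turned into a short pre-flight check. This is a sketch under stated assumptions rather than anything shipped with SID: `check_tools` and `make_bam_list` are hypothetical helper names, the directories in the commented export line are placeholders, and the one-path-per-line layout of the BAM_list file is an assumption, since the README only says it is plain text.

```shell
# In ~/.bashrc (placeholder directories, adjust to your installation):
#   export PATH="/path/to/samtools:/path/to/blast/bin:$PATH"

# Hypothetical helper: report any of the named tools that are not on PATH.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: write a BAM list file, ASSUMING one path per line.
make_bam_list() {
  list_file=$1
  shift
  : > "$list_file"
  for bam in "$@"; do
    [ -f "$bam" ] || { echo "not found: $bam" >&2; return 1; }
    printf '%s\n' "$bam" >> "$list_file"
  done
}

# Example use (file names are placeholders):
#   check_tools samtools blastn   # or blastall for legacy BLAST
#   make_bam_list BAM_list sample1.sorted.bam sample2.sorted.bam
```

Which BLAST executable SID calls is not stated here, so check the one that matches your installation. With BLAST+, a nucleotide index is typically built with `makeblastdb -in <TE FASTA> -dbtype nucl`; legacy BLAST uses `formatdb` instead.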
I have uploaded a usage example, together with demo input and output, to Google Drive. Please contact me with any questions: [email protected] or [email protected]
https://drive.google.com/file/d/0B-5j9b_mSd_GaEI1eUxvamh2SWM/view?usp=sharing
Note that "Test.sorted.bam" in "example.zip" only illustrates the format of the input BAM files; it cannot be used directly. The real input BAM files are available from the GigaDB FTP site (ftp://penguin.genomics.cn/pub/10.5524/100001_101000/100318/Alignment/). Please merge these split BAM files into one file and sort it before running SID.
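The merge-and-sort step could look like the sketch below. It is not part of SID: `merge_and_sort` is a hypothetical helper, the output names are placeholders, and the `sort -o` form assumes Samtools 1.3 or later (versions 1.0 to 1.2 take an output prefix instead). The `SAMTOOLS` variable is only there so that the executable can be overridden.

```shell
# Hypothetical helper (not part of SID): merge split BAM files, then
# coordinate-sort the result. Assumes Samtools 1.3+ for "sort -o".
merge_and_sort() {
  if [ "$#" -lt 2 ]; then
    echo "usage: merge_and_sort <out_prefix> <in1.bam> [in2.bam ...]" >&2
    return 2
  fi
  out_prefix=$1
  shift
  # SAMTOOLS can point at a specific samtools binary; defaults to PATH lookup.
  "${SAMTOOLS:-samtools}" merge "${out_prefix}.merged.bam" "$@" &&
    "${SAMTOOLS:-samtools}" sort -o "${out_prefix}.sorted.bam" "${out_prefix}.merged.bam"
}

# Example use (file names are placeholders):
#   merge_and_sort sample part1.bam part2.bam part3.bam
```

`samtools merge` expects its inputs to be sorted in the same order; if the split files are not, `samtools cat` followed by the same sort is an alternative.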