The agotron_detector toolkit identifies and quantifies agotrons in Ago CLIPseq datasets.
bowtie2 (tested with v2.2.8)
samtools (tested with v1.3)
python (tested with v2.7.11) including modules: pysam, numpy, and mySQL
R (tested with v3.2.5) including packages: ggplot2, ggregex, dplyr, tidyr, and optparse
The agotron_detector repository basically contains three scripts that should be run sequencially:
(1) UCSC_intron_retriever.py, a python script to extract short introns from the UCSC mySQL server
Usage:
UCSC_intron_retriever.py [ARGUMENTS] > [OUTPUT]
Options:
-db <string> MySQL database (default='hg19')
-table <string> MySQL table (default='refGene')
-max <int> Maximum intron length (default=150)
-min <int> Minimum intron length (default=50)
Alternatively, use UCSC_mirna_retriever.py or tophat_intron_retriever.py to retrieve RefSeq annotated miRNA coordinates or intron coordinates from tophat-produced junctions.bed.
(2) analyzer.py, a python script to intersect mapped reads with coordinates of interest and output agotron-relevant features
Usage:
analyzer.py [ARGUMENTS] < [INPUT] > [OUTPUT]
Required arguments:
-g <file> Path to reference genome fastafile, must be indexed with samtools faidx
Optional arguments:
-f <files…> Input bam-files (default=*.bam)
-c <file> Filename for coverage output (if empty, no coverage file is produced)
(default='coverage.txt')
-tr <int> RPMM (reads per mapped million) expression threshold for output
(default = 5)
-ts <int/'all'> How many samples to meet RPMM expression threshold. For all samples, type ‘all’
(default = 2)
-m <float> Tolerance for reads mapping partly outside locus
(default=0.1, e.g. 10% of the reads is allowed to map outside locus)
-q <int> Threshold for mapping quality
(default=13)
-a <int> Add <int> flanking nucleotides (up and downstream) to the loci sequence output
(default=10)
(3) annotater.R, an R script that annotates agotron and outputs a few different plots
Usage:
annotater.R [ARGUMENTS] < [INPUT]
optional arguments:
-c <file> Input coverage file (the -c output from analyzer.py)
(default='coverage.txt')
-p <string> Prefix used in output files
(default='agotron')
agotron definition:
-m <int> Threshold for median read length
(default=30)
-h <float> Minimum fraction of reads with distance (-d) between 5’end of read and 5’end of locus
(default=0.7)
-d <int> Maximum distance allowed from predominant 5’end of reads to 5’end of locus
(default=1)
Clone the repository and run the example script, all_commands.sh. This:
- Downloads and prepares reference genome (hg19)
- Downloads and trims Ago CLIPseq dataset (GSE78059)
- Maps dataset to the reference genome using bowtie2
- Intersects the mapped reads (bam-files) with annotated short introns to detect and annotate agotrons.
sh all_commands.sh
To annotate agotrons in mapped data (use -db and -g options to specify the reference genome used):
python UCSC_intron_retriever.py -db hg19 | python analyzer.py -g /path/to/hg19.fa -f /path/to/*.bam | Rscript annotater.R
Hansen TB. Detecting agotrons in Ago CLIPseq data. MiMB, 2017, submitted
Copyright (C) 2017 ncRNALab. See the LICENSE file for license rights and limitations (MIT).