We recommend using BASH to run pipelines.
- Define parameters as command line arguments.
For general use (a PE data set), use the following command line:
$ bds atac.bds -fastq1_1 [READ1] -fastq1_2 [READ2] -bwt2_idx [BOWTIE2_INDEX] \
-gensz [GENOMESIZE; hs for human, mm for mouse] -chrsz [CHR_SIZES_FILE] -adapter [ADAPTER_TO_BE_TRIMMED]
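As a concrete illustration, a PE run on hg19 might look like the following; every file path and the adapter sequence are hypothetical placeholders, not files shipped with the pipeline:

```shell
# Hypothetical PE run on hg19; all paths and the adapter are placeholders.
bds atac.bds \
  -fastq1_1 rep1_R1.fastq.gz -fastq1_2 rep1_R2.fastq.gz \
  -bwt2_idx /data/bowtie2_index/hg19 \
  -gensz hs \
  -chrsz /data/hg19.chrom.sizes \
  -adapter AGATCGGAAGAGC
```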
If your fastqs are already trimmed, add the following to the command line to skip the trimming stage.
-trimmed_fastq
IMPORTANT! If your data set is SINGLE-ENDED, add the following to the command line; otherwise the pipeline assumes PE by default.
-se
For ATAQC, define the following parameters. See help ($ bds atac.bds) for a description of all parameters. Even though you don't use a species file (-species_file), you need to specify a species name for ATAQC. You will get an ATAQC report per replicate. ATAQC is available only when you start a pipeline with FASTQ inputs.
-species [hg19, mm9 or ...] -tss_enrich [] -ref_fa [] -blacklist [] -dnase [] -prom [] -enh [] -reg2map [] -roadmap_meta []
If you want to just align your data (no peak calling or further steps like IDR), add the following:
-align
If you don't want ATAQC, add the following to the command line.
-no_ataqc
If you have just one replicate (PE), define fastqs with -fastq[REP_ID]_[PAIR_ID]:
-fastq1_1 [READ_PAIR1] -fastq1_2 [READ_PAIR2] \
For multiple replicates (PE), define fastqs with -fastq[REP_ID]_[PAIR_ID]. Add -fastq[]_[] to the command line for each replicate and pair:
-fastq1_1 [READ_REP1_PAIR1] -fastq1_2 [READ_REP1_PAIR2] -fastq2_1 [READ_REP2_PAIR1] -fastq2_2 [READ_REP2_PAIR2] ...
For multiple replicates (SE), define fastqs with -fastq[REP_ID]:
-se -fastq1 [READ_REP1] -fastq2 [READ_REP2] ...
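Putting the SE flags together, a hypothetical two-replicate SE run (file names and paths are placeholders) could be:

```shell
# Hypothetical two-replicate SE run; file names and paths are placeholders.
bds atac.bds -se \
  -fastq1 rep1.fastq.gz -fastq2 rep2.fastq.gz \
  -bwt2_idx /data/bowtie2_index/hg19 \
  -gensz hs -chrsz /data/hg19.chrom.sizes
```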
You can start from bam files. There are two kinds of bam files (raw or deduped), and you need to explicitly choose between a raw bam (bam) and a deduped one (filt_bam) with -input [BAM_TYPE]. Don't forget to add -se if they are not paired end (PE). For raw bams:
-bam1 [RAW_BAM_REP1] -bam2 [RAW_BAM_REP2] ...
For deduped (filtered) bams:
-filt_bam1 [FILT_BAM_REP1] -filt_bam2 [FILT_BAM_REP2] ...
For tagaligns (non-tn5-shifted):
-tag1 [TAGALIGN_REP1] -tag2 [TAGALIGN_REP2] ...
To subsample beds (tagaligns), add the following to the command line. This is different from subsampling for cross-corr. analysis; peaks will be called with the subsampled tagaligns.
-subsample [NO_READS_TO_SUBSAMPLE]
To change the number of reads to subsample for cross-corr. analysis (this does not affect downstream tasks such as peak calling and IDR):
-nreads [NO_READS_TO_SUBSAMPLE]
To disable pseudo replicate generation, add the following. By default, peak calling and IDR are done for both true replicates and pseudo replicates; with -true_rep in the command line, they are done on true replicates only.
-true_rep
IDR analysis is included in the pipeline by default. For better IDR QC, define a blacklist (for hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz). For other genomes, see https://sites.google.com/site/anshulkundaje/projects/blacklists.
-blacklist [BLACKLIST_BED]
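For hg19, the blacklist linked above can be fetched and wired in directly; the rest of the command line is elided here:

```shell
# Download the hg19 blacklist referenced above, then pass it to the pipeline.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz
bds atac.bds ... -blacklist wgEncodeDacMapabilityConsensusExcludable.bed.gz
```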
If you don't want IDR, add the following:
-no_idr
To list all parameters and their default values:
$ bds atac.bds
- Define parameters in a configuration file. Key names in a configuration file are identical to parameter names on the command line.
$ bds atac.bds [CONF_FILE]
$ cat [CONF_FILE]
fastq1_1= [READ1]
fastq1_2= [READ2]
...
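For example, the PE flags from the first section translate into a configuration file like this (all paths are hypothetical placeholders):

```shell
# Write a config file equivalent to the PE command-line flags (paths are placeholders).
cat > atac.conf << 'EOF'
fastq1_1= rep1_R1.fastq.gz
fastq1_2= rep1_R2.fastq.gz
bwt2_idx= /data/bowtie2_index/hg19
gensz= hs
chrsz= /data/hg19.chrom.sizes
EOF
# Then run: bds atac.bds atac.conf
```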
For Kundaje clusters and SCG3/4, do not define any species-specific parameters like -tss_enrich, -ref_fa, -blacklist, -dnase, -prom, -enh, -reg2map, -roadmap_meta, -bwt2_idx, -gensz and -chrsz. They are already defined in ./species/kundaje.env and ./species/scg3.env. You just need to specify the species name with -species [SPECIES; hg19, mm9, ...].
$ bds atac.bds -species [hg19 or mm9] ...
See details here
For completely serialized jobs, add the following. Individual jobs can still run multi-threaded.
-no_par
You can set a limit on the total number of threads. The total number of threads used by a pipeline will not exceed this limit. By default, it is 16 on SCG3/4 and 8 on Kundaje clusters and elsewhere.
-nth [MAX_TOTAL_NO_THREADS]
A pipeline automatically distributes [MAX_TOTAL_NO_THREADS] threads across jobs according to the corresponding input file sizes. For example, for two fastqs (1GB and 2GB) with -nth 6, 2 and 4 threads are allocated for aligning the 1GB and 2GB fastqs, respectively. The same policy applies to other multi-threaded tasks like deduping and peak calling.
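The size-proportional split above can be sketched with shell arithmetic; the exact rounding BDS uses is assumed here, not taken from the pipeline source:

```shell
# Sketch of size-proportional thread allocation (rounding behavior assumed).
nth_total=6
size1=1   # GB, fastq for replicate 1
size2=2   # GB, fastq for replicate 2
nth1=$(( nth_total * size1 / (size1 + size2) ))
nth2=$(( nth_total - nth1 ))
echo "rep1: ${nth1} threads, rep2: ${nth2} threads"   # rep1: 2 threads, rep2: 4 threads
```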
However, all multi-threaded tasks (like bwa, bowtie2, spp and macs2) still have their own max. memory (-mem_APPNAME [MEM_APP]) and walltime (-wt_APPNAME [WALLTIME_APP]) settings. Max. memory is NOT PER CPU. On a Kundaje cluster (with the SGE flag activated: bds -s sge atac.bds ...) or on SCG3/4, the actual shell command submitted by BDS for each task looks like the following:
qsub -pe shm [NTH_ALLOCATED_FOR_APP] -h_vmem=[MEM_APP]/[NTH_ALLOCATED_FOR_APP] -h_rt=[WALLTIME_APP] ...
This ensures that the total memory reserved for a cluster job equals [MEM_APP].
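In other words, h_vmem is per slot, so the per-slot value is the task total divided by the allocated threads. A quick check of the arithmetic, with example numbers:

```shell
# h_vmem is per slot: per-slot memory = task total / allocated threads.
mem_app_gb=8   # example task total ([MEM_APP])
nth_app=4      # example slot count ([NTH_ALLOCATED_FOR_APP])
per_slot=$(( mem_app_gb / nth_app ))
echo "h_vmem=${per_slot}G per slot -> $(( per_slot * nth_app ))G reserved in total"
```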
./utils/bds_scr is a BASH script that creates a detached screen for a BDS script and redirects stdout/stderr to a log file [LOG_FILE_NAME]. If the log file already exists, stdout/stderr are appended to it. Monitor your pipeline with tail -f [LOG_FILE_NAME]. The only difference between bds_scr and bds is that you put [SCR_NAME] [LOG_FILE_NAME] between the bds command and its parameters.
bds_scr [SCR_NAME] [LOG_FILE_NAME] atac.bds ...
You can skip [LOG_FILE_NAME]; a log file [SCR_NAME].log will then be generated in the working directory.
bds_scr [SCR_NAME] atac.bds ...
You can also add any BDS parameters (like -dryRun, -d and -s). The following example runs a pipeline on Sun Grid Engine:
bds_scr [SCR_NAME] [LOG_FILE_NAME] -s sge atac.bds ...
Once the pipeline run is done, the screen will be automatically closed. To kill a pipeline manually while it's running:
screen -X -S [SCR_NAME] quit
Or use ./utils/kill_scr:
kill_scr [SCR_NAME]
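A typical session using the commands above might look like this (the screen and log names are placeholders):

```shell
# Hypothetical bds_scr session; screen/log names are placeholders.
bds_scr my_atac my_atac.log atac.bds -species hg19 ...
tail -f my_atac.log   # Ctrl+C here stops tail only, not the pipeline
screen -ls            # the detached screen "my_atac" should be listed
kill_scr my_atac      # only if you need to abort the run
```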
If you stop a BDS pipeline with Ctrl+C while peaks are being called with spp, the temporary files generated by Rscript are not removed; they remain in $TMP (or /tmp if $TMP is not explicitly exported). You need to remove them manually.
https://github.com/kundajelab/TF_chipseq_pipeline/blob/master/README_CODE.md
For python2 (python 2.x >= 2.7) and R-3.x, see here. For python3, see here.
See more troubleshooting here.
- pysam backward compatibility issue
ATAQC currently does not work with pysam >= 0.9. Lower it to 0.8.3.
Traceback (most recent call last):
File "/users/leepc12/code/bds_atac/ataqc/run_ataqc.py", line 1303, in <module>
main()
File "/users/leepc12/code/bds_atac/ataqc/run_ataqc.py", line 1120, in main
chr_m_reads, fraction_chr_m = get_chr_m(COORDSORT_BAM)
File "/users/leepc12/code/bds_atac/ataqc/run_ataqc.py", line 160, in get_chr_m
tot_reads += int(chrom_stats[2])
IndexError: list index out of range
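With pip, pinning pysam below 0.9 is one line (assuming pip manages the Python environment ATAQC runs in):

```shell
# Downgrade pysam to the version the README recommends for ATAQC.
pip install "pysam==0.8.3"
```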
- samtools ncurses bug
Prepend a directory containing libncurses.so.5 to LD_LIBRARY_PATH. See install_dependencies.sh for the solution.
samtools: symbol lookup error: /lib/x86_64-linux-gnu/libncurses.so.5: undefined symbol: _nc_putchar
- Error: could not find environment: bds_atac
Unload any Anaconda Python modules and remove locally installed Anaconda Python from your $PATH.
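One generic way to drop Anaconda directories from $PATH for the current shell session (a sketch; adjust the pattern if your install directory is named differently):

```shell
# Strip any PATH entries containing "anaconda" (case-insensitive) for this session.
export PATH=$(echo "$PATH" | tr ':' '\n' | grep -vi 'anaconda' | paste -sd ':' -)
```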
- Jin Wook Lee - PhD Student, Mechanical Engineering Dept., Stanford University
- Chuan Sheng Foo - PhD Student, Computer Science Dept., Stanford University
- Daniel Kim - MD/PhD Student, Biomedical Informatics Program, Stanford University
- Anshul Kundaje - Assistant Professor, Dept. of Genetics, Stanford University