Skip to content

kundajelab/atac_dnase_pipelines

 
 

Repository files navigation

ATAC Seq Pipeline

Installation instruction

General

Kundaje lab

SCG3/4

Usage

We recommend using BASH to run pipelines.

  1. Define parameters in command line argument.

For general use, use the following command line: (for PE data set)

$ bds atac.bds -fastq1_1 [READ1] -fastq1_2 [READ2] -bwt2_idx [BOWTIE2_INDEX] \
-gensz [GENOMESIZE; hs for human, mm for mouse] -chrsz [CHR_SIZES_FILE] -adapter [ADAPTER_TO_BE_TRIMMED]

If your fastqs are already trimmed, add the following to the command line to skip trimming stage.

-trimmed_fastq

IMPORTANT! If your data set is SINGLE ENDED add the following to the command line, otherwise the pipeline works for PE by default.

-se 

For ATAQC, define the following parameters. See help ($ bds atac.bds) for description of all parameters. Even though you don't use a species file -species_file, you need to specify a species name for ATAQC. You will get an ATAQC report per replicate. ATAQC is avaible only when you start a pipeline with FASTQ inputs.

-species [hg19, mm9 or ...] -tss_enrich [] -ref_fa [] -blacklist [] -dnase [] -prom [] -enh [] -reg2map [] -roadmap_meta []

If you want to just align your data (no peak calling or further steps like IDR).

-align

If you don't want ATAQC, add the following to command line.

-no_ataqc 

If you have just one replicate (PE), define fastqs with -fastq[REP_ID]_[PAIR_ID].

-fastq1_1 [READ_PAIR1] -fastq1_2 [READ_PAIR2] \

For multiple replicates (PE), define fastqs with -fastq[REP_ID]_[PAIR_ID]. Add -fastq[]_[] for each replicate and pair to the command line:replicates.

-fastq1_1 [READ_REP1_PAIR1] -fastq1_2 [READ_REP1_PAIR2] -fastq2_1 [READ_REP2_PAIR1] -fastq2_2 [READ_REP2_PAIR2] ...

For multiple replicates (SE), define fastqs with -fastq[REP_ID]:

-se -fastq1 [READ_REP1] -fastq2 [READ_REP2] ...

You can start from bam files. There are two kinds of bam files (raw or deduped) and you need to explicitly choose between raw bam (bam) and deduped one (filt_bam) with -input [BAM_TYPE]. Don't forget to add -se if they are not paired end (PE). For raw bams,

-bam1 [RAW_BAM_REP1] -bam2 [RWA_BAM_REP1] ...

For deduped (filtered) bams:

-filt_bam1 [FILT_BAM_REP1] -filt_bam2 [FILT_BAM_REP1] ...

For tagaligns (non-tn5-shifted):

-tag1 [TAGALIGN_REP1] -tag2 [TAGALIGN_REP2] ...

To subsample beds (tagaligns) add the following to the command line. This is different from subsampling for cross-corr. analysis. Peaks will be called with subsampled tagaligns.

-subsample [NO_READS_TO_SUBSAMPLE]

To change # of lines to subsample for cross-corr. analysis. This will not affect tasks downstream (peak calling and IDR).

-nreads [NO_READS_TO_SUBSAMPLE]

To disable pseudo replicate generation, add the following. By default, peak calling and IDR will be done for true replicates and pseudo replicates, but if you have -true_rep in the command line, you will also get IDR on true replicates only.

-true_rep

IDR analysis is included in the pipeline by default. For better IDR QC, define blacklist idr (for hg19, http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz). For other genomes, https://sites.google.com/site/anshulkundaje/projects/blacklists

-blacklist [BLACKLIST_BED]

If you don't want IDR, add the following:

-no_idr

To list all parameters and default values for them,

$ bds atac.bds
  1. Define parameters in configuration file. Key names in a configruation file are identical to parameter names on command line.
$ bds atac.bds [CONF_FILE]

$ cat [CONF_FILE]
fastq1_1= [READ1]
fastq1_2= [READ2]
...

Species file and Environment file

For Kundaje clusters and SCG3/4, do not define any species specific parameters like -tss_enrich, -ref_fa, -blacklist, -dnase, -prom, -enh, -reg2map, -roadmap_meta, -bwt2_idx, -gensz, -chrsz. They are already defined in ./species/kundaje.env and ./species/scg3.env. You just need to define the name of species -species [SPECIES; hg19, mm9, ...].

$ bds atac.bds -species [hg19 or mm9] ...

See details here

Parallelization and multi-threading (IMPORTANT!)

For completely serialized jobs, add the following. Individual jobs can still go multi-threaded.

-no_par

You can set up a limit for total # threads. Total # threads used by a pipeline will not exceed this limit. By default, it's 16 on SCG3/4 and 8 on Kundaje clusters and others.

-nth [MAX_TOTAL_NO_THREADS]

A pipeline automatically distributes [MAX_TOTAL_NO_THREADS] threads for jobs according to the corresponding input file sizes. For example of two fastqs (1GB and 2GB) with -nth 6, 2 and 4 threads are allocated for aligning 1GB and 2GB fastqs, respectively. The same policy applies to other multi-threaded tasks like deduping and peak calling.

However, all multi-threaded tasks (like bwa, bowtie2, spp and macs2) still have their own max. memory (-mem_APPNAME [MEM_APP]) and walltime (-wt_APPNAME [WALLTIME_APP]) settings. Max. memory is NOT PER CPU. On Kundaje cluster (with SGE flag activated bds -s sge chipseq.bds ...) or on SCG3/4, the actual shell command submitted by BDS for each task is like the following:

qsub -pe shm [NTH_ALLOCATED_FOR_APP] -h_vmem=[MEM_APP]/[NTH_ALLOCATED_FOR_APP] -h_rt=[WALLTIME_APP] ...

This ensures that total memory reserved for a cluster job equals to [MEM_APP].

How to efficiently manage multiple pipeline runs? (using UNIX screen)

./utils/bds_scr is a BASH script to create a detached screen for a BDS script and redirect stdout/stderr to a log file [LOG_FILE_NAME]. If a log file already exists, stdout/stderr will be appended to it. Monitor your pipeline with tail -f [LOG_FILE_NAME]. The only difference between bds_scr and bds is that you have [SCR_NAME] [LOG_FILE_NAME] between bds command and its parameters.

bds_scr [SCR_NAME] [LOG_FILE_NAME] atac.bds ...

You can skip [LOG_FILE_NAME] then a log file [SCR_NAME].log will be generated on the working directory.

bds_scr [SCR_NAME] atac.bds ...

You can also add any BDS parameters (like -dryRun, -d and -s). The following example is for running a pipeline on Sun Grid Engine.

bds_scr [SCR_NAME] [LOG_FILE_NAME] -s sge atac.bds ...

Once the pipeline run is done, the screen will be automatically closed. To kill a pipeline manually while it's running:

screen -X -S [SCR_NAME] quit

Or use ./utils/kill_scr:

kill_scr [SCR_NAME]

Temporary files on $TMP or /tmp

If you stop a BDS pipeline with Ctrl+C while calling peaks with spp. Temporary files generated by Rscript are not removed and they still exist on $TMP (or /tmp if not explicitly exported). You need to manually remove them.

How to add more functions to pipeline source code?

https://github.com/kundajelab/TF_chipseq_pipeline/blob/master/README_CODE.md

Requirements

For python2 (python 2.x >= 2.7) and R-3.x, here For python3, here

Troubleshooting

See more troubleshootings here

  1. pysam backward compatibility issue

ATAQC currently does not work with pysam >= 0.9. Lower it to 0.8.3.

Traceback (most recent call last):
  File "/users/leepc12/code/bds_atac/ataqc/run_ataqc.py", line 1303, in <module>
    main()
  File "/users/leepc12/code/bds_atac/ataqc/run_ataqc.py", line 1120, in main
    chr_m_reads, fraction_chr_m = get_chr_m(COORDSORT_BAM)
  File "/users/leepc12/code/bds_atac/ataqc/run_ataqc.py", line 160, in get_chr_m
    tot_reads += int(chrom_stats[2])
IndexError: list index out of range
  1. samtools ncurses bug

Prepend a directory for libncurses.so.5 to LD_LIBRARY_PATH. See install_dependencies.sh for solution.

samtools: symbol lookup error: /lib/x86_64-linux-gnu/libncurses.so.5: undefined symbol: _nc_putchar
  1. Error: could not find environment: bds_atac

Unload any Anaconda Python modules. Remove locally installed Anaconda Python from your $PATH.

Contributors

  • Jin wook Lee - PhD Student, Mechanical Engineering Dept., Stanford University
  • Chuan Sheng Foo - PhD Student, Computer Science Dept., Stanford University
  • Daniel Kim - MD/PhD Student, Biomedical Informatics Program, Stanford University
  • Anshul Kundaje - Assistant Professor, Dept. of Genetics, Stanford University