A toolkit for measuring and comparing ATAC-seq results, made in the Parker lab at the University of Michigan. We wrote it to help us understand how well our ATAC-seq assays had worked, and to make it easier to spot differences that might be caused by library prep or sequencing.
The main program, ataqv, examines your aligned reads and reports some basic metrics, including:
- reads mapped in proper pairs
- optical or PCR duplicates
- reads mapping to autosomal or mitochondrial references
- the ratio of short to mononucleosomal fragment counts
- mapping quality
- various kinds of problematic alignments
If you also have a file of peaks called on your data, that file can be examined to report read coverage of the peaks.
The report is printed as plain text to standard output, and detailed metrics are written to JSON files for further processing.
A web-based visualization and comparison tool and a script to prepare the JSON output for it are also provided. The web viewer includes interactive tables of the metrics and plots of fragment length, distance from a fragment length reference distribution, mapping quality, counts of reads overlapping peaks, and peak territory.
Web viewer demo: https://parkerlab.github.io/ataqv/demo/
If you have questions or suggestions, mail us at [email protected].
To build ataqv, you need make, a C++ compiler, Boost, and htslib.
The mkarv script, which collects ataqv results and sets up a web application to visualize them, requires Python 2.7 or newer.
To run the test suite, you'll also need LCOV, which can be installed via Homebrew or Linuxbrew. Note that on Macs with Xcode 8, LCOV 1.12 and earlier cannot find the coverage files, because of Apple's frequent changes to their gcov version output. A fix has been made in LCOV but not yet released. Once it is released and the Homebrew formula is updated, the test suite coverage report should work on Macs.
The easiest way to install ataqv is via Homebrew on Macs, or Linuxbrew on Linux, using our tap. At a shell prompt:
brew tap ParkerLab/tap
brew install ataqv
You can also just clone the Git repository and build with make. At your shell prompt:
git clone https://github.com/ParkerLab/ataqv
cd ataqv
make
If Boost and htslib are not available in default system locations (for example, if you're using environment modules or compiling in your home directory), you'll probably need to give make some hints via the CPPFLAGS and LDFLAGS variables:
make CPPFLAGS="-I/path/to/boost/include -I/path/to/htslib/include" LDFLAGS="-L/path/to/boost/lib -L/path/to/htslib/lib"
If the environment variables BOOST_ROOT or HTSLIB_ROOT are set to directories containing include and lib subdirectories, the compiler configuration can be made simpler:
make BOOST_ROOT=/path/to/boost HTSLIB_ROOT=/path/to/htslib
Or you can specify directories in BOOST_INCLUDE, BOOST_LIB, HTSLIB_INCLUDE, and HTSLIB_LIB separately.
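Since the BOOST_ROOT and HTSLIB_ROOT shortcut depends on each directory having include and lib subdirectories, it can help to check the layout before running make. The following is a minimal sketch; the function name has_include_and_lib is hypothetical and not part of ataqv.

```shell
# Hypothetical helper (not part of ataqv): succeeds only if the given
# directory contains both an include/ and a lib/ subdirectory.
has_include_and_lib() {
    [ -d "$1/include" ] && [ -d "$1/lib" ]
}

# Example use (the paths are placeholders):
# for root in /path/to/boost /path/to/htslib; do
#     has_include_and_lib "$root" || echo "no include/ or lib/ under $root" >&2
# done
```

If a root fails the check, the separate BOOST_INCLUDE, BOOST_LIB, HTSLIB_INCLUDE, and HTSLIB_LIB variables are probably the better fit.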
If you use custom locations like these, you will probably need to set LD_LIBRARY_PATH so that the shared libraries can be found at runtime:
export LD_LIBRARY_PATH=/path/to/boost/lib:/path/to/htslib/lib:$LD_LIBRARY_PATH
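One detail about the export above: if LD_LIBRARY_PATH was previously unset, the result ends with a trailing colon, and the dynamic linker treats an empty entry as the current directory. A small sketch that avoids this is shown below; the function name prepend_ld_library_path is hypothetical.

```shell
# Hypothetical helper (not part of ataqv): prepend a directory to
# LD_LIBRARY_PATH, adding the ':' separator only when the variable
# already has a non-empty value.
prepend_ld_library_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Example use (the paths are placeholders):
# prepend_ld_library_path /path/to/htslib/lib
# prepend_ld_library_path /path/to/boost/lib
```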
If your Boost installation used their "tagged" layout, the libraries will include metadata in their names; on Linux this usually just means that they'll have a -mt suffix to indicate multithreading support. Specify BOOST_TAGGED=yes in your make commands to link with those.
If htslib was built to use libcurl, you'll need to link with that as well:
make HTSLIBCURL=yes
You can just copy build/ataqv and src/scripts/* wherever you like, or run them from your copy of the ataqv repository. If you want to install them to a bin directory somewhere, for example /usr/local/bin, you can run:
make install PREFIX=/usr/local
Support for the Environment Modules system is also included. You can install to the modules tree by defining the MODULES_ROOT and MODULEFILES_ROOT variables. If your modules are kept under /opt/modules, with their accompanying module files under /opt/modulefiles, run:
make install-module MODULES_ROOT=/opt/modules MODULEFILES_ROOT=/opt/modulefiles
And then you should be able to run module load ataqv to have everything available in your environment.
You'll need to have a BAM file containing alignments of your ATAC-seq reads to your reference genome. If you want accurate duplication metrics, you'll also need to have marked duplicates in that BAM file. If you have a BED file containing peaks called on your data, ataqv can produce some additional metrics using that.
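A quick way to see whether a BAM file carries any duplicate marks is to count the alignments with the SAM duplicate flag (0x400, decimal 1024) set. The sketch below assumes samtools is installed; the function name count_marked_duplicates is hypothetical.

```shell
# Hypothetical helper (not part of ataqv): print the number of alignments
# in a BAM file that have the duplicate flag (0x400 = 1024) set.
count_marked_duplicates() {
    samtools view -c -f 1024 "$1"
}

# Example use (the file name is a placeholder):
# count_marked_duplicates sample.bam
```

A count of zero on a typical ATAC-seq library suggests that duplicates have not been marked yet, though it cannot prove it.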
Verifying ataqv results with data from a variety of common tools is on our to-do list, but so far, we've only used bwa, Picard's MarkDuplicates, and MACS2 for these steps. A pipeline like ours can be generated with the included make_ataqv_pipeline script. The pipeline it generates starts from a BAM file of aligned reads, marks duplicates, calls peaks, then runs ataqv and produces a web viewer for the output.
The main program is ataqv. Run ataqv --help for complete instructions.
When run, ataqv prints a human-readable summary to its standard output, and writes complete metrics to the file named with the --metrics-file option.
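As a sketch of how the two outputs can be kept together, the wrapper below redirects the text summary to one file and names the metrics file with --metrics-file. Only --metrics-file comes from the description above. The positional arguments (an organism name followed by the BAM file), the function name run_ataqv, and the output file names are assumptions for illustration; confirm the real usage with ataqv --help.

```shell
# Sketch only (not part of ataqv). The positional arguments passed to
# ataqv (organism, then BAM file) are assumptions; check ataqv --help.
run_ataqv() {
    organism=$1   # assumed first positional argument
    bam=$2        # assumed second positional argument
    prefix=$3     # hypothetical prefix for both output files
    # The human-readable summary goes to stdout, so redirect it to a file;
    # the detailed JSON metrics go to the file named by --metrics-file.
    ataqv --metrics-file "${prefix}.json" "$organism" "$bam" > "${prefix}.out"
}

# Example use (all names are placeholders):
# run_ataqv human sample.bam sample.ataqv
```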
The JSON output can be incorporated into a web application that presents tables and plots of the metrics, and makes it easy to compare results across samples or experiments. Use the mkarv script to create a local instance of the result viewer. A web server is not required, though you can use one to publish your result viewer instance.
The ataqv package includes a script that will set up and run our entire ATAC-seq pipeline on some sample data.
You'll need to have installed ataqv itself, plus Picard tools, samtools, and MACS2 to run the pipeline. On a Mac, you can obtain everything with:
$ brew install ataqv picard-tools samtools
$ pip install MACS2
On Linux, installation of the dependencies is probably specific to your environment and is left as an exercise for the reader. On Debian, apt-get install picard-tools samtools followed by installing MACS2 with pip install MACS2 should be enough.
Once you have the prerequisite programs installed, you can run the example pipeline with:
$ run_ataqv_example /output/path
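Before starting the example, it may be worth confirming that the prerequisite programs are on your PATH. The sketch below uses the shell's command -v; the function name missing_commands is hypothetical, and the executable names in the usage comment are assumptions. In particular, the name of the Picard launcher varies between installations.

```shell
# Hypothetical helper (not part of ataqv): print the name of each
# given command that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example use (executable names are assumptions; adjust for your system):
# missing_commands ataqv samtools macs2
```

If the function prints nothing, every named command was found.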
Part of this project will be publishing ataqv output for as many ATAC-seq experiments as we can get our hands on, so we can compare them and learn how changes to the protocol affect the output. Watch our GitHub docs for updates.
ataqv is not currently concurrent, so don't allocate more than a single processor to it. Memory usage should typically be no more than a few hundred megabytes.
Anecdotally, processing a 41GB BAM file containing 1,126,660,186 alignments of the data from the ATAC-seq paper took just under 20 minutes and 2GB of memory. Adding peak metrics extended the run time to almost 40 minutes, but it still used the same amount of memory.
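To put those figures in perspective, the arithmetic below converts them to a rough throughput. It comes from a single anecdotal run, and the times are rounded to 20 and 40 minutes, so treat the results as ballpark values rather than a benchmark. The function name alignments_per_second is hypothetical.

```shell
# Hypothetical helper (not part of ataqv): integer alignments per second,
# given a total alignment count and a run time in minutes.
alignments_per_second() {
    echo $(( $1 / ($2 * 60) ))
}

# Example use with the figures quoted above:
# alignments_per_second 1126660186 20   # without peak metrics
# alignments_per_second 1126660186 40   # with peak metrics
```

With those inputs, the throughput works out to roughly 939,000 alignments per second without peak metrics and about 469,000 with them.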