This software bundle is a basic implementation of the algorithm for extracting peak-to-trough ratios (PTRs) from metagenomic data, as first described in (Korem et al., Science, 2015).
Make sure that "pip" belongs to your Python 2 installation (so that packages from PyPI are installed for that interpreter), then run:
pip install menace
git clone git@github.com:zertan/Menace.git
cd Menace
python setup.py install
This should install the Python dependencies listed below. The other dependencies have to be installed manually (if you have questions about this, consult your cluster IT help desk).
The software has been tested on the "hebbe" cluster at C3SE, which uses the "slurm" system for resource management; slurm is therefore the only queueing system currently supported.
Python2:
numpy
scipy
pandas
biopython
matplotlib
xmltodict
configparser
lmfit
newick
Jinja2
doric
-e git+https://github.com/PathoScope/PathoScope.git#egg=pathoscope
Pathoscope 2.0 (should be installed by the above pip command, but make sure 'pathoscope ID' is accessible in the shell, i.e. is on the system path)
DoriC is a database of chromosome origin locations (OriCs) and an optional but recommended dependency of the pipeline. Please visit the link and enter your e-mail to download.
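Whether a command is reachable from the shell can be checked with a few lines of Python. The following is a minimal sketch for Python 3 using only the standard library; the helper name is_on_path is ours, "pathoscope" is the command name given above, and "bowtie2" is an assumption based on the bowtie2 index files mentioned later in this README.

```python
import shutil


def is_on_path(executable):
    """Return True if `executable` can be found on the current PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    # "pathoscope" is the command named in this README; "bowtie2" is an
    # assumed name for one of the manually installed dependencies.
    for name in ("pathoscope", "bowtie2"):
        status = "found" if is_on_path(name) else "NOT found"
        print("{}: {}".format(name, status))
```

If a tool is reported as not found, check that the relevant cluster module is loaded or that its install directory has been added to PATH.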
You can get an overview of the menace functionality by running "menace -h".
- Initialize a project in current directory by running "menace init". Identify a set of NCBI genome reference accession numbers and put them in "./searchStrings" (or use the default one which includes a minimal set of references to bacteria common in the human gut).
- Identify a metagenomic cohort of interest (download manually or add URLs as described below) and add to the Data folder. Supported input: raw/gzipped/bzipped ".fastq" files.
- Add information to the "project.conf" file.
- Edit "loadmodules.sh" to include the python2 module of the cluster (or comment out the lines if python2 is accessible by default).
- Run "menace full" (use "nohup {cmd} &" to keep alive after logout if on a cluster login node).
- Wait for job to complete. Run "menace collect" in project directory.
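Before submitting a long cluster job it can be worth checking that the input files in the Data folder are readable and complete. The sketch below (Python 3, standard library only) opens raw, gzipped or bzipped FASTQ files and counts reads. It is not part of menace: the function names are ours, and it assumes the common four-lines-per-record FASTQ layout.

```python
import bz2
import gzip


def open_fastq(path):
    """Open a raw, gzipped or bzipped FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count reads, assuming four lines per FASTQ record."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated download typically shows up here.
        raise ValueError(
            "{}: line count {} is not a multiple of 4".format(path, n_lines)
        )
    return n_lines // 4
```

For paired reads, comparing count_reads on both mate files is a cheap way to spot an incomplete download.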
The menace script is a common utility for all parts of the pipeline, including downloading references and metagenomic data, building a reference index, setting up the necessary file structure and submitting to slurm. Hence, all configuration is intended to be set up in project.conf (please see bin/project.conf.example for an example).
The default 'searchStrings' is only an example and will most probably not fit your purposes. A more comprehensive reference library will yield higher coverage and more accurate values. A more comprehensive list of human gut bacteria is available at 'extra/referenceACClong.txt'.
With the above usage example, the path structure(s) will look something like the following.
$DATA_PATH
├ "Sample01" (e.g. ERR525688)
. ├ {sample01_1.fastq.gz}
. └ {sample01_2.fastq.gz} paired metagenomic reads
.
$REF_PATH
├ Index
| └ {REF_NAME.*.bt2l} bowtie2 index files
├ Fasta
| └ {accession.fasta}
├ Headers
| └ {accession.xml} xml files containing extra genome references info
└ taxIDs.txt
$DORIC_PATH
├ bacteria_record.dat
└ bacteria_seq.fas
$OUTPUT_PATH
├ "Sample01"
. ├ depth
. | └ {accession.depth} coverage files for each reference
. ├ log
  | └ {accession.log} output logs from piecewiseFit
  ├ npy
  | └ {accession_OriC_TerC.npy} numpy files with origin/terminus locations and relative C periods
  ├ png
  | └ {accession_fit.png} images of piecewise fit of the smoothed coverage
  └ accession-sam-report.tsv Pathoscope2 reassignment report
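To see which references produced coverage files for which sample, the output layout above can be traversed with a short script. This is a sketch for Python 3 (standard library only) that assumes the layout shown above, with one directory per sample containing a "depth" folder of "*.depth" files; the function name is ours.

```python
from pathlib import Path


def list_depth_files(output_path):
    """Map each sample directory name to the sorted names of its depth files."""
    result = {}
    for sample_dir in sorted(Path(output_path).iterdir()):
        depth_dir = sample_dir / "depth"
        # Skip anything that is not a sample directory with a depth folder.
        if depth_dir.is_dir():
            result[sample_dir.name] = sorted(
                p.name for p in depth_dir.glob("*.depth")
            )
    return result
```

Calling list_depth_files with the value of $OUTPUT_PATH returns a dictionary keyed by sample name, which makes it easy to spot samples where no depth files were written.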
Below follows a description of the main scripts in the package.
A submit script for sending a batch job to slurm for parallel processing on a computing cluster.
input: none
output: directory structure as specified in "project.conf"
The main build script with commands intended to be executed on the cluster.
input: none
output: temporary paths and files on compute nodes
Traverses the specified directory generated by mainBuild.sh and assembles information from each sample into tabular form (e.g. averages origin locations from many samples for a better estimate).
input: $OUTPUT_PATH, $DORIC_PATH, $REF_PATH, bin/accLoc.csv
output: Abundance.csv, PTR.csv, DoublingTime.csv, Header.csv
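This README does not say how the origin locations are averaged. Because bacterial chromosomes are typically circular, a plain arithmetic mean can fail when estimates fall on either side of position 0, and a circular mean is one way around that. The sketch below (Python 3, standard library only) illustrates that idea under this assumption; it is not necessarily what the collect step does, and the function name is ours.

```python
import math


def circular_mean_position(positions, genome_length):
    """Average positions on a circular genome; result lies in [0, genome_length)."""
    # Map each position to an angle on the unit circle.
    angles = [2.0 * math.pi * p / genome_length for p in positions]
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    # atan2 returns an angle in [-pi, pi]; wrap it to a non-negative angle.
    mean_angle = math.atan2(mean_sin, mean_cos) % (2.0 * math.pi)
    # Convert back to a position; the final modulo guards against rounding
    # that would otherwise return genome_length itself.
    return (mean_angle * genome_length / (2.0 * math.pi)) % genome_length
```

For example, with a genome length of 1000, estimates at positions 990 and 10 average to a point next to position 0 on the circle, whereas the arithmetic mean would be 500, on the opposite side of the chromosome.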
Implements the piecewise linear fit, preceded by checks on the generated depth files that select those instances in which enough data was generated to produce a reliable coverage signal for estimating replication origins. Once the origins have been estimated using the full cohort, this data can be used to produce PTR values for each sample.
input: {reference.depth}
output: {reference_OriC.npy}, {reference_TerC.npy}, {reference_coverage.png}, {reference_fit.log}
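To illustrate the idea behind the piecewise fit, here is a deliberately simplified sketch in Python 3 (standard library only). It is not the package's fitting code. It assumes binned, strictly positive coverage values, places the terminus exactly opposite the origin on the circular chromosome, and replaces a proper optimiser with a grid search over candidate origins and an ordinary least-squares line on log2 coverage. All function names are ours.

```python
import math


def circular_distance(x, origin, genome_length):
    """Shortest distance between two positions on a circular genome."""
    d = abs(x - origin) % genome_length
    return min(d, genome_length - d)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss


def estimate_origin_and_ptr(positions, coverage, genome_length, n_candidates=100):
    """Grid-search the origin and return (origin, PTR).

    Simplifying assumption: the terminus lies half a genome away from the
    origin, so log2 coverage falls linearly with distance from the origin.
    """
    log_cov = [math.log2(c) for c in coverage]  # coverage must be > 0
    best = None
    for k in range(n_candidates):
        origin = k * genome_length / n_candidates
        dist = [circular_distance(p, origin, genome_length) for p in positions]
        slope, _, rss = fit_line(dist, log_cov)
        # Coverage must decrease away from the origin; a positive slope
        # means this candidate is closer to the terminus.
        if slope < 0 and (best is None or rss < best[0]):
            best = (rss, origin, slope)
    if best is None:
        raise ValueError("no candidate origin with decreasing coverage found")
    _, origin, slope = best
    # Fitted log2 coverage drops by -slope * L/2 between origin and terminus.
    ptr = 2.0 ** (-slope * genome_length / 2.0)
    return origin, ptr
```

The PTR returned here is the ratio of the fitted coverage at the origin to the fitted coverage at the terminus. Real coverage is noisy, so smoothing and the sanity checks described above matter in practice.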
This utility can be used to download '.fasta' reference files from the NCBI servers.
input: searchStrings.txt
output: {reference.fasta}, {reference.xml}, taxIDs.txt
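For readers who want to fetch a single reference by hand, the sketch below (Python 3, standard library only) builds request URLs for the public NCBI E-utilities "efetch" endpoint. It illustrates the general approach and is not the utility's actual code. The function names are ours, and the input format (one accession number per line, with blank lines and lines starting with "#" ignored) is an assumption.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(path):
    """Read accession numbers, assuming one per line (hypothetical format)."""
    accessions = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                accessions.append(line)
    return accessions


def build_efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for one nucleotide record."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # NC_000913.3 is the E. coli K-12 MG1655 reference genome.
    print(build_efetch_url("NC_000913.3"))
```

The resulting URL can be opened with urllib.request.urlopen or any HTTP client. NCBI asks clients to limit their request rate, so pause between requests when fetching many accessions.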