MGnify genome analysis pipeline

MGnify CWL pipeline to characterize a set of isolate or metagenome-assembled genomes (MAGs) using the workflow described in the following publication:

A Almeida, S Nayfach, M Boland, F Strozzi, M Beracochea, ZJ Shi, KS Pollard, DH Parks, P Hugenholtz, N Segata, NC Kyrpides and RD Finn. A unified sequence catalogue of over 280,000 genomes obtained from the human gut microbiome. bioRxiv. doi: https://doi.org/10.1101/762682

Installation

Install the necessary dependencies:

cwltool (tested v1.0.2)
R (tested v3.5.2). Packages: reshape2, fastcluster, optparse, data.table and ape.
Python v2.7 and v3.6
CheckM (tested v1.0.11)
CAT (tested v5.0)
GTDB-Tk (tested v0.3.1 and v1.0.2)
dRep (tested v2.2.4)
Prokka (tested v1.14.0)
Roary (tested v3.12.0)
MMseqs2 (tested v8-fac81)
InterProScan (tested v5.35-74.0 and v5.38-76.0)
eggNOG-mapper (tested v2.0)

Make sure all these tools, as well as the custom_scripts/ folder, are on your $PATH environment variable.
Edit custom_scripts/taxcheck.sh to point CAT to the installed diamond and database paths (variables $diamond_path, $cat_db_path and $cat_tax_path).
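Before launching the workflow, it can help to confirm that the tools resolve on $PATH. The sketch below is one way to do that with Python's standard library. The executable names (for example interproscan.sh and emapper.py) are assumptions about how each tool is typically installed and may differ on your system; the helper name missing_executables is hypothetical.

```python
import shutil

# Assumed executable names; adjust these to match your installation.
EXPECTED_EXECUTABLES = [
    "cwltool", "Rscript", "checkm", "CAT", "gtdbtk", "dRep",
    "prokka", "roary", "mmseqs", "interproscan.sh", "emapper.py",
    "taxcheck.sh",  # lives in custom_scripts/, which must also be on $PATH
]


def missing_executables(names, which=shutil.which):
    """Return the names that cannot be resolved on $PATH."""
    return [name for name in names if which(name) is None]


missing = missing_executables(EXPECTED_EXECUTABLES)
if missing:
    print("Not found on $PATH: " + ", ".join(missing))
else:
    print("All expected executables were found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.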

How to run

Add the path of the input genomes folder to the YML file workflows/yml_patterns/wf-1.yml.
Run the first workflow, redirecting its output JSON to a separate file:
cwltool workflows/wf-1.cwl workflows/yml_patterns/wf-1.yml > output-wf-1.json
Run the parser on the output JSON:
python3 workflows/parser_yml.py -j output-wf-1.json -y workflows/yml_patterns/wf-2.yml
Check the exit code of the parser (run this immediately after the parser command):
echo $?
If the exit code is 1, run:
cwltool workflows/wf-exit-1.cwl workflows/yml_patterns/wf-2.yml
If the exit code is 2, run:
cwltool workflows/wf-exit-2.cwl workflows/yml_patterns/wf-2.yml
If the exit code is 3, run:
cwltool workflows/wf-exit-3.cwl workflows/yml_patterns/wf-2.yml
Note: You can manually change the MMseqs2 parameters for protein clustering in workflows/yml_patterns/wf-2.yml.
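The three-way branch on the parser's exit code can also be scripted. The following is a minimal sketch and not part of the repository: it assumes exit codes 1, 2 and 3 map to the workflow files exactly as listed above, and the helper names second_stage_command and run_second_stage are hypothetical.

```python
import subprocess

WF2_INPUT = "workflows/yml_patterns/wf-2.yml"

# Parser exit code -> second-stage workflow, as listed in the steps above.
WORKFLOW_BY_EXIT_CODE = {
    1: "workflows/wf-exit-1.cwl",
    2: "workflows/wf-exit-2.cwl",
    3: "workflows/wf-exit-3.cwl",
}


def second_stage_command(exit_code, yml_path=WF2_INPUT):
    """Build the cwltool command line for a given parser exit code."""
    if exit_code not in WORKFLOW_BY_EXIT_CODE:
        raise ValueError(f"Unexpected parser exit code: {exit_code}")
    return ["cwltool", WORKFLOW_BY_EXIT_CODE[exit_code], yml_path]


def run_second_stage(json_path="output-wf-1.json"):
    """Run the parser, then the workflow selected by its exit code."""
    parser = subprocess.run(
        ["python3", "workflows/parser_yml.py", "-j", json_path, "-y", WF2_INPUT]
    )
    command = second_stage_command(parser.returncode)
    print("Running:", " ".join(command))
    return subprocess.run(command).returncode
```

Calling run_second_stage() from the repository root after wf-1 has finished would replace the manual echo $? check. Any exit code other than 1, 2 or 3 raises an error rather than guessing a workflow.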

Output files/folders:

checkm_quality.csv
gtdb-tk_output/
taxcheck_output/
mmseqs_output/
mash_trees/
cluster__X
cluster__...

Pipeline structure

Tool description

CheckM: Estimate genome completeness and contamination.
TaxCheck: Wrapper of the contig annotation tool (CAT) to predict taxonomy consistency across contigs.
GTDB-Tk: Genome taxonomic assignment using the GTDB framework.
dRep: Genome de-replication.
Mash2Nwk: Generate Mash distance tree of conspecific genomes.
Prokka: Predict protein-coding sequences from genome assembly.
Roary: Infer pan-genome from a set of conspecific genomes.
MMseqs2: Cluster protein-coding sequences.
InterProScan: Protein functional annotation using the InterPro database.
eggNOG-mapper: Protein functional annotation using the eggNOG database.

Part 1 (quality control, clustering and taxonomic assignment): wf-1.cwl

1.1) checkm
1.2) checkm2csv
1.3) dRep
1.4.1) GTDB-Tk
1.4.2) split_drep.py
1.5) classify_folders.py
2) taxcheck

Output:

checkm_csv
gtdbtk folder
taxcheck_dirs
Folders for the next step:
one_genome (list of clusters/folders that contain only one genome)
many_genomes (list of clusters/folders that contain more than one genome)
mash_folder (list of Mash files from the many_genomes clusters)
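The split into one_genome and many_genomes amounts to counting genomes per cluster. The sketch below illustrates that idea only; it is not the repository's classify_folders.py, and the function name, the cluster names and the input shape (a mapping from cluster name to genome file names) are assumptions made for illustration.

```python
def classify_clusters(clusters):
    """Split clusters into single-genome and multi-genome lists.

    `clusters` maps a cluster name to the list of genome files in it.
    """
    one_genome, many_genomes = [], []
    for name, genomes in sorted(clusters.items()):
        if len(genomes) == 1:
            one_genome.append(name)
        elif len(genomes) > 1:
            many_genomes.append(name)
        # Empty clusters are skipped; the source does not say how they are handled.
    return one_genome, many_genomes


# Hypothetical input: two clusters, one with a single genome.
example = {
    "cluster_a": ["g1.fa"],
    "cluster_b": ["g2.fa", "g3.fa"],
}
print(classify_clusters(example))  # (['cluster_a'], ['cluster_b'])
```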

Part 2 (functional annotation)

Check which of many_genomes and one_genome were produced by Part 1:
======> if many_genomes and one_genome present: run wf-exit-1.cwl

    **2.1. For many_genomes part**
        
        1) Prokka
        2) Roary
        3) translate nucleotide FASTA (fa) to protein FASTA (faa)
        4.1) InterProScan (IPS)
        4.2) eggNOG-mapper
        
    output: OUTPUT_MANY
     - mash_trees
     - cluster folder(-s)
     - prokka concatenated faa result
     
     
    **2.2. For one_genome part** 
    
        1) Prokka
        2.1) InterProScan (IPS)
        2.2) eggNOG-mapper
        
    output: OUTPUT_ONE
     - cluster folder(-s)
     - prokka concatenated faa result

    **2.3. Final part**   
    
        1) concatenate the Prokka faa results from many_genomes and one_genome
        2) MMseqs2
    output: OUTPUT_3
     - mmseqs folder

======> if many_genomes present BUT one_genome NOT present: run wf-exit-2.cwl
Runs steps 2.1 and 2.3.

======> if many_genomes NOT present BUT one_genome present: run wf-exit-3.cwl
Runs steps 2.2 and 2.3.
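
The branching rule for Part 2 reduces to a small decision function. This sketch relies only on what is stated in this section; the function name is hypothetical, and the case where neither list is present is not described here, so it is treated as an error by assumption.

```python
def select_part2_workflow(has_many_genomes, has_one_genome):
    """Return the Part 2 workflow file for the cluster lists found in Part 1."""
    if has_many_genomes and has_one_genome:
        return "wf-exit-1.cwl"  # steps 2.1, 2.2 and 2.3
    if has_many_genomes:
        return "wf-exit-2.cwl"  # steps 2.1 and 2.3
    if has_one_genome:
        return "wf-exit-3.cwl"  # steps 2.2 and 2.3
    # Assumption: with no clusters at all there is nothing to annotate.
    raise ValueError("Neither many_genomes nor one_genome is present")


print(select_part2_workflow(True, False))  # wf-exit-2.cwl
```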

Part 3 (clean-up)

Copies all relevant output to one result folder.

