MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data
- Update
- Overview
- System Requirements
- Installation Guide
- A test dataset to demo MetaCC
- Instructions for processing raw data
- MetaCC analysis
- Instructions for reproducing the results in the MetaCC paper
- Contacts and bug reports
- Copyright and License Information
- Issues
- v1.2.0 (05/2024): Removed the dependency of NormCC on R scripts
MetaCC is an efficient and integrative framework for analyzing both long-read and short-read metaHi-C datasets. In the MetaCC framework, raw metagenomic Hi-C contacts are first normalized by a new normalization method, NormCC. Leveraging the NormCC-normalized Hi-C contacts, the binning module in MetaCC enables the retrieval of high-quality MAGs and supports downstream analyses.
- If you want to reproduce results in our MetaCC paper, please read our instructions here.
- Some scripts to process the intermediate data and plot figures of our MetaCC paper are available here.
MetaCC requires only a standard computer with enough RAM to support the in-memory operations.
MetaCC v1.2.0 is supported and tested on macOS and Linux.
MetaCC mainly depends on the Python scientific stack:
numpy
scipy
pysam
scikit-learn
pandas
Biopython
igraph
leidenalg
statsmodels
We recommend using conda to install MetaCC. Typical installation time is 1-5 minutes, depending on your system.
git clone https://github.com/dyxstat/MetaCC.git
Once complete, enter the repository folder and create a MetaCC environment using conda.
cd MetaCC
Since MetaCC executes external software located in the Auxiliary folder, you may need to run the following command to make sure these tools are executable:
chmod +x Auxiliary/test_getmarker.pl Auxiliary/FragGeneScan/FragGeneScan Auxiliary/FragGeneScan/run_FragGeneScan.pl Auxiliary/hmmer-3.3.2/bin/hmmsearch
conda env create -f MetaCC_env.yaml
conda activate MetaCC_env
We provide a small dataset, located under the Test directory, to test the software:
python ./MetaCC.py test
Follow the instructions in this section to process the raw shotgun and Hi-C data and generate the input for the MetaCC framework:
Adaptor sequences are removed by bbduk from the BBTools suite with parameters ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo, and reads are quality-trimmed using bbduk with parameters trimq=10 qtrim=r ftm=5 minlen=50. Additionally, the first 10 nucleotides of the Hi-C reads are trimmed by bbduk with parameter ftl=10. Identical PCR, optical, and tile-edge duplicates of the Hi-C reads are removed with the clumpify.sh script from the BBTools suite.
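For illustration, a minimal sketch of this read-cleaning step (not part of MetaCC itself), assuming paired Hi-C files hic_read1.fastq.gz/hic_read2.fastq.gz and an adapter reference adapters.fa shipped with BBTools; all file names are placeholders:
# Adapter removal
bbduk.sh in1=hic_read1.fastq.gz in2=hic_read2.fastq.gz out1=hic_trim1.fastq.gz out2=hic_trim2.fastq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo
# Quality trimming
bbduk.sh in1=hic_trim1.fastq.gz in2=hic_trim2.fastq.gz out1=hic_qc1.fastq.gz out2=hic_qc2.fastq.gz trimq=10 qtrim=r ftm=5 minlen=50
# Trim the first 10 nucleotides of the Hi-C reads
bbduk.sh in1=hic_qc1.fastq.gz in2=hic_qc2.fastq.gz out1=hic_clean1.fastq.gz out2=hic_clean2.fastq.gz ftl=10
# Remove PCR/optical duplicates (tile-edge handling may need extra clumpify options)
clumpify.sh in1=hic_clean1.fastq.gz in2=hic_clean2.fastq.gz out1=hic_dedup1.fastq.gz out2=hic_dedup2.fastq.gz dedupe optical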
For the shotgun library, a de novo metagenome assembly is produced by an assembler such as MEGAHIT.
megahit -1 SG1.fastq.gz -2 SG2.fastq.gz -o ASSEMBLY --min-contig-len 1000 --k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95
Hi-C paired-end reads are aligned to the assembled contigs using a DNA mapping tool such as BWA-MEM. Then, samtools with parameters 'view -F 0x904' is applied to remove unmapped reads, supplementary alignments, and secondary alignments. Finally, the BAM file needs to be sorted by read name using 'samtools sort -n'.
bwa index final.contigs.fa
bwa mem -5SP final.contigs.fa hic_read1.fastq.gz hic_read2.fastq.gz > MAP.sam
samtools view -F 0x904 -bS MAP.sam > MAP_UNSORTED.bam
samtools sort -n MAP_UNSORTED.bam -o MAP_SORTED.bam
Since raw metagenomic Hi-C contacts are biased, the MetaCC pipeline provides a comprehensive and scalable normalization module, NormCC, to eliminate systematic biases in the Hi-C contacts, which significantly benefits downstream analyses.
python /path_to_MetaCC/MetaCC.py norm [Parameters] FASTA_file BAM_file OUTPUT_directory
-e (required): Case-sensitive enzyme name. Use multiple times for multiple enzymes
--min-len: Minimum acceptable contig length (default 1000)
--min-mapq: Minimum acceptable alignment quality (default 30)
--min-match: Accepted alignments must be at least N matches (default 30)
--min-signal: Minimum acceptable signal (default 2)
--thres: fraction of normalized Hi-C contacts to discard (default 0.05, i.e., the lowest 5% of normalized Hi-C contacts are discarded as spurious)
--cover (optional): overwrite existing files. Otherwise, an error is raised if an output file already exists.
-v (optional): Verbose output about more specific details of the procedure.
- FASTA_file: a fasta file of the assembled contigs (e.g. Test/final.contigs.fa)
- BAM_file: a bam file of the Hi-C alignment (e.g. Test/MAP_SORTED.bam)
- contig_info.csv: information about the assembled contigs, with three columns (contig name, number of restriction sites on the contig, and contig length).
- Normalized_contact_matrix.npz: a sparse matrix of the normalized Hi-C contact map in CSR format; it can be reloaded with the Python command scipy.sparse.load_npz('Normalized_contact_matrix.npz') (see the example after the run command below).
- NormCC_normalized_contact.gz: the normalized contacts and contig information, compressed with pickle. This file further serves as the input of the MetaCC binning module.
- MetaCC.log: the log file of the NormCC normalization module.
python ./MetaCC.py norm -e HindIII -e NcoI -v final.contigs.fa MAP_SORTED.bam out_directory
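To sanity-check the normalization outputs (here assuming out_directory as in the example above; paths are placeholders), the contact matrix and contig table can be inspected as follows:
# Reload the normalized contact matrix and print its dimensions and number of non-zero contacts
python -c "import scipy.sparse as sp; m = sp.load_npz('out_directory/Normalized_contact_matrix.npz'); print(m.shape, m.nnz)"
# Peek at the contig information table
head -n 3 out_directory/contig_info.csv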
The MetaCC binning module relies on the NormCC-normalized Hi-C contacts and thus must be run after the NormCC normalization module.
python /path_to_MetaCC/MetaCC.py bin --cover [Parameters] FASTA_file OUTPUT_directory
--min-binsize: Minimum bin size used in output (default 150,000)
--num-gene (optional): Number of marker genes detected. If not provided, the number of marker genes is detected automatically.
--random-seed (optional): seed for the Leiden clustering. If not provided, a random seed is used.
-v (optional): Verbose output about more specific details of the procedure.
- FASTA_file: a fasta file of the assembled contigs (e.g. Test/final.contigs.fa)
- OUTPUT_directory: please make sure that the output directory of the MetaCC binning module is the same as that of the NormCC normalization module.
- BIN: folder containing the fasta files of draft genomic bins
- MetaCC.log: the log file of the MetaCC binning module
python ./MetaCC.py bin --cover -v final.contigs.fa out_directory
Draft genomic bins are assessed using CheckM2 (or CheckM). The post-processing step of the MetaCC binning module is then applied to partially contaminated bins, i.e., bins with completeness larger than 50% and contamination larger than 10%, in order to purify them. A sketch of the assessment and selection step is shown below.
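For illustration (CheckM2 is not bundled with MetaCC), the assessment and selection of partially contaminated bins might look like the following, assuming Name, Completeness, and Contamination are the first three columns of CheckM2's quality_report.tsv and out_directory/BIN holds the draft bins:
# Assess the draft bins produced by the binning module
checkm2 predict --threads 8 -x fa --input out_directory/BIN --output-directory checkm2_out
# Keep bins with completeness > 50% and contamination > 10%; strip a trailing .fa if present
awk -F'\t' 'NR > 1 && $2 > 50 && $3 > 10 {print $1}' checkm2_out/quality_report.tsv | sed 's/\.fa$//' > contaminated_bins.csv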
python /path_to_MetaCC/MetaCC.py postprocess --cover [Parameters] FASTA_file Contaminated_Bins_file OUTPUT_directory
--min-binsize: Minimum bin size used in output (default 150,000)
-v (optional): Verbose output about more specific details of the procedure.
- FASTA_file: a fasta file of the assembled contigs (e.g. Test/final.contigs.fa).
- Contaminated_Bins_file: a csv file with the names of the partially contaminated bins, one name per line; do not include the .fa extension at the end of each name.
- OUTPUT_directory: please make sure that the output directory of the post-processing step is the same as in the previous steps of the MetaCC pipeline.
Example of a Contaminated_Bins_file:
BIN0001
BIN0003
BIN0005
...
- BIN: folder containing the fasta files of draft genomic bins after cleaning partially contaminated bins.
- MetaCC.log: the log file of the post-processing step of the MetaCC binning module.
python ./MetaCC.py postprocess --cover -v final.contigs.fa contaminated_bins.csv out_directory
If you have any questions or suggestions, feel free to contact Yuxuan Du ([email protected]).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.