MetaCRISPR is a tool for assembling CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats and Associated Proteins) from metagenomic sequencing data without relying on generic assembly, which is error-prone and computationally expensive for complex data. It can run on commonly available machines in small labs. It exploits properties of CRISPRs to decompose generic assembly into local assembly.
To run MetaCRISPR, one needs to have the following tools/packages installed:
- Java.
- Maven. This is needed to build the Java package for read identification.
- Python
- networkx
- Genometools. MetaCRISPR uses readjoiner to construct overlap graph from reads.
- Bowtie.
For Linux users, these tools and packages can be installed with the distribution's package manager (e.g. apt-get on Ubuntu). For macOS users, they can be installed using Homebrew.
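Before building, it can help to confirm that the prerequisites are visible on the current machine. The following is a minimal sketch, not part of MetaCRISPR. The function name `find_missing` is hypothetical, and the executable names are assumptions: `gt` is the usual Genometools binary, `mvn` is the Maven launcher, and `python` may be called `python3` on some systems.

```python
import importlib.util
import shutil

# Executable names are assumptions; adjust them to match the local installation.
REQUIRED_EXECUTABLES = ["java", "mvn", "python", "gt", "bowtie"]
REQUIRED_MODULES = ["networkx"]


def find_missing(executables=REQUIRED_EXECUTABLES, modules=REQUIRED_MODULES):
    """Return the executables and Python modules that cannot be found."""
    # shutil.which returns None when the executable is not on PATH.
    missing = [exe for exe in executables if shutil.which(exe) is None]
    # find_spec returns None when a top-level module is not importable.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```

An empty result only means the names were found; it does not check versions.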
In the ReadRecuiter folder, run:
mvn package
This will generate a jar file in the ReadRecuiter/target folder. The jar can be used to identify CRISPR reads in metagenomic read data.
The MetaCRISPR pipeline has three major steps: identifying CRISPR reads, clustering CRISPR reads, and assembly.
java -cp ReadRecuiter-0.0.1-SNAPSHOT-jar-with-dependencies.jar crispr.ReadRecuiter -input test.fa -minRepeat 23 -maxRepeat 60 -minSpacer 20 -maxSpacer 80 -threads 4 -prefix test -maxMismatch 1 -repeats repeats.db..txt
This command identifies CRISPR reads in the input read file test.fa. The command line options are:
- minRepeat: minimum repeat size.
- maxRepeat: maximum repeat size.
- minSpacer: minimum spacer size.
- maxSpacer: maximum spacer size.
- threads: number of threads to use for read identification.
- maxMismatch: maximum number of mismatches allowed in repeat sequences.
- prefix: prefix for output files.
- repeats: a file that contains known CRISPR direct repeats. Each line contains the sequence of one known repeat. This option is optional.
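When the identification step is run from a script, the options above can be assembled into an argument list. This is a minimal sketch: the function `build_identify_command` and its parameter names are hypothetical, the default values are simply the ones from the example command (they are not claimed to be the tool's own defaults), and only flags that appear in the example are used.

```python
def build_identify_command(jar, reads, prefix, threads=4,
                           min_repeat=23, max_repeat=60,
                           min_spacer=20, max_spacer=80,
                           max_mismatch=1, repeats=None):
    """Build the argument list for the CRISPR read identification step."""
    cmd = [
        "java", "-cp", jar, "crispr.ReadRecuiter",
        "-input", reads,
        "-minRepeat", str(min_repeat),
        "-maxRepeat", str(max_repeat),
        "-minSpacer", str(min_spacer),
        "-maxSpacer", str(max_spacer),
        "-threads", str(threads),
        "-prefix", prefix,
        "-maxMismatch", str(max_mismatch),
    ]
    # The file of known repeats is optional, so add it only when given.
    if repeats is not None:
        cmd += ["-repeats", repeats]
    return cmd


if __name__ == "__main__":
    command = build_identify_command(
        "ReadRecuiter-0.0.1-SNAPSHOT-jar-with-dependencies.jar",
        "test.fa", "test")
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with unusual file names.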
If the identification program is run with N threads, it generates N sets of files whose names all start with the user-specified prefix. Before running the next step, concatenate the N files named prefix.{0..N-1}.filtered.fa into a single file:
cat prefix.*.filtered.fa > filtered.reads.fa
This file contains all filtered reads.
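The same concatenation can be done from Python, for example inside a larger driver script. This is a minimal sketch under the naming scheme described above (prefix.<i>.filtered.fa); the function name `concatenate_filtered` is hypothetical. Unlike the shell glob, it orders the parts by thread index numerically, although read order should not matter for a FASTA file.

```python
import glob
import re
import shutil


def concatenate_filtered(prefix, output="filtered.reads.fa"):
    """Concatenate prefix.<i>.filtered.fa files in numeric order of <i>."""
    pattern = re.compile(re.escape(prefix) + r"\.(\d+)\.filtered\.fa")
    parts = []
    for path in glob.glob(glob.escape(prefix) + ".*.filtered.fa"):
        match = pattern.fullmatch(path)
        # Skip files whose middle component is not a thread index.
        if match:
            parts.append((int(match.group(1)), path))
    parts.sort()
    with open(output, "wb") as out:
        for _, path in parts:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
    # Return the input files in the order they were written.
    return [path for _, path in parts]
```

For the example run, `concatenate_filtered("test")` would write filtered.reads.fa in the current directory.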
python scripts/run-rj-on-filtered-reads.py -b -t1 4 -t2 4 -o rj filtered.reads.fa mock 80
This will write all clustering results to the folder rj (set by the -o option). The -t1 option specifies the number of threads used to run readjoiner, and -t2 specifies the number of threads used to run bowtie.
There are three required parameters:
- Filtered reads FASTA file (filtered.reads.fa in the example).
- Name of the read set for Readjoiner. One can think of this as an identifier for the sample the data comes from (mock in the example).
- Overlap size used to build the overlap graph (80 in the example).
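To illustrate what the overlap size controls, the sketch below builds a naive overlap graph and reports its connected components as clusters. It is a conceptual toy, not Readjoiner's algorithm and not the clustering script's actual logic: it compares all read pairs, requires exact suffix-prefix matches, and ignores reverse complements. The function names `has_overlap` and `overlap_clusters` are hypothetical.

```python
from collections import defaultdict


def has_overlap(a, b, min_overlap):
    """True if a suffix of a, at least min_overlap long, equals a prefix of b."""
    longest = min(len(a), len(b))
    for length in range(longest, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return True
    return False


def overlap_clusters(reads, min_overlap):
    """Group reads (name -> sequence) into connected components.

    min_overlap is expected to be at least 1.
    """
    graph = defaultdict(set)
    names = list(reads)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            # Add an undirected edge if the reads overlap in either order.
            if (has_overlap(reads[x], reads[y], min_overlap)
                    or has_overlap(reads[y], reads[x], min_overlap)):
                graph[x].add(y)
                graph[y].add(x)
    seen, clusters = set(), []
    for name in names:
        if name in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [name], []
        seen.add(name)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    toy = {"r1": "ACGTACGTAA", "r2": "ACGTAATTGG", "r3": "TTTTTTTTTT"}
    print(overlap_clusters(toy, 4))  # r1 and r2 share the 6-base overlap ACGTAA
    print(overlap_clusters(toy, 7))  # no pair overlaps by 7 or more bases
```

In this toy, raising the overlap size removes edges from the graph, so clusters can only split or stay the same, never merge.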
After generating the clusters, use the following commands to filter them:
mkdir clusters
cd clusters
python scripts/combine-and-filter-clusters.py ../rj/bowtie/bowtie.result.sam ../rj/rj/mock.clusters.0.txt ../rj/rj/mock.reads.filtered.fa ../rj/rj/mock.graph.rename.txt ../filtered.reads.fa
This writes the reads belonging to each cluster to the clusters folder.
python scripts/run-rj-each-cluster.py clusters scripts/run-rj-on-filtered-reads.py cluster-each-rj
python scripts/crisprfinder-rj.py clusters rj/mock.readgroups.txt cluster-each-rj/rj 30 50
Here clusters is the folder that contains the FASTA sequences of each cluster generated in the previous step, 30 is the overlap size used to build the overlap graph in the initial assembly step, and 50 is the confident edge size.