doc/cdhit-user-guide.wiki

usecases/Miseq-16S/NG-Omics-Miseq-16S.pl clean up some local setting usecases/Miseq-16S/NG-Omics-WF.pl copy from ngomicswf for easy package
yesimon · Apr 21, 2017 · 83b7726 · 83b7726
1 parent 792cc40
commit 83b7726
Show file tree

Hide file tree

Showing 4 changed files with 1,281 additions and 21 deletions.
diff --git a/doc/cdhit-user-guide.wiki b/doc/cdhit-user-guide.wiki
@@ -4,7 +4,7 @@ Last updated: ~~LASTMOD~~
 
 [[http://cd-hit.org]]
 
-Program developed by Weizhong Li's lab at UCSD [[http://weizhong-lab.ucsd.edu]] and JCVI [[http://jcvi.org]] [[[email protected]]]
+Program developed by Weizhong Li's lab at UCSD [[http://weizhongli-lab.org]] and JCVI [[http://jcvi.org]] [[[email protected]]]
 
 ===== Introduction =====
 
@@ -200,7 +200,8 @@ Alignment coverage control:
 
 See the figure below, the -aL, -AL, -aS and -AS options can be used to specify the alignment coverage on both the representative sequence and other sequences. -s and -S can control the length difference between the representative sequence and other sequences.
 
-{{ :Figure2.png }}
+{{ :cd-hit-figure2.png }}
+
 
 ''
 aL = R<sub>a</sub> / R\\
@@ -423,7 +424,7 @@ Implementation (see figure below)
   - repeat cd-hit and cd-hit-2d runs till done
   - Combine the results
 
-{{ :Figure3.png }}
+{{ :cd-hit-figure3.png }}
 
 Basic command:
   cd-hit-para.pl -i nr90 -o nr60 -c 0.6 -n 4 --B hosts --S 64
@@ -506,8 +507,12 @@ stable cluster structure.
 
 With multiple-step, iterated runs of CD-HIT, you perform a clustering in a
 neighbor-joining method, which generates a hierarchical structure. The third step use psi-cd-hit, please see psi-cd-hit section for details. 
+
+This way is faster than one-step clustering. It can also be more accurate. 
+
+There is a problem with one-step clustering. Two very similar sequences A and B may be clustered into different clusters. For example, let the clustering threshold to be 60%, IAB (identity of AB) = 95%, IAC ≥ 60%, but IBC < 60%. If C was first selected a cluster representative, then A will be in cluster “C”, but “B” will not, resulting near identical AB to be in different clusters. Hierarchically clustering will reduce this problem.
 
-{{ :Figure4.png }}
+{{ :cd-hit-figure4.png }}
 
 Commands:
   cd-hit -i nr -o nr80 -c 0.8 -n 5 -d 0 -M 16000 -T 16
@@ -525,8 +530,7 @@ nr60.clstr only lists sequences from nr80, script clstr_rev.pl add the original
   clstr_rev.pl nr80-60.clstr nr30.clstr > nr80-60-30.clstr
 nr30.clstr only lists sequences from nr60, script clstr_rev.pl add the original sequences into file nr80-60-30.clstr
 
-This way is faster than one-step run from nr directly to nr30. It can also
-more accurate.
+
 
 
 
@@ -1164,7 +1168,80 @@ server performs hierarchical clustering up to 3 steps.
 
 The CD-HIT-454 web server is also available from [[http://cd-hit.org]]. 
 
-
+===== Use cases =====
+Here, a use case is defined as a sequence clustering related problem or application that cannot be easily solved with existing clustering approaches, such as CD-HIT. However, it is feasible to solve such a use case by customizing current clustering algorithms or utilizing current approach in a very intelligent way or non-standard manner. In the last years, we have developed many use cases in addressing various problems. We will release these use cases after additional testing. These use cases will be described in the following chapters.
+
+===== CD-HIT-OTU-MiSeq =====
+This use case is developed for clustering 16S rRNA genes into OTUs for microbiome studies. In recent years, Illumina MiSeq sequencers became dominant in 16S rRNA sequencing. The Paired End (PE) reads need to be assembled first. However many reads can not be accurately assembled because the poor quality at the 3’ ends of both PE reads in the overlapping region. This causes that many sequences are discarded in the analysis. CD-HIT-OTU-MiSeq has unique features to cluster MiSeq 16S sequences. 
+  - The package can clustering PE reads without joining them into contigs. 
+  - Users can choose a high quality portion of the PE reads for analysis (e.g. first 200 / 150 bases from forward / reverse reads), according to base quality profile. 
+  - We implemented a tool that can splice out the target region (e.g. V3-V4) from a full-length 16S reference database into the PE sequences. CD-HIT-OTU-MiSeq can cluster the spliced PE reference database together with samples, so we can derive Operational Tax-onomic Units (OTUs) and annotate these OTUs concurrently. 
+  - Chimeric sequences are effectively identified through both de novo and reference-based approaches. 
+
+The most important unique feature of CD-HIT-OTU-MiSeq is to only use high quality region at the 5’ ends of R1 and R2 reads. For example, the effective read length can be 200 bases for R1 and 150 bases for R2. The effective portions of PE reads are clustered together with spliced PE sequences from the reference database to derive OTUs (Figure).
+
+{{:cd-hit-otu-miseq-figure-1.png?300|}}
+
+==== Installation ====
+First download and install full cd-hit package
+  * download current CD-HIT at [[https://github.com/weizhongli/cdhit/releases]], for example cd-hit-v4.6.2-2015-0511.tar.gz
+  * unpack the file with " tar xvf cd-hit-v4.6.2-2015-0511.tar.gz --gunzip"
+  * change dir by "cd cd-hit-v4.6.2-2015-0511"
+  * compile the programs by "make" with multi-threading (default), or by "make openmp=no" without multi-threading (on old systems without OpenMP)
+  * cd cd-hit-auxtools
+  * compile cd-hit-auxtools by "make"
+  * CD-HIT-OTU-MiSeq scripts are inside a folder like cd-hit-v4.6.2-2015-0511/usecases/Miseq-16S
+
+CD-HIT-OTU-MiSeq uses Trimmomatic for sequence quality control. It can be downloaded from [[http://www.usadellab.org/cms/?page=trimmomatic]] or [[https://github.com/timflutre/trimmomatic]]. We also have a copy at [[http://weizhongli-lab.org/download-data/cd-hit-otu-miseq/]].
+
+  * modify NG-Omics-Miseq-16S.pl
+Please edit usecases/Miseq-16S/NG-Omics-Miseq-16S.pl, in the top few lines:
+  $CD_HIT_dir   = "PATH_to_cd-hit";
+  $NGS_prog_trimmomatic = "PATH/trimmomatic-0.32.jar";  #### where you have installed Trimmomatic
+
+==== Download reference and sample datasets ====
+Reference database and sample datasets can be downloaded from [[http://weizhongli-lab.org/download-data/cd-hit-otu-miseq/]].
+
+The reference database Greengene-13-5-99.fasta.gz was processed from original Greengene database, so that sequences with more specific annotations are at the beginning of the file. You need to download and gunzip it.
+
+You can also download Greengene and generate it. You should download Greengene from [[http://greengenes.secondgenome.com/downloads]], or [[ftp://greengenes.microbio.me/]]. Please download file like greengenes_release/gg_13_5/gg_13_5_otus.tar.gz, unpack the tar file. You may find gg_13_5_otus/taxonomy/99_otu_taxonomy.txt and gg_13_5_otus/rep_set/99_otus.fasta. There is a script: usecases/Miseq-16S/greengene-ann1.pl.  
+
+Commands:
+  /greengene-ann1.pl -i gg_13_5_otus/taxonomy/99_otu_taxonomy.txt -j gg_13_5_otus/rep_set/99_otus.fasta -o Greengene-13-5-99.fasta
+
+The Miseq-otu-example.tar.gz contains two Miseq 16S samples. You can download and unpack to test.
+
+==== Usage ====
+
+**Step 1. prepare fastq files and sample file:** Most projects have multiple samples sequenced at the same region. You should already have paired ended fastq files for these samples, put them in a working directory in similar way as the testing datasets, where the R1.fq and R2.fq are placed in separate folder for each sample. So in the working directory, you should have files:
+  sample_name_1/R1.fq
+  sample_name_1/R2.fq
+  sample_name_2/R1.fq
+  sample_name_2/R2.fq
+  ...
+  sample_name_N/R1.fq
+  sample_name_N/R2.fq
+  
+Then, please prepare a sample file in the working directory. The file should look like:
+  sample_name_1 R1.fq R2.fq
+  sample_name_2 R1.fq R2.fq
+  sample_name_N R1.fq R2.fq
+
+**Step 2. Reference database preparation:** We implemented a tool that can splice out the target amplicon region (e.g. V3-V4) from a full-length 16S rRNA reference sequence database, such as Greengene, RDP and Silva, into PE sequences. If there are multiple samples in a project sequenced with the same amplicon of same variable region, only one spliced reference database is needed. To run:
+
+  path_to_cd-hit_dir/usecases/Miseq-16S/16S-ref-db-PE-splice.pl -i sample_name_1/R1.fq -j sample_name_2/R2.fq  -d Greengene-13-5-99.fasta -o gg_13_5-PE99.150-100 -p 150 -q 100 -c 0.99
+Where Greengene-13-5-99.fasta is our re-formatted Greengene sequence file. This program will output spliced PE files gg_13_5-PE99.150-100-R1 and gg_13_5-PE99.150-100-R2.
+
+**Step 3. Run sequence QC and OTU clustering for each sample:**. In the working directory, run
+  PATH_to_cd-hit-dir/usecases/NG-Omics-WF.pl -i PATH_to_cd-hit-dir/usecases/NG-Omics-Miseq-16S.pl -s sample_file -T otu:150:100:0.97:0.0001:PATH_to-gg_13_5-PE99.150-100-R1:PATH_to-gg_13_5-PE99.150-100-R2:75 -J write-sh
+where: 150 and 100 are the effective length, 0.97 is the OTU clustering cutoff, 0.00001 is the abundance cutoff, 75 is the length for chimeric checking at each R1 and R2 read
+
+This command will generate shell scripts for QC and for OTU for each sample. The scripts will be in WF-sh folder. You can first run the qc.sample_name.sh and then run otu.sample_name.sh 
+
+
+
+
+
 ===== References =====
 
 If you find cd-hit helpful to your research and study, please kindly cite the

diff --git a/usecases/Miseq-16S/NG-Omics-Miseq-16S.pl b/usecases/Miseq-16S/NG-Omics-Miseq-16S.pl
@@ -4,14 +4,9 @@
 ################################################################################
 
 ########## local variables etc. Please edit
-$NGS_root     = "/home/oasis/data/NGS-ann-project";
-$CD_HIT_dir   = "/home/oasis/data/etc/git/cdhit";
+$CD_HIT_dir           = "/home/oasis/data/etc/git/cdhit";
+$NGS_prog_trimmomatic = "/home/oasis/data/NGS-ann-project/apps/Trimmomatic/trimmomatic-0.32.jar";
 
-########## more local variables, do not edit next three lines
-$NGS_tool_dir = "$NGS_root/NGS-tools";
-$NGS_prog_dir = "$NGS_root/apps";
-$NGS_bin_dir  = "$NGS_root/apps/bin";
-$NGS_ref_dir  = "$NGS_root/refs";
 
 ########## computation resources for execution of jobs
 %NGS_executions = ();
@@ -27,10 +22,7 @@
   "template"            => <<EOD,
 #!/bin/sh
 #PBS -v PATH
-#PBS -M liwz\@sdsc.edu
-#PBS -q normal
 #PBS -V
-#PBS -l nodes=1:ppn=16,walltime=48:00:00,mem=60000mb
 
 #\$ -v PATH
 #\$ -V
@@ -41,7 +33,7 @@
 
 $NGS_executions{"sh_1"} = {
   "type"                => "sh",
-  "cores_per_node"      => 32,
+  "cores_per_node"      => 8,
   "number_nodes"        => 1,
 };
 
@@ -51,7 +43,7 @@
   "cores_per_cmd"     => 4,                     # number of threads used by command below
   "no_parallel"       => 1,                     # number of total jobs to run using command below
   "command"           => <<EOD,
-java -jar $NGS_prog_dir/Trimmomatic/trimmomatic-0.32.jar PE -threads 4 -phred33 \\DATA.0 \\DATA.1 \\SELF/R1.fq \\SELF/R1-s.fq \\SELF/R2.fq \\SELF/R2-s.fq \\
+java -jar NGS_prog_trimmomatic PE -threads 4 -phred33 \\DATA.0 \\DATA.1 \\SELF/R1.fq \\SELF/R1-s.fq \\SELF/R2.fq \\SELF/R2-s.fq \\
     SLIDINGWINDOW:4:20 LEADING:3 TRAILING:3 MINLEN:\\CMDOPTS.0 MAXINFO:80:0.5 1>\\SELF/qc.stdout 2>\\SELF/qc.stderr
 
 perl -e '\$i=0; while(<>){ if (/^\@/) {\$i++;  print ">Sample|\\SAMPLE|\$i ", substr(\$_,1); \$a=<>; print \$a; \$a=<>; \$a=<>;}}' < \\SELF/R1.fq > \\SELF/R1.fa &
@@ -102,8 +94,6 @@
 EOD
 };
 
-
-
 ##############################################################################################
 ########## END
 1;