Updated code to version to submit to Oxford Bioinformatics

hallamlab · Aug 3, 2015 · aeba3f5 · aeba3f5
1 parent b37cbf3
commit aeba3f5
Show file tree

Hide file tree

Showing 9 changed files with 104 additions and 64 deletions.
diff --git a/lca_star_analysis/LCAStar.Rmd b/lca_star_analysis/LCAStar.Rmd
@@ -35,26 +35,32 @@ theme_set(theme_bw())
 
 # Overview
 
-This document summarizes the analysis performed in "LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes" describing the use of the simulation the Small and Large metagenomes from a coolection of genomes downloaded from the NCBI, the running of the MetaPathways pipeline [[@Konwar:2013cw], [@Hanson2014]], use of the `LCAStar.py` Python script for the calculation of the $LCAStar$, $Simple Majority$, $LCA^2$ taxonomic estimation methods. 
+This document summarizes the analysis performed in "LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes", describing ths construction of the Small and Large metagenomes from a coolection of genomes downloaded from the NCBI, the running of the MetaPathways pipeline [[@Konwar:2013cw], [@Hanson2014]], use of the `LCAStar.py` Python library for the calculation of the $LCAStar$, $Simple Majority$, $LCA^2$ taxonomic estimates for assembled metagenomic contigs.
 
-Each simulation was run through the MetaPathways v2.0 pipeline [[@Hanson2014]], having its ORFs predicted and annotated against the RefSeq database using standard settings (Version 62, BSR=0.4, E-value=1e-5, Bit-score=20) [[@Pruitt:2007fi]]. Taxonomy for each ORF was annotated using the Lowest Common Ancestor (LCA) (or Best-BLAST in the GEBA SAGs), and contig taxonomy was predicted from these taxonomic three different methods: $LCA^2$, $Simple Majority$, and our information-theoretic $LCAStar$. $LCA^2$ simply applies the LCA method again to the set of taxonomic ORF annotations on a contig. $Simple Majority$ ascribes the taxonomy of the contig to the taxonomy that has the greatest number of annotations. Our $LCA*$ method applies the information-theoetic result and algorithm previously described with a majority theshold set to the default majority ($\alpha=0.51$).
+Each simulation was run through the MetaPathways v2.0 pipeline [[@Hanson2014]], having its ORFs predicted and annotated against the RefSeq database using standard settings (Version 62, BSR=0.4, E-value=1e-5, Bit-score=20) [[@Pruitt:2007fi]]. Taxonomy for each ORF was annotated using the Lowest Common Ancestor (LCA) (or Best-BLAST in the GEBA SAGs), and contig taxonomy was predicted using three different methods: LCASquared, Simple Majority, and our information-theoretic LCAStar. LCASquared simply applies the LCA method again to the set of taxonomic ORF annotations on a contig. Simple Majority ascribes the taxonomy of the contig to the taxonomy that has the greatest number of annotations. Our LCA* method applies our information-theoetic result and algorithm previously described with a the default majority theshold ($\alpha=0.51$).
 
 We evaluated the performance of these predictions using two taxonomic distances on the NCBI taxonomic database hierarchy (NCBI Tree) [[@Federhen:2012jx]]. The first is a Simple Walk of the the NCBI Tree from the observed predicted taxonomy to the original expected taxonomy. The second is a weighted taxonomic distance (WTD) that weightes each edge proportional to $\frac{1}{d}$ where $d$ is the depth of the edge in the tree (see [[@Hanson:2014bz], [@Konwar:2015vh]]) for more details. The NCBI Tree was slightly modified with the addition of a *prokaryotes* node as a parent of *Bacteria* and *Archaea* nodes.
 
+## Materials and Downloads
+
+Up-to-date code is available on the [GitHub repository](https://github.com/hallamlab/lcastar). Data required for this analysis can be found: [lca_star_data.zip](https://www.dropbox.com/s/ew5v55eroa332mu/lca_star_data.zip?dl=0)
+
 # Simulations
 
-Two simulated metagenomes were created, Small and Large, using 10,000bp contigs randomly sampled from a collection of 2617 genomes obtained from the NCBI (Downloaded March 15 2014 <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/>) using the python script [subsample_ncbi.py](). The first simulation, Small, is a smaller sample of 100 genomes, sampling 10 random reads from each genome. The second simulation, Large, samples a random subset of 2,000 genomes, sampling 10 random reads from each genome.
+Two simulated metagenomes, Small and Large, were created by randomly sampling 10,000bp `contigs' from a collection of 2617 genomes obtained from the NCBI (Downloaded March 15 2014 <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/>) using the Python script [subsample_ncbi.py](../python_resources/subsample_ncbi.py). The first simulation, Small, is a smaller sample of 100 genomes, sampling 10 random reads from each genome. The second simulation, Large, samples a random subset of 2,000 genomes, sampling 10 random reads from each genome.
 
 ## Obtaining NCBI Genomes
 
 * Downloaded `all.fna.tar.gz` and `summary.txt` from the NCBI's ftp server: <ftp://ftp.ncbi.nlm.nih.gov/genomes/> Mar 12, 2014
-* `summary.txt` contains some metadata for a number of these genomes:
+* The file `summary.txt` contains some metadata for a number of these genomes:
 
 ```
 Accession  GenbankAcc  Length	Taxid	ProjectID	TaxName	Replicon	Create Date	Update_Date
 NC_000907.1	L42023.1	1830138	71421	57771	Haemophilus influenzae Rd KW20	chromosome 	Oct 19 2001	Sep 11 2013  4:30:20:076PM
 ```
 
+* Extract NCBI IDs
+
 ```
 cat summary.txt  | awk '{print $1}' > u
 for s in `cat u`; do var1=`find . -name *${s%.[0-9]}*`; echo $s$'\t'$var1 >> ncbiID_to_file; done
@@ -92,7 +98,7 @@ NC_015557.1
 NC_015587.1
 ```
 
-* this leaves a total of 2617 genomes in `ncbiID_to_file.txt`, which maps the NCBI IDs to the .fna file location
+* This leaves a total of 2617 genomes in `ncbiID_to_file.txt`, which maps the NCBI IDs to the .fna file location
 
 ```
 grep --perl-regexp ".*fna" ncbiID_to_file | wc
@@ -104,7 +110,7 @@ grep --perl-regexp ".*fna" ncbiID_to_file > ncbiID_to_file.txt
 
 ## Simulation
 
-Wrote my own script `subsample_ncbi.py` to quickly sample from a collection of fasta files specified in our `ncbiID_to_file.txt` that we created above. We use this to compute the `test1` or Small and the `test2` or Large simulated metagenomes.
+The script [subsample_ncbi.py](../python_resources/subsample_ncbi.py) was used to quickly sample from a collection of fasta files specified in our `ncbiID_to_file.txt`. This script creates the `test1` or Small and the `test2` or Large simulated metagenomes.
 
 ### Small simulation
 
@@ -128,7 +134,7 @@ Creating our Large simulated contigs file: `lca_star_test2.fasta`
 
 # GEBA SAGs
 
-The 201 Single-cell amplified genome (SAG) microbial dark matter assembies were obtained from <https://img.jgi.doe.gov> under the Study Name `GEBA-MDM` [[@Rinke:2013bt]]. Combined assemblies were excluded from this analysis.
+The 201 Single-cell amplified genome (SAG) microbial 'dark matter' assembies were obtained from <https://img.jgi.doe.gov> under the Study Name `GEBA-MDM` [[@Rinke:2013bt]]. Combined assemblies were excluded from this analysis.
 
 # MetaPathways Annotation
 
@@ -149,11 +155,11 @@ The script [run_lca_star_analysis.sh](run_lca_star_analysis.sh) runs the Small,
 
 * Note: base paths may need to be updated
 
-The outputs of these runs can be found in [lca_star_data/](lca_star_data/):
+The outputs of these runs can be found in [lca_star_data.zip](https://www.dropbox.com/s/ew5v55eroa332mu/lca_star_data.zip?dl=0):
 
-* [test1_lcastar.txt](lca_star_data/test1_lcastar.txt): LCAStar output for the Small simulation
-* [test2_lcastar.txt](lca_star_data/test2_lcastar.txt): LCAStar output for the Large simulation
-* [GEBA_SAG_all_lcastar.txt](lca_star_data/GEBA_SAG_all_lcastar.txt): LCAStar output for the 201 GEBA SAGs
+* `test1_lcastar.txt`: LCAStar output for the Small simulation
+* `test2_lcastar.txt`: LCAStar output for the Large simulation
+* `GEBA_SAG_all_lcastar.txt`: LCAStar output for the 201 GEBA SAGs
 
 # Analysis of LCAStar Results
 
@@ -199,12 +205,12 @@ dev.off()
 
 Notes:
 
-* all three samples exhibit the same pattern, suggesting that the simulation procecture is producing representative results at least in terms of the LCA* and Majority summary statistics 
-* in many cases the simple Majority and LCA* solutions are the same, as shown by the strong linear trend
-* when in disagreement LCA* has more favorable (lower) p-values than the simple majority, in many cases the p-value of the LCA* solution is significantly smaller than the simple majority one
-* disagreements where $p_{Majority} < p_{LCA^*}$ occur when the LCA* transformation widens the difference between the first and second place leaders, although this is a bit of a corner case, and happens fairly infrequently
-* the majority has a degenerate case where there is as tie in the $M = X_n$, which is a degenerate case, meaning that the Simple majority had to make an arbitary choise between two or more taxonomic candidates
-* Suggests that if you subscribe to the p-value argument, LCA* produces more reliable results, although one can argue that his happens by construction.
+* All three samples exhibit the general distribution pattern, suggesting that the simulation procecture is producing representative results at least in terms of the LCA* and Majority summary statistics 
+* In many cases the simple Majority and LCA* solutions are the same, many values are along a perfect diagonal
+* When LCA* and simple Majority disagree, LCA* has more favorable (lower) p-values than the simple majority and in many cases the p-value of the LCA* solution is significantly smaller than the simple Majority one
+* Disagreements where $p_{Majority} < p_{LCA^*}$ occur when the LCA* transformation widens the difference between the first and second place leaders (this is a bit of a corner case and happens infrequently)
+* The Majority method has a degenerate case where there is as tie in the $M = X_n$ meaning that the Simple majority had to make an arbitary choise between two or more taxonomic candidates
+* Results suggest that according to p-value LCA* produces more reliable results
 
 Function to format and reshape data for ggplot:
 
@@ -222,8 +228,9 @@ clean_up_data2 <- function(df) {
 }
 ```
 
+* Reshape data into ggplot form
+
 ```{r}
-# reshape data into ggplot form
 test1_df.m <- clean_up_data2(test1_df)
 test1_df.m<- cbind(test1_df.m, Sample="Small")
 test2_df.m <- clean_up_data2(test2_df)
@@ -234,7 +241,7 @@ geba_df.m <- cbind(geba_df.m, Sample="GEBA MDM")
 all_df.m <- rbind(test1_df.m, test2_df.m, geba_df.m)
 ```
 
-* plot the distribution of distances between predictions and taxonomic origin
+* Plot the distribution of distances between predictions and taxonomic origin
 
 ```{r fig.height=7, fig.width=7}
 # plotting parameters
@@ -428,5 +435,7 @@ dev.off()
 
 **Figure 5:** p-value distribution of the Majority and LCA* predictions.
 
+Please direct any questions or comments to Dr. Steven J. Hallam (<[email protected]>), Department of Microbiology & Immunology, University of British Columbia.
+
 ## References
 
diff --git a/lca_star_analysis/LCAStar.bib b/lca_star_analysis/LCAStar.bib
@@ -51,6 +51,15 @@ @article{Konwar:2013cw
 pages = {202}
 }
 
+@article{Hanson:2014bz,
+author = {Hanson, Niels W and Konwar, Kishori M and Hawley, Alyse K and Altman, Tomer and Karp, Peter D and Hallam, Steven J},
+title = {{Metabolic pathways for the whole community.}},
+journal = {BMC Genomics},
+year = {2014},
+volume = {15},
+pages = {619}
+}
+
 @article{Hanson2014,
 author = {Hanson, Niels W and Konwar, Kishori M and Wu, Shang-Ju and Hallam, Steven J},
 title = {{MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds}},