Finished rmd of analysis up to MetaPathways output. Running of LCAStar needs cleaning up
nielshanson committed Mar 25, 2015
1 parent 36dbbd9 commit d32380c
Showing 54 changed files with 19,532 additions and 503 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -45,3 +45,9 @@ lca_star_test2*
.Rapp.history
.Rhistory
.idea/

# data
data
LCAStar_data
lca_star_data
.Rproj.user
2 changes: 1 addition & 1 deletion Compute_LCAStar.py
@@ -377,7 +377,7 @@ def main(argv):
if taxa_hits:
taxa = taxa_hits.group(1)
bitscore = fields[3]
contig_to_taxa[contig][orf].append( (taxa, int(bitscore)) )
contig_to_taxa[contig][orf].append( (taxa, float(str(bitscore))) )
else:
continue
else:
127 changes: 67 additions & 60 deletions lca_star_geba_analysis/LCAStar.Rmd → lca_star_analysis/LCAStar.Rmd
@@ -3,43 +3,59 @@ title: "LCA Star Methods"
author: "Niels W. Hanson"
date: "Friday, November 28, 2014"
output:
html_document:
toc: true
theme: readable
highlight: default
html_document:
keep_md: yes
number_sections: yes
theme: readable
toc: yes
csl: ieee.csl
bibliography: LCAStar.bib
---

```{r global_options, include=FALSE}
require(knitr)
opts_chunk$set(warning=FALSE, message=FALSE, dev = 'pdf')
library(knitr)
opts_chunk$set(warning=FALSE, message=FALSE)
```

## Overview
# Preamble

Two simulated metagenomes were created using 10,000 bp contigs, randomly sampled from a collection of 2713 genomes obtained from the NCBI (downloaded March 15, 2014, <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/>) using the python script `subsample_ncbi.py`. The first simulation is a smaller sample of 100 genomes, sampling 10 random reads from each genome. The second simulation is larger, sampling a random subset of 2,000 genomes and taking 10 random reads from each. Each simulation had its ORFs predicted and annotated against the RefSeq database using the MetaPathways pipeline [[@Konwar:2013cw, @Hanson2014]]. The original source of each contig was predicted from the taxonomic annotations ascribed to each LCA-predicted ORF using three different methods: $LCA^2$, $Simple Majority$, and our information-theoretic $LCA*$. $LCA^2$ simply applies the LCA algorithm again to the set of contig taxonomies. The $Simple Majority$ method ascribes to the contig the taxonomy with the greatest majority. Our $LCA*$ method applies the information-theoretic result and algorithm previously described, with the majority threshold set to the default ($\alpha=0.5$).
* Load required R packages

We evaluated the performance of these predictions using two taxonomic distances on the NCBI taxonomy tree. The first is a simple walk on the NCBI Taxonomy Hierarchy from the observed predicted taxonomy to the original expected taxonomy. The second is a weighted taxonomic distance that weights each edge proportionally to $\frac{1}{d}$, where $d$ is the depth of the edge in the tree. See Hanson *et al.* (2014) for more details. The NCBI Taxonomy Hierarchy was modified with an additional *prokaryotes* node as the parent of the *Bacteria* and *Archaea* nodes.
```{r}
require(ggplot2)
require(reshape2)
require(dplyr)
theme_set(theme_bw()) # set theme and font
```

# Overview

This document summarizes the analysis performed in "LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes", describing the simulation of the Small and Large metagenomes from a collection of genomes downloaded from the NCBI, the running of the MetaPathways pipeline [[@Konwar:2013cw], [@Hanson2014]], and the use of the `LCAStar.py` python module to compute the $LCA*$, $Simple Majority$, and $LCA^2$ taxonomic estimates.

Each simulation had its ORFs predicted and annotated against the RefSeq database using the MetaPathways pipeline. The original source of each contig was predicted from the taxonomic annotations ascribed to each LCA-predicted ORF using three different methods: $LCA^2$, $Simple Majority$, and our information-theoretic $LCA*$. $LCA^2$ simply applies the LCA algorithm again to the set of ORF taxonomies. The $Simple Majority$ method ascribes to the contig the taxonomy with the greatest majority. Our $LCA*$ method applies the information-theoretic result and algorithm previously described, with the majority threshold set to the default ($\alpha=0.5$).
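
To make the three rules concrete, the following toy R sketch applies them to four hypothetical ORF annotations from a single contig. The lineages, helper functions, and in particular the simplified `lca_star_toy()` rule are illustrative assumptions only; they are not the `LCAStar.py` implementation.

```{r}
# Toy illustration of the three contig-level assignment rules.
# Simplified sketch for intuition only, not LCAStar.py.
# Each ORF annotation is represented by its lineage, root -> leaf.
orf_lineages <- list(
  c("prokaryotes", "Bacteria", "Proteobacteria", "Escherichia coli"),
  c("prokaryotes", "Bacteria", "Proteobacteria", "Escherichia coli"),
  c("prokaryotes", "Bacteria", "Proteobacteria", "Salmonella enterica"),
  c("prokaryotes", "Bacteria", "Firmicutes", "Bacillus subtilis")
)

# LCA^2: deepest taxon in the longest common prefix of all lineages
lca2 <- function(lineages) {
  n <- min(sapply(lineages, length))
  common <- character(0)
  for (i in seq_len(n)) {
    level <- unique(sapply(lineages, `[`, i))
    if (length(level) == 1) common <- c(common, level) else break
  }
  tail(common, 1)
}

# Simple Majority: most frequent leaf-level annotation
simple_majority <- function(lineages) {
  leaves <- sapply(lineages, tail, 1)
  names(sort(table(leaves), decreasing = TRUE))[1]
}

# LCA*-like rule (simplified): deepest taxon whose lineage covers
# more than alpha of the ORFs
lca_star_toy <- function(lineages, alpha = 0.5) {
  prefixes <- unlist(lapply(lineages, function(l) {
    sapply(seq_along(l), function(i) paste(l[1:i], collapse = ";"))
  }))
  frac <- table(prefixes) / length(lineages)
  candidates <- names(frac[frac > alpha])
  deepest <- candidates[which.max(sapply(strsplit(candidates, ";"), length))]
  tail(strsplit(deepest, ";")[[1]], 1)
}

lca2(orf_lineages)            # "Bacteria"
simple_majority(orf_lineages) # "Escherichia coli"
lca_star_toy(orf_lineages)    # "Proteobacteria"
```

With these four annotations, $LCA^2$ retreats all the way to *Bacteria*, the $Simple Majority$ stays at *Escherichia coli*, and the majority-threshold rule settles on *Proteobacteria*; this trade-off between conservatism and resolution is what the comparison below is about.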

We evaluated the performance of these predictions using two taxonomic distances on the NCBI taxonomy tree. The first is a simple walk on the NCBI Taxonomy Hierarchy from the observed predicted taxonomy to the original expected taxonomy. The second is a weighted taxonomic distance that weights each edge proportionally to $\frac{1}{d}$, where $d$ is the depth of the edge in the tree (see [[@Hanson2014]] for more details). The NCBI Taxonomy Hierarchy was modified with an additional *prokaryotes* node as the parent of the *Bacteria* and *Archaea* nodes.
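
As a worked illustration of the two distances, here is a small R sketch that computes both on a pair of toy lineages. The depth convention (an edge's depth taken as the depth of its child node) and the labels `Walk` and `WTD` are assumptions for the sketch; the actual evaluation uses the values produced by `LCAStar.py` and read in later in this document.

```{r}
# Toy sketch of the two taxonomic distances, assuming each taxon is
# given as its lineage from the root of the (modified) NCBI hierarchy.
taxonomic_distances <- function(observed, expected) {
  # depth of the deepest shared ancestor
  n <- min(length(observed), length(expected))
  lca_depth <- 0
  for (i in seq_len(n)) {
    if (observed[i] == expected[i]) lca_depth <- i else break
  }
  # simple walk: edges from observed up to the LCA and down to expected
  walk <- (length(observed) - lca_depth) + (length(expected) - lca_depth)
  # weighted distance: each traversed edge contributes 1/d, with d the
  # depth of the edge (here taken as the depth of its child node)
  edge_depths <- c(
    if (length(observed) > lca_depth) seq(length(observed), lca_depth + 1) else integer(0),
    if (length(expected) > lca_depth) seq(length(expected), lca_depth + 1) else integer(0)
  )
  c(Walk = walk, WTD = sum(1 / edge_depths))
}

observed <- c("prokaryotes", "Bacteria", "Proteobacteria", "Escherichia coli")
expected <- c("prokaryotes", "Bacteria", "Proteobacteria", "Salmonella enterica")
taxonomic_distances(observed, expected)  # Walk = 2, WTD = 1/4 + 1/4 = 0.5
```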

# Simulation

### Obtaining NCBI Genomes
Two simulated metagenomes, Small and Large, were created using 10,000 bp contigs randomly sampled from a collection of 2713 genomes obtained from the NCBI (downloaded March 15, 2014, <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/>) using the python script `subsample_ncbi.py`. The first simulation, Small, is a smaller sample of 100 genomes, sampling 10 random reads from each genome. The second simulation, Large, samples a random subset of 2,000 genomes, again taking 10 random reads from each genome.

* Downloaded `all.fna.tar.gz` and `summary.txt` from the NCBI's ftp server:ftp://ftp.ncbi.nlm.nih.gov/genomes/ Mar 12, 2014
## Obtaining NCBI Genomes

* Downloaded `all.fna.tar.gz` and `summary.txt` from the NCBI's ftp server: <ftp://ftp.ncbi.nlm.nih.gov/genomes/> Mar 12, 2014
* `summary.txt` contains some metadata for a number of these genomes:

```
Accession GenbankAcc Length Taxid ProjectID TaxName Replicon Create Date Update_Date
NC_000907.1 L42023.1 1830138 71421 57771 Haemophilus influenzae Rd KW20 chromosome Oct 19 2001 Sep 11 2013 4:30:20:076PM
```

* this kind of information would be useful for the analysis, so I cross-referenced all the Accession IDs with the actual `.fna` files that I had in `all.fna.tar.gz` using the following shell command:

```
cat summary.txt | awk '{print $1}' > u
for s in `cat u`; do var1=`find . -name *${s%.[0-9]}*`; echo $s$'\t'$var1 >> ncbiID_to_file; done
```

* it turns out that this is a fairly comprehensive list 25 genomes were dropped because they were not found in the MetaData
* `summary.txt` is a fairly comprehensive list, but 25 genomes were dropped because they were not found in the metadata:

```
grep --perl-regexp "\t$" ncbiID_to_file | wc
@@ -71,81 +87,70 @@
NC_015557.1
NC_015587.1
```

* this leaves us with 2617 genomes with annotation of 2642, storing these in `ncbiID_to_file.txt`
* this leaves a total of 2617 genomes in `ncbiID_to_file.txt`, which maps NCBI IDs to their `.fna` file locations:

```
grep --perl-regexp ".*fna" ncbiID_to_file2 | wc
grep --perl-regexp ".*fna" ncbiID_to_file | wc
2617 5234 197129
wc ncbiID_to_file2
2642 5259 197453 ncbiID_to_file2
grep --perl-regexp ".*fna" ncbiID_to_file2 > ncbiID_to_file.txt
wc ncbiID_to_file
2642 5259 197453 ncbiID_to_file
grep --perl-regexp ".*fna" ncbiID_to_file > ncbiID_to_file.txt
```

### Creating Simulated Metagenomes
## Simulation

Wrote my own script `subsample_ncbi.py` quickly sample from a collection of fasta files specified in our `ncbiID_to_file.txt` that we created above.
Wrote my own script `subsample_ncbi.py` to quickly sample from the collection of fasta files specified in the `ncbiID_to_file.txt` we created above. We use it to generate the `test1` (Small) and `test2` (Large) simulated metagenomes.

The following example creates sub-sequences of length 10,000, sampling 10 sequences per file on a random subset of 100 (the random number generator is seeded so that results can be reproducible)

```
python subsample_ncbi.py -i ncbiID_to_file.txt -l 10000 -s 10 -n 100 -o lca_star_test1.fna
```
### Small simulation

* lca_star_test1.fna is going to be our first test, we ran it through MetaPathways with the standard settings:
Here we create `test1` (the Small simulation, as it is referred to in the text). The script creates sub-sequences of length 10,000 bp, sampling 10 sequences per file from a random subset of 100 genomes (the random number generator is seeded so that results are reproducible):

```
Run Date : 2014-03-15
Nucleotide Quality Control parameters
min length 180
ORF prediction parameters
min length 60
algorithm prodigal
Amino acid quality control and annotation parameters
min bit score 20
min seq length 60
annotation reference dbs RefSeq_complete_nr_v62_dec2013
min BSR 0.4
max evalue 0.000001
Pathway Tools parameters
taxonomic pruning no
rRNA search/match parameters
min identity 20
max evalue 0.000001
rRNA reference dbs GREENGENES_gg16S-2013-09-12
python subsample_ncbi.py -i ncbiID_to_file.txt -l 10000 -s 10 -n 100 -o lca_star_test1.fasta
```

#### Second test
Creating our Small simulated contigs file: `lca_star_test1.fasta`
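
For intuition, here is a minimal base-R sketch of what the sub-sampling step does; the real sampling is done by the Python script `subsample_ncbi.py`, and the toy genome below is a stand-in, not one of the NCBI sequences.

```{r}
# Minimal sketch of sampling fixed-length sub-sequences from a genome.
# Illustration only; subsample_ncbi.py (Python) did the real sampling.
set.seed(42)  # seeded so the sampling is reproducible

sample_contigs <- function(genome_seq, n = 10, len = 10000) {
  # genome_seq: one genome sequence as a single character string
  starts <- sample(nchar(genome_seq) - len + 1, n)
  substring(genome_seq, starts, starts + len - 1)
}

toy_genome <- paste(sample(c("A", "C", "G", "T"), 5e4, replace = TRUE), collapse = "")
contigs <- sample_contigs(toy_genome)
nchar(contigs)  # ten sub-sequences of 10,000 bp each
```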

### Large simulation

For a second test we sampled sequences of 10,000 bp, taking 10 subsamples from each of 2,000 randomly selected genomes. Here we follow the same procedure as above, just with significantly more samples.

```
python ../../subsample_ncbi.py -i ../ncbiID_to_file.txt -o lca_test2.fasta -l 10000 -s 10 -n 2000
python subsample_ncbi.py -i ncbiID_to_file.txt -o lca_test2.fasta -l 10000 -s 10 -n 2000
```

### Simulations and GEBA SAGs
Creating our Large simulated contigs file: `lca_test2.fasta`

# GEBA SAGs

The 201 single-cell amplified genome (SAG) microbial dark matter assemblies were obtained from <https://img.jgi.doe.gov> under the Study Name `GEBA-MDM` [[@Rinke]]. Combined assemblies were excluded from this analysis.

# MetaPathways Annotation

The Small and Large simulations and the 201 MDM SAG assemblies were annotated against RefSeq v62 via the MetaPathways v2.0 pipeline [REF] using standard quality-control settings and the LAST homology-search algorithm [[@Kieibasa:2011do]].

The outputs of these runs can be found in:

* [lca_star_data/lca_star_simulations/](lca_star_data/lca_star_simulations/): F

# Running LCAStar

* Load required libraries

```{r}
library(ggplot2)
library(reshape2)
library(dplyr)
theme_set(theme_bw()) # set theme and font
```

* read and merge the three datasets
# Analysis of LCAStar Results

* Read and merge the three datasets

```{r}
setwd("~/Dropbox/projects/LCAStar/lca_star_geba_analysis")
colClasses <- c("character", "character", "numeric", "numeric", "numeric",
"character", "numeric", "numeric", "numeric",
"character", "numeric", "numeric", "character" )
geba_df <- read.table("GEBA_SAG_all_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
test1_df <- read.table("test1_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
test2_df <- read.table("test2_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
geba_df <- read.table("GEBA_SAG_all_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
all_df <- rbind(cbind(geba_df, Sample="GEBA"),cbind(test1_df, Sample="Small"),cbind(test2_df, Sample="Large"))
@@ -318,6 +323,7 @@ g3a
Treating the whole thing as a regression problem, we have the following global averages.

```{r fig.width=5.49, fig.height=4.43}
# function for RMSE
rmse <- function(x) {
sqrt(mean(x^2, na.rm = T))
}
@@ -335,6 +341,8 @@ ci <- function(x, sig=0.05) {
x_ci <- x_se * ciMult
x_ci
}
# function to compute standard errors
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
require(plyr)
@@ -414,7 +422,6 @@ dev.off()

**Figure 5:** p-value distribution of the Majority and LCA* predictions.


```{r include=FALSE}
g8 <- ggplot(subset(geba_df.m, variable == "Walk"), aes(x=value, fill=Method))
g8 <- g8 + geom_histogram(binwidth=2, alpha=0.8)
File renamed without changes.
460 changes: 460 additions & 0 deletions lca_star_analysis/LCAStar.html

Large diffs are not rendered by default.

