Finished rmd of analysis up to MetaPathways output. Running of LCAStar needs cleaning up
nielshanson committed Mar 25, 2015
1 parent 36dbbd9 commit d32380c
Showing 54 changed files with 19,532 additions and 503 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -45,3 +45,9 @@ lca_star_test2*
.Rapp.history
.Rhistory
.idea/

# data
data
LCAStar_data
lca_star_data
.Rproj.user
2 changes: 1 addition & 1 deletion Compute_LCAStar.py
@@ -377,7 +377,7 @@ def main(argv):
if taxa_hits:
taxa = taxa_hits.group(1)
bitscore = fields[3]
contig_to_taxa[contig][orf].append( (taxa, int(bitscore)) )
contig_to_taxa[contig][orf].append( (taxa, float(str(bitscore))) )
else:
continue
else:
127 changes: 67 additions & 60 deletions lca_star_geba_analysis/LCAStar.Rmd → lca_star_analysis/LCAStar.Rmd
@@ -3,43 +3,59 @@ title: "LCA Star Methods"
author: "Niels W. Hanson"
date: "Friday, November 28, 2014"
output:
html_document:
toc: true
theme: readable
highlight: default
html_document:
keep_md: yes
number_sections: yes
theme: readable
toc: yes
csl: ieee.csl
bibliography: LCAStar.bib
---

```{r global_options, include=FALSE}
require(knitr)
opts_chunk$set(warning=FALSE, message=FALSE, dev = 'pdf')
library(knitr)
opts_chunk$set(warning=FALSE, message=FALSE)
```

## Overview
# Preamble

Two simulated metagenomes were created using 10,000 bp contigs, randomly sampled from a collection of 2713 genomes obtained from the NCBI (downloaded March 15, 2014, <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/>) using the python script `subsample_ncbi.py`. The first simulation is a smaller sample of 100 genomes, sampling 10 random reads from each genome. The second simulation is larger, sampling a random subset of 2,000 genomes and taking 10 random reads from each. Each simulation had its ORFs predicted and annotated against the RefSeq database using the MetaPathways pipeline [[@Konwar:2013cw, @Hanson2014]]. The original source of each contig was predicted from the taxonomic annotations ascribed to each LCA-predicted ORF using three different methods: $LCA^2$, $Simple Majority$, and our information-theoretic $LCA*$. $LCA^2$ simply applies the LCA algorithm again to the set of contig taxonomies. The $Simple Majority$ method ascribes to the contig the taxonomy with the greatest majority. Our $LCA*$ method applies the information-theoretic result and algorithm previously described, with the majority threshold set to the default ($\alpha=0.5$).
* Load required R packages

We evaluated the performance of these predictions using two taxonomic distances on the NCBI taxonomy tree. The first is a simple walk on the NCBI Taxonomy Hierarchy from the observed predicted taxonomy to the original expected taxonomy. The second is a weighted taxonomic distance that weights each edge proportionally to $\frac{1}{d}$, where $d$ is the depth of the edge in the tree. See Hanson *et al.* (2014) for more details. The NCBI Taxonomy Hierarchy was modified with an additional *prokaryotes* node as the parent of the *Bacteria* and *Archaea* nodes.
```{r}
require(ggplot2)
require(reshape2)
require(dplyr)
theme_set(theme_bw()) # set theme and font
```

# Overview

This document summarizes the analysis performed in "LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes", describing the simulation of the Small and Large metagenomes from a collection of genomes downloaded from the NCBI, the running of the MetaPathways pipeline [[@Konwar:2013cw], [@Hanson2014]], and the use of the `LCAStar.py` python module to compute the $LCA*$, $Simple Majority$, and $LCA^2$ taxonomic estimates.

Each simulation had its ORFs predicted and annotated against the RefSeq database using the MetaPathways pipeline. The original source of each contig was predicted from the taxonomic annotations ascribed to each LCA-predicted ORF using three different methods: $LCA^2$, $Simple Majority$, and our information-theoretic $LCA*$. $LCA^2$ simply applies the LCA algorithm again to the set of ORF taxonomies. The $Simple Majority$ method ascribes to the contig the taxonomy with the greatest majority. Our $LCA*$ method applies the information-theoretic result and algorithm previously described, with the majority threshold set to the default ($\alpha=0.5$).
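
To make the three rules concrete, the following toy R sketch applies them to four hypothetical ORF annotations from a single contig. The lineages, helper functions, and in particular the simplified `lca_star_toy()` rule are illustrative assumptions only; they are not the `LCAStar.py` implementation.

```{r}
# Toy illustration of the three contig-level assignment rules.
# Simplified sketch for intuition only, not LCAStar.py.
# Each ORF annotation is represented by its lineage, root -> leaf.
orf_lineages <- list(
  c("prokaryotes", "Bacteria", "Proteobacteria", "Escherichia coli"),
  c("prokaryotes", "Bacteria", "Proteobacteria", "Escherichia coli"),
  c("prokaryotes", "Bacteria", "Proteobacteria", "Salmonella enterica"),
  c("prokaryotes", "Bacteria", "Firmicutes", "Bacillus subtilis")
)

# LCA^2: deepest taxon in the longest common prefix of all lineages
lca2 <- function(lineages) {
  n <- min(sapply(lineages, length))
  common <- character(0)
  for (i in seq_len(n)) {
    level <- unique(sapply(lineages, `[`, i))
    if (length(level) == 1) common <- c(common, level) else break
  }
  tail(common, 1)
}

# Simple Majority: most frequent leaf-level annotation
simple_majority <- function(lineages) {
  leaves <- sapply(lineages, tail, 1)
  names(sort(table(leaves), decreasing = TRUE))[1]
}

# LCA*-like rule (simplified): deepest taxon whose lineage covers
# more than alpha of the ORFs
lca_star_toy <- function(lineages, alpha = 0.5) {
  prefixes <- unlist(lapply(lineages, function(l) {
    sapply(seq_along(l), function(i) paste(l[1:i], collapse = ";"))
  }))
  frac <- table(prefixes) / length(lineages)
  candidates <- names(frac[frac > alpha])
  deepest <- candidates[which.max(sapply(strsplit(candidates, ";"), length))]
  tail(strsplit(deepest, ";")[[1]], 1)
}

lca2(orf_lineages)            # "Bacteria"
simple_majority(orf_lineages) # "Escherichia coli"
lca_star_toy(orf_lineages)    # "Proteobacteria"
```

With these four annotations, $LCA^2$ retreats all the way to *Bacteria*, the $Simple Majority$ stays at *Escherichia coli*, and the majority-threshold rule settles on *Proteobacteria*; this trade-off between conservatism and resolution is what the comparison below is about.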

We evaluated the performance of these predictions using two taxonomic distances on the NCBI taxonomy tree. The first is a simple walk on the NCBI Taxonomy Hierarchy from the observed predicted taxonomy to the original expected taxonomy. The second is a weighted taxonomic distance that weights each edge proportionally to $\frac{1}{d}$, where $d$ is the depth of the edge in the tree (see [[@Hanson2014]] for more details). The NCBI Taxonomy Hierarchy was modified with an additional *prokaryotes* node as the parent of the *Bacteria* and *Archaea* nodes.
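
As a worked illustration of the two distances, here is a small R sketch that computes both on a pair of toy lineages. The depth convention (an edge's depth taken as the depth of its child node) and the labels `Walk` and `WTD` are assumptions for the sketch; the actual evaluation uses the values produced by `LCAStar.py` and read in later in this document.

```{r}
# Toy sketch of the two taxonomic distances, assuming each taxon is
# given as its lineage from the root of the (modified) NCBI hierarchy.
taxonomic_distances <- function(observed, expected) {
  # depth of the deepest shared ancestor
  n <- min(length(observed), length(expected))
  lca_depth <- 0
  for (i in seq_len(n)) {
    if (observed[i] == expected[i]) lca_depth <- i else break
  }
  # simple walk: edges from observed up to the LCA and down to expected
  walk <- (length(observed) - lca_depth) + (length(expected) - lca_depth)
  # weighted distance: each traversed edge contributes 1/d, with d the
  # depth of the edge (here taken as the depth of its child node)
  edge_depths <- c(
    if (length(observed) > lca_depth) seq(length(observed), lca_depth + 1) else integer(0),
    if (length(expected) > lca_depth) seq(length(expected), lca_depth + 1) else integer(0)
  )
  c(Walk = walk, WTD = sum(1 / edge_depths))
}

observed <- c("prokaryotes", "Bacteria", "Proteobacteria", "Escherichia coli")
expected <- c("prokaryotes", "Bacteria", "Proteobacteria", "Salmonella enterica")
taxonomic_distances(observed, expected)  # Walk = 2, WTD = 1/4 + 1/4 = 0.5
```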

# Simulation

### Obtaining NCBI Genomes
Two simulated metagenomes, Small and Large, were created using 10,000 bp contigs randomly sampled from a collection of 2713 genomes obtained from the NCBI (downloaded March 15, 2014, <ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/>) using the python script `subsample_ncbi.py`. The first simulation, Small, is a smaller sample of 100 genomes, sampling 10 random reads from each genome. The second simulation, Large, samples a random subset of 2,000 genomes, again taking 10 random reads from each genome.

* Downloaded `all.fna.tar.gz` and `summary.txt` from the NCBI's ftp server:ftp://ftp.ncbi.nlm.nih.gov/genomes/ Mar 12, 2014
## Obtaining NCBI Genomes

* Downloaded `all.fna.tar.gz` and `summary.txt` from the NCBI's ftp server: <ftp://ftp.ncbi.nlm.nih.gov/genomes/> Mar 12, 2014
* `summary.txt` contains some metadata for a number of these genomes:

```
Accession GenbankAcc Length Taxid ProjectID TaxName Replicon Create Date Update_Date
NC_000907.1 L42023.1 1830138 71421 57771 Haemophilus influenzae Rd KW20 chromosome Oct 19 2001 Sep 11 2013 4:30:20:076PM
```

* this kind of information would be useful for the analysis, so I cross-referenced all the Accession IDs with the actual `.fna` files that I had in `all.fna.tar.gz` using the following shell command:

```
cat summary.txt | awk '{print $1}' > u
for s in `cat u`; do var1=`find . -name *${s%.[0-9]}*`; echo $s$'\t'$var1 >> ncbiID_to_file; done
```

* it turns out that this is a fairly comprehensive list 25 genomes were dropped because they were not found in the MetaData
* `summary.txt` is a fairly comprehensive list, but 25 genomes were dropped because they were not found in the metadata:

```
grep --perl-regexp "\t$" ncbiID_to_file | wc
@@ -71,81 +87,70 @@
NC_015557.1
NC_015587.1
```

* this leaves us with 2617 genomes with annotation of 2642, storing these in `ncbiID_to_file.txt`
* this leaves a total of 2617 genomes in `ncbiID_to_file.txt`, which maps NCBI IDs to their `.fna` file locations:

```
grep --perl-regexp ".*fna" ncbiID_to_file2 | wc
grep --perl-regexp ".*fna" ncbiID_to_file | wc
2617 5234 197129
wc ncbiID_to_file2
2642 5259 197453 ncbiID_to_file2
grep --perl-regexp ".*fna" ncbiID_to_file2 > ncbiID_to_file.txt
wc ncbiID_to_file
2642 5259 197453 ncbiID_to_file
grep --perl-regexp ".*fna" ncbiID_to_file > ncbiID_to_file.txt
```

### Creating Simulated Metagenomes
## Simulation

Wrote my own script `subsample_ncbi.py` quickly sample from a collection of fasta files specified in our `ncbiID_to_file.txt` that we created above.
Wrote my own script `subsample_ncbi.py` to quickly sample from the collection of fasta files specified in the `ncbiID_to_file.txt` we created above. We use it to generate the `test1` (Small) and `test2` (Large) simulated metagenomes.

The following example creates sub-sequences of length 10,000, sampling 10 sequences per file on a random subset of 100 (the random number generator is seeded so that results can be reproducible)

```
python subsample_ncbi.py -i ncbiID_to_file.txt -l 10000 -s 10 -n 100 -o lca_star_test1.fna
```
### Small simulation

* lca_star_test1.fna is going to be our first test, we ran it through MetaPathways with the standard settings:
Here we create `test1` (the Small simulation, as it is referred to in the text). The script creates sub-sequences of length 10,000 bp, sampling 10 sequences per file from a random subset of 100 genomes (the random number generator is seeded so that results are reproducible):

```
Run Date : 2014-03-15
Nucleotide Quality Control parameters
min length 180
ORF prediction parameters
min length 60
algorithm prodigal
Amino acid quality control and annotation parameters
min bit score 20
min seq length 60
annotation reference dbs RefSeq_complete_nr_v62_dec2013
min BSR 0.4
max evalue 0.000001
Pathway Tools parameters
taxonomic pruning no
rRNA search/match parameters
min identity 20
max evalue 0.000001
rRNA reference dbs GREENGENES_gg16S-2013-09-12
python subsample_ncbi.py -i ncbiID_to_file.txt -l 10000 -s 10 -n 100 -o lca_star_test1.fasta
```

#### Second test
Creating our Small simulated contigs file: `lca_star_test1.fasta`
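
For intuition, here is a minimal base-R sketch of what the sub-sampling step does; the real sampling is done by the Python script `subsample_ncbi.py`, and the toy genome below is a stand-in, not one of the NCBI sequences.

```{r}
# Minimal sketch of sampling fixed-length sub-sequences from a genome.
# Illustration only; subsample_ncbi.py (Python) did the real sampling.
set.seed(42)  # seeded so the sampling is reproducible

sample_contigs <- function(genome_seq, n = 10, len = 10000) {
  # genome_seq: one genome sequence as a single character string
  starts <- sample(nchar(genome_seq) - len + 1, n)
  substring(genome_seq, starts, starts + len - 1)
}

toy_genome <- paste(sample(c("A", "C", "G", "T"), 5e4, replace = TRUE), collapse = "")
contigs <- sample_contigs(toy_genome)
nchar(contigs)  # ten sub-sequences of 10,000 bp each
```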

### Large simulation

For a second test we sampled sequences of 10,000 bp, taking 10 subsamples from each of 2,000 randomly selected genomes. Here we follow the same procedure as above, just with significantly more samples.

```
python ../../subsample_ncbi.py -i ../ncbiID_to_file.txt -o lca_test2.fasta -l 10000 -s 10 -n 2000
python subsample_ncbi.py -i ncbiID_to_file.txt -o lca_test2.fasta -l 10000 -s 10 -n 2000
```

### Simulations and GEBA SAGs
Creating our Large simulated contigs file: `lca_test2.fasta`

# GEBA SAGs

The 201 single-cell amplified genome (SAG) microbial dark matter assemblies were obtained from <https://img.jgi.doe.gov> under the Study Name `GEBA-MDM` [[@Rinke]]. Combined assemblies were excluded from this analysis.

# MetaPathways Annotation

The Small and Large simulations and the 201 MDM SAG assemblies were annotated against RefSeq v62 via the MetaPathways v2.0 pipeline [REF] using standard quality-control settings and the LAST homology-search algorithm [[@Kieibasa:2011do]].

The outputs of these runs can be found in:

* [lca_star_data/lca_star_simulations/](lca_star_data/lca_star_simulations/): F

# Running LCAStar

* Load required libraries

```{r}
library(ggplot2)
library(reshape2)
library(dplyr)
theme_set(theme_bw()) # set theme and font
```

* read and merge the three datasets
# Analysis of LCAStar Results

* Read and merge the three datasets

```{r}
setwd("~/Dropbox/projects/LCAStar/lca_star_geba_analysis")
colClasses <- c("character", "character", "numeric", "numeric", "numeric",
"character", "numeric", "numeric", "numeric",
"character", "numeric", "numeric", "character" )
geba_df <- read.table("GEBA_SAG_all_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
test1_df <- read.table("test1_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
test2_df <- read.table("test2_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
geba_df <- read.table("GEBA_SAG_all_lcastar.txt", sep="\t", header=T, na.strings = "None",
colClasses = colClasses, strip.white=TRUE, quote="")
all_df <- rbind(cbind(geba_df, Sample="GEBA"),cbind(test1_df, Sample="Small"),cbind(test2_df, Sample="Large"))
@@ -318,6 +323,7 @@ g3a
Treating the whole thing as a regression problem, we have the following global averages.

```{r fig.width=5.49, fig.height=4.43}
# function for RMSE
rmse <- function(x) {
sqrt(mean(x^2, na.rm = T))
}
@@ -335,6 +341,8 @@ ci <- function(x, sig=0.05) {
x_ci <- x_se * ciMult
x_ci
}
# function to compute standard errors
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
require(plyr)
@@ -414,7 +422,6 @@ dev.off()

**Figure 5:** p-value distribution of the Majority and LCA* predictions.


```{r include=FALSE}
g8 <- ggplot(subset(geba_df.m, variable == "Walk"), aes(x=value, fill=Method))
g8 <- g8 + geom_histogram(binwidth=2, alpha=0.8)
File renamed without changes.
460 changes: 460 additions & 0 deletions lca_star_analysis/LCAStar.html

Large diffs are not rendered by default.

