
Commit

Merge pull request #3 from hallamlab/redux
Redux for conda install
Tony-xy-Liu authored May 29, 2024
2 parents 979b024 + f8be90d commit 384d8a9
Showing 112 changed files with 3,325 additions and 60 deletions.
41 changes: 19 additions & 22 deletions .gitignore
@@ -1,25 +1,23 @@
*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
# folders
/data
/secrets
/scratch
/notebooks

# temp files and caches
cache
__pycache__
conda_*
notebooks
.ipynb_checkpoints
*.sif
*.egg-info

# conda package build
/dist
/build
/conda_build
/conda_recipe/*
!/conda_recipe/compile_recipe.py
!/conda_recipe/meta_template.yaml

# Installer logs
pip-log.txt
@@ -53,4 +51,3 @@ data
LCAStar_data
lca_star_data
.Rproj.user
secrets
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

103 changes: 65 additions & 38 deletions README.md
@@ -3,50 +3,77 @@ LCAStar: an entropy-based measure for taxonomic assignment within assembled meta

Niels W. Hanson, Kishori M. Konwar, Steven J. Hallam

![lca_star_logo.png](lca_star_logo.png)
![lca_star_logo.png](legacy/lca_star_logo.png)

## Abstract

A perennial problem in the analyses of large meta'omic datasets is the taxonomic classification of unknown reads or assembled contigs to their likely taxa of origin. Although the assembly of metagenomic samples has its difficulties, once contigs are found it is often important to classify them to a taxonomy based on their ORF annotations. The popular Lowest Common Ancestor (LCA) algorithm addresses a similar problem with ORF annotations, and it is intuitive to apply the same taxonomic annotation procedure to the annotation of contigs, a procedure we call LCA^2. Inspired by Information and Voting Theory, we developed an alternative statistic, LCA\*, by viewing the taxonomic classification problem as an election among the different taxonomic annotations, and generalize an algorithm to obtain a sufficiently strong majority (α-majority) while respecting the entropy of the taxonomic distribution and the phylogenetic tree structure of the NCBI Taxonomy Database. Further, using results from order and supremacy statistics, we formulate a likelihood-ratio hypothesis test and p-value for testing the supremacy of the final reported taxonomy. In simulated metagenomic contig experiments, we empirically demonstrate that the voting-based methods, majority vote and LCA\*, are significantly more accurate than LCA^2, and that in many cases LCA\* is superior to the simple majority vote procedure. LCA\* and its statistical tests have been implemented as a stand-alone Python library and have been integrated into the latest release of the [MetaPathways pipeline](https://github.com/hallamlab/metapathways2).

## Installation

LCA\* is released as a Python library, requiring Python 2.6 or greater. More installation and usage information can be found on the wiki.

## Contents

* [Compute_LCAStar.py](Compute_LCAStar.py): Driver script for running LCAStar.py

* Usage:

```
python Compute_LCAStar.py -i blast_results/refseq.*.parsed.txt \
    -m preprocessed/*.mapping.txt \
    --ncbi_tree resources/ncbi_taxonomy_tree.txt \
    --ncbi_megan_map resources/ncbi.map \
    -a \
    -v \
    --contig_taxa_ref ...contigmap.txt \
    -o LCAStar.output.txt
```
where,
* `-i`: a MetaPathways `parsed.txt` annotation file
* `-m`: a MetaPathways mapping file (`.mapping.txt`)
* `--ncbi_tree`: the MetaPathways `ncbi_taxonomy_tree.txt`
* `--ncbi_megan_map`: the NCBI-to-MEGAN taxonomy mapping file (`ncbi.map`)
* `-a`: compute all methods: Majority, LCAStar, and LCA^2
* `-v`: verbose mode
* `--contig_taxa_ref`: tab-delimited file specifying the original taxonomy of the input contigs
* `-o`: output text file
* [lca_star_analysis/](lca_star_analysis/): contains analysis code for the validation experiments found in the text. The main RMarkdown document can be found [here](lca_star_analysis/LCAStar.md).
* [python_resources/](python_resources/): contains the LCAStar Python library as well as other Python libraries required to perform the analysis.
* [resources/](resources/): other resource files required for the analysis
## Downloads
Some required files are too large to fit into a GitHub repository and can be found at the following links:
* [lca_star_data.zip](lca_star_data.zip): contains MetaPathways output and NCBI genome files used for the validation experiments
LCA\* is released as a Python library on [anaconda](https://www.anaconda.com/download).

```bash
conda install -c hallamlab lcastar
```

## Usage

```python
from lcastar import LcaStar, Lineage
```

### *with scientific name (genus species)*
```python
orf_hits = [
    "Muribaculaceae bacterium",
    "Muribaculaceae bacterium",
    "Bacteroidales bacterium",
    "Muribaculaceae bacterium",
    "Alistipes senegalensis",
]

tree = LcaStar()
for sci_name in orf_hits:
    lin = Lineage.FromSciName(sci_name)
    assert lin is not None
    tree.NewObservation(lin)

for node in tree.BestLineage():
    print(node.level, node.name, node.fraction_votes, node.p_value)
```

### *with NCBI taxonomy ID*
```python
orf_hits = [
    2498093,
    2498093,
    2030927,
    2498093,
    1288121,
]

tree = LcaStar()
for tax_id in orf_hits:
    lin = Lineage.FromTaxID(tax_id)
    assert lin is not None
    tree.NewObservation(lin)
```

### *output:*
```python
for node in tree.BestLineage():
    print(node.level, node.name, node.fraction_votes, node.p_value)
```
```
superkingdom Bacteria 1.0 0.08273697918531309
clade FCB group 1.0 0.08273697918531309
clade Bacteroidota/Chlorobiota group 1.0 0.08273697918531309
phylum Bacteroidota 1.0 0.08273697918531309
class Bacteroidia 1.0 0.08273697918531309
order Bacteroidales 1.0 0.08273697918531309
species Bacteroidales bacterium 0.2 1.0
```
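The printed lineage can be collapsed to a single taxonomic call. Below is a minimal post-processing sketch; the `Node` dataclass is a hypothetical stand-in mirroring the `level`, `name`, `fraction_votes`, and `p_value` attributes shown above, and the p-value cutoff is an arbitrary illustrative threshold, not part of LCAStar itself:

```python
from dataclasses import dataclass

@dataclass
class Node:
    level: str
    name: str
    fraction_votes: float
    p_value: float

# values transcribed from the example output above (root -> leaf order)
lineage = [
    Node("superkingdom", "Bacteria", 1.0, 0.0827),
    Node("phylum", "Bacteroidota", 1.0, 0.0827),
    Node("order", "Bacteroidales", 1.0, 0.0827),
    Node("species", "Bacteroidales bacterium", 0.2, 1.0),
]

def best_assignment(nodes, max_p=0.5):
    """Return the deepest node whose p-value stays under the cutoff."""
    confident = [n for n in nodes if n.p_value <= max_p]
    return confident[-1] if confident else None

best = best_assignment(lineage)
print(best.level, best.name)  # -> order Bacteroidales
```

Here the species-level call loses out because only 20% of votes support it (p = 1.0), so the deepest confident assignment is the order Bacteroidales.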

## TODO:
* wait for [ete4](https://github.com/etetoolkit/ete) to be released on conda for python>=3.11 and switch over from pip
85 changes: 85 additions & 0 deletions conda_recipe/compile_recipe.py
@@ -0,0 +1,85 @@
import os, sys
import stat
from pathlib import Path
import yaml

HERE = Path(os.path.realpath(__file__)).parent
sys.path = list(set([
    str(HERE.joinpath("../").absolute())
] + sys.path))

# import constants from setup.py
from setup import USER, NAME, VERSION, ENTRY_POINTS, SHORT_SUMMARY

# ======================================================
# parse dependencies

with open(HERE.joinpath("../envs/base.yml")) as y:
    raw_deps = yaml.safe_load(y)

def _parse_deps(level: list, compiled: str, depth: int):
    tabs_space = "  " * depth  # two-space YAML indent per level
    for item in level:
        # conda recipes can't have pip;
        # a few pip deps can instead be added into the template by hand,
        # but these will not be tracked!
        if item == "pip" or (isinstance(item, dict) and "pip" in item):
            continue
        if isinstance(item, str):
            compiled += f"{tabs_space}- {item}\n"
        else:
            # nested mapping: render the key and recurse into its values
            k, v = list(item.items())[0]
            compiled += f"{tabs_space}- {k}:\n"
            compiled = _parse_deps(v, compiled, depth + 1)
    compiled = compiled[:-1]  # remove trailing \n
    return compiled

reqs = _parse_deps(raw_deps["dependencies"], "", 2)
python_dep = [d for d in raw_deps["dependencies"] if isinstance(d, str) and d.startswith("python=")]
if len(python_dep) < 1:
    python_dep = ["python=3.11"]
python_ver = _parse_deps(python_dep, "", 2)

# ======================================================
# entry points

entry_points = ""
tabs_space = "  " * 2
for e in ENTRY_POINTS:
    entry_points += f"{tabs_space}- {e}\n"
entry_points = entry_points[:-1]  # remove trailing \n

# ======================================================
# path to tar archive of source code

dist_path = Path(os.path.abspath(HERE.joinpath("../dist")))
assert dist_path.exists(), "did you forget to build the pip package first?"
tarballs = [dist_path.joinpath(f) for f in os.listdir(dist_path) if VERSION in f and f.endswith(".tar.gz")]
assert len(tarballs) > 0, f"no {VERSION} tarball found in {dist_path}"
tar_path = tarballs[0]

# ======================================================
# generate recipe files

with open(HERE.joinpath("meta_template.yaml")) as f:
    template = f.read()
meta_values = {
    "USER": USER,
    "NAME": NAME,
    "SHORT_SUMMARY": SHORT_SUMMARY,
    "VERSION": VERSION,
    "ENTRY": entry_points,
    "REQUIREMENTS": reqs,
    "PYTHON": python_ver,
    "TAR": f"file://{tar_path}",
}
for k, v in meta_values.items():
    template = template.replace(f"<{k}>", v)
with open(HERE.joinpath("meta.yaml"), "w") as f:
    f.write(template)

build_file = HERE.joinpath("call_build.sh")
with open(build_file, "w") as f:
    channels = " ".join(f"-c {ch}" for ch in raw_deps["channels"])
    _here = 'HERE=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )'
    f.write(f"""\
{_here}
conda mambabuild {channels} --output-folder $HERE/../conda_build $HERE/
""")
st = os.stat(build_file)
os.chmod(build_file, st.st_mode | stat.S_IEXEC)
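The dependency-flattening step can be demonstrated in isolation. The following is a simplified, hypothetical stand-in for `_parse_deps` (`render_deps` below is illustrative, not the script's function), showing how string entries from a conda env file become recipe requirement lines while the `pip` section is dropped:

```python
def render_deps(deps, depth=2):
    """Render conda env dependency entries as recipe requirement lines."""
    lines = []
    for item in deps:
        # skip the {"pip": [...]} section: conda recipes can't declare pip deps
        if not isinstance(item, str) or item == "pip":
            continue
        lines.append(f"{'  ' * depth}- {item}")
    return "\n".join(lines)

# a typical "dependencies" list from an environment yaml
deps = ["python=3.11", "numpy", {"pip": ["ete4"]}]
print(render_deps(deps))
# ->     - python=3.11
#        - numpy
```

The rendered block is then pasted under `run:` in the recipe template, which is why the indentation depth is fixed at two YAML levels.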
37 changes: 37 additions & 0 deletions conda_recipe/meta_template.yaml
@@ -0,0 +1,37 @@
package:
  name: <NAME>
  version: <VERSION>

source:
  url: <TAR>

build:
  noarch: python
  number: 0
  script: {{ PYTHON }} -m pip install https://github.com/etetoolkit/ete/archive/ete4.zip && {{ PYTHON }} -m pip install . -v
  entry_points:
<ENTRY>

test:
  imports:
    - <NAME>
  commands:
    - pip check
    - python -c "from lcastar import LcaStar, Lineage"
  requires:
    - pip

requirements:
  host:
    - pip
    - cython
<PYTHON>

  run:
<REQUIREMENTS>

about:
  home: https://github.com/<USER>/<NAME>
  summary: <SHORT_SUMMARY>
  license: 'gpl-v3'
  license_file: ../LICENSE
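For orientation, this is roughly what the compiled `meta.yaml` could look like after `compile_recipe.py` substitutes the placeholders; the version, tarball path, entry point, and dependency lines below are illustrative guesses, not the actual build output:

```yaml
package:
  name: lcastar
  version: "1.0"

source:
  url: file:///path/to/dist/lcastar-1.0.tar.gz

build:
  noarch: python
  number: 0
  script: {{ PYTHON }} -m pip install https://github.com/etetoolkit/ete/archive/ete4.zip && {{ PYTHON }} -m pip install . -v
  entry_points:
    - lcastar = lcastar.cli:main  # hypothetical entry point

requirements:
  host:
    - pip
    - cython
    - python=3.11

  run:
    - python=3.11

about:
  home: https://github.com/hallamlab/lcastar
  summary: an entropy-based measure for taxonomic assignment
  license: 'gpl-v3'
  license_file: ../LICENSE
```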
