added documentation

qiyunzhu · Feb 7, 2021 · d9e0703
1 parent 1334f3f
commit d9e0703
Showing 10 changed files with 115 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -25,12 +25,12 @@ Woltka ships with a **QIIME 2 plugin**. [See here for instructions](woltka/q2).
   - [Coordinates matching](doc/ordinal.md)
   - [Stratification](doc/stratify.md)
 - Profile tools
-  - [Collapsing](doc/collapse.md)
+  - [Collapse](doc/collapse.md), [Coverage](doc/coverage.md), [Filter](doc/filter.md), [Merge](doc/merge.md)
 - Tutorials
   - [Working with WoL](doc/wol.md)
   - [gOTU analysis](doc/gotu.md)
 - For users of
-  - [QIIME 2](woltka/q2), [Qiita](doc/app.md#qiita), [SHOGUN](doc/wol.md#sequence-alignment), [GTDB](doc/gtdb.md)
+  - [QIIME 2](woltka/q2), [Qiita](doc/app.md#qiita), [SHOGUN](doc/wol.md#sequence-alignment), [GTDB](doc/gtdb.md), [MetaCyc](doc/metacyc.md)
 - References
   - [Command-line interface](doc/cli.md)
   - [Computational efficiency](doc/perform.md)
@@ -48,6 +48,10 @@ Woltka is a **classifier**. It serves as a middle layer between sequence alignme
 
 Woltka processes **alignments** -- the mappings of query sequences against reference sequences (such as microbial genomes or genes), and infers the best placement of the queries in a hierarchical classification system. One query could have simultaneous matches in multiple references. Woltka finds the most suitable classification unit(s) to describe the query according to the criteria specified by the researcher. Woltka generates **profiles** (feature tables) -- the frequencies (counts) of classification units which describe the composition of samples.
 
+### What else does Woltka do
+
+Woltka provides several utilities for handling feature tables, including collapsing a table to higher-level features, calculating feature group coverage, filtering features based on per-sample abundance, and merging tables.
+
 ### What does Woltka not do
 
 Woltka does NOT **align** sequences. You need to align your FastQ (or Fast5, etc.) files against a reference database (we recommend [WoL](https://biocore.github.io/wol/)) using an aligner of your choice (BLAST, Bowtie2, etc.). The resulting alignment files can be fed into Woltka.

diff --git a/doc/cli.md b/doc/cli.md
@@ -109,7 +109,9 @@ Option | Description
 
 ### Collapse
 
-Collapse a profile based on feature mapping (supports **many-to-many** mapping).
+Collapse a profile based on feature mapping (supports **many-to-many** mapping).
+
+* See [profile collapsing](collapse.md) for details.
 
 Option | Description
 --- | ---
@@ -118,3 +120,19 @@ Option | Description
 `--output`, `-o` (required) | Path to output profile.
 `--normalize`, `-z` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
 `--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.
+
+
+### Coverage
+
+Calculate per-sample coverage of feature groups in a profile.
+
+* See [feature group coverage](coverage.md) for details.
+
+Option | Description
+--- | ---
+`--input`, `-i` (required) | Path to input profile.
+`--map`, `-m` (required) | Path to mapping of source features to target features.
+`--output`, `-o` (required) | Path to output profile.
+`--threshold`, `-t` | Convert coverage to presence (1) / absence (0) data by this percentage threshold.
+`--count`, `-c` | Record numbers of covered features instead of percentages (overrides threshold).
+`--names`, `-n` | Path to mapping of feature groups to names. The names will be appended to the coverage table as a metadata column.
diff --git a/doc/collapse.md b/doc/collapse.md
@@ -53,16 +53,17 @@ source4 <tab> target3
 ...
 ```
 
-## Normalization
+## Parameters
+
+### Normalization
 
 By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--normalize` or `-z` flag added to the command, each target will be counted 1 / _k_ times.
 
 Whether to enable normalization depends on the nature and aim of your analysis. For example, if one gene is involved in two pathways (which isn't uncommon), should each pathway be counted once, or half a time?
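
The difference between the two counting modes can be illustrated with a minimal sketch. This is not Woltka's implementation; the function name `collapse`, the nested-dict profile layout, and the choice to drop unmapped source features are assumptions made only for illustration.

```python
from collections import defaultdict


def collapse(profile, mapping, normalize=False):
    """Collapse source features into target features (illustrative only).

    profile: {source_feature: {sample: count}}   (assumed layout)
    mapping: {source_feature: [target_feature, ...]}
    normalize: if True, each of the k targets receives count / k.
    """
    result = defaultdict(lambda: defaultdict(float))
    for source, counts in profile.items():
        targets = mapping.get(source, [])
        if not targets:
            continue  # assumption: unmapped sources are dropped
        weight = 1 / len(targets) if normalize else 1
        for target in targets:
            for sample, count in counts.items():
                result[target][sample] += count * weight
    return {target: dict(row) for target, row in result.items()}


# One gene (G1) belongs to two pathways; another (G2) belongs to one.
profile = {"G1": {"S1": 4}, "G2": {"S1": 6}}
mapping = {"G1": ["P1", "P2"], "G2": ["P2"]}

plain = collapse(profile, mapping)
normalized = collapse(profile, mapping, normalize=True)
```

Without normalization, the 4 counts of G1 are added in full to both P1 and P2; with normalization, each pathway receives 2, so the table total is preserved.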
 
+### Feature names
 
-## Feature names
-
-Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will not be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").
+Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").
 
 
 ## Sample workflow

diff --git a/doc/coverage.md b/doc/coverage.md
@@ -0,0 +1,59 @@
+# Feature group coverage
+
+The **coverage** command calculates the coverage -- percentage of features present in each sample over a pre-defined group of features -- of a profile.
+
+```bash
+woltka tools coverage -i input.biom -m mapping.txt -o output.biom
+```
+
+A typical use case is to assess how likely **metabolic pathways** are to be present in each organism or community. Because a pathway consists of _multiple_ chemical **reactions** or functional **genes** connected to each other, the presence of some of them (even with high abundance) in the sample does not necessarily suggest that the entire pathway is viable. Only when all or a large proportion of them are found can we be more confident about this hypothesis.
+
+In this example, the input profile ([sample](../woltka/tests/data/output/truth.metacyc.tsv)) is a table of **genes**:
+
+Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4
+--- | --- | --- | --- | ---
+_plsC_ | 51 | 49 | 113 | 34
+_fruK_ | 83 | 128 | 160 | 41
+_panE_ | 0 | 53 | 0 | 39
+_leuA_ | 111 | 262 | 232 | 77
+... |
+
+The mapping file ([sample](../woltka/tests/data/function/metacyc/pathway_mbrs.txt)) defines the member features (**genes**) of each feature group (**pathway**) (each line can have an arbitrary number of fields; the field delimiter is \<tab\>):
+
+| | | | | | | |
+|-|-|-|-|-|-|-|
+| Asparagine biosynthesis | _asnB_ | _aspC_ |
+| Biotin synthesis | _bioA_ | _bioB_ | _bioD_ | _bioF_ |
+| NAD biosynthesis II | _hel_ | _nudC_ | _nadN_ | _pnuE_ | _nadR_ | _nadM_ |
+| pyruvate decarboxylation | _aceE_ | _aceF_ | _lpd_ |
+| ... |
+
+The output file ([sample](../woltka/tests/data/output/truth.metacyc.coverage.tsv)) is a table of coverage values (percentages) per sample per feature group (**pathway**):
+
+Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4
+--- | --- | --- | --- | ---
+Biotin synthesis | 50.0 | 50.0 | 25.0 | 37.5
+GDP-D-rhamnose biosynthesis | 20.0 | 80.0 | 20.0 | 80.0
+L-glutamine degradation I | 100.0 | 100.0 | 50.0 | 0.0
+Sucrose biosynthesis I | 20.0 | 20.0 | 20.0 | 20.0
+... |
+
+
+## Parameters
+
+### Presence / absence
+
+With parameter `--threshold` or `-t` followed by a percentage (e.g., `80`), the output coverage table will display binary results, with "**1**" representing coverage above or equal to this threshold and "**0**" being coverage below this threshold.
+
+### Feature count
+
+With flag `--count` or `-c`, the program will report the number of member features of a group present in a sample, instead of the percentage. Note: This will override `--threshold`.
+
+### Feature group names
+
+One can supply a mapping of feature groups to their names by `--names` or `-n`, and these names will be appended to the coverage table as a metadata column ("Name").
+
+
+## Considerations
+
+The coverage command will treat any feature count -- as low as **1** -- as evidence of the feature's presence. False positives may be introduced if the profile is noisy. One may consider **filtering** the profile prior to running this command. Woltka provides a per-sample feature abundance [filtering](filter.md) function, in addition to the multiple filtering functions implemented in the QIIME 2 plugin [feature-table](https://docs.qiime2.org/2020.11/plugins/available/feature-table/).
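
The coverage calculation described in this file, including the `--threshold` and `--count` behaviors, can be sketched roughly as follows. This is a simplified illustration rather than Woltka's code: the function name `coverage`, the nested-dict profile layout, and the skipping of empty groups are assumptions.

```python
def coverage(profile, groups, threshold=None, count=False):
    """Per-sample coverage of feature groups (illustrative only).

    profile: {feature: {sample: count}}   (assumed layout)
    groups:  {group: [member_feature, ...]}
    threshold: if given, report 1 / 0 for coverage >= / < this percentage.
    count: if True, report the number of members present (overrides threshold).
    """
    samples = sorted({s for counts in profile.values() for s in counts})
    result = {}
    for group, members in groups.items():
        if not members:
            continue  # assumption: empty groups are skipped
        row = {}
        for sample in samples:
            # Any non-zero count is treated as evidence of presence.
            present = sum(
                1 for m in members if profile.get(m, {}).get(sample, 0) > 0
            )
            if count:
                row[sample] = present
            else:
                pct = 100 * present / len(members)
                row[sample] = pct if threshold is None else int(pct >= threshold)
        result[group] = row
    return result


profile = {"bioA": {"S1": 3, "S2": 0}, "bioB": {"S1": 1, "S2": 5}}
groups = {"Biotin synthesis": ["bioA", "bioB", "bioD", "bioF"]}

percentages = coverage(profile, groups)
presence = coverage(profile, groups, threshold=50)
counts = coverage(profile, groups, count=True)
```

In this toy input, two of the four members are present in S1 and one in S2, so the three modes report percentages, a 1/0 call at the 50% threshold, and raw member counts, respectively.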
diff --git a/doc/filter.md b/doc/filter.md
@@ -0,0 +1,9 @@
+# Per-sample filtering
+
+The **filter** command filters each feature in each sample based on the absolute or relative abundance of that feature in that particular sample. For example, the following command will drop features that are less than 0.01% abundant in each sample:
+
+```bash
+woltka tools filter -i input.biom -o output.biom --min-percent 0.01
+```
+
+This function is especially useful in shotgun metagenomics, where very-low-abundance false positive assignments are prevalent and cause biases in downstream analyses ([Ye et al., 2019](https://www.cell.com/cell/fulltext/S0092-8674(19)30775-5)).
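
As a rough illustration of per-sample filtering by relative abundance, the sketch below computes a cutoff from each sample's own total and keeps only the features at or above it. The function name `filter_by_percent` and the sample-first dict layout are hypothetical, and treating the cutoff as inclusive is an assumption, not a statement about Woltka's behavior.

```python
def filter_by_percent(profile, min_percent):
    """Drop features below a per-sample percentage cutoff (illustrative only).

    profile: {sample: {feature: count}}   (assumed layout)
    min_percent: minimum relative abundance, in percent of the sample total.
    """
    result = {}
    for sample, counts in profile.items():
        total = sum(counts.values())
        cutoff = total * min_percent / 100
        # Assumption: features exactly at the cutoff are kept; zeros are dropped.
        result[sample] = {
            feature: count
            for feature, count in counts.items()
            if count > 0 and count >= cutoff
        }
    return result


profile = {
    "S1": {"A": 985, "B": 14, "C": 1},  # total 1000, 1% cutoff = 10
    "S2": {"A": 5, "C": 95},            # total 100, 1% cutoff = 1
}
filtered = filter_by_percent(profile, min_percent=1)
```

Note that feature C is removed from S1 but retained in S2: the decision is made independently in each sample, which is what distinguishes this from filtering on table-wide totals.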
diff --git a/doc/merge.md b/doc/merge.md
@@ -0,0 +1,9 @@
+# Merging profiles
+
+The **merge** command merges two or more profiles into one, while treating overlapping samples and features in an additive way. This is useful when the analysis includes multiple sets of input files (e.g., multiple sequencing runs).
+
+```bash
+woltka tools merge -i input1.biom -i input2.biom -i input3.biom -o output.biom
+```
+
+The output file from the merge command is **identical** or nearly identical to the output file generated by merging sequence alignment files prior to running Woltka. Small errors (differing by a count of **1**) could be introduced during the normalization of _multiple assignments_ due to floating point arithmetic issues, which is usually not troublesome. In addition to sticking to one-to-one alignments, one can use classification parameters `--rank free`, `--uniq`, `--major`, or `--above` to prevent small errors ([see details](classify.md#ambiguous-assignment)).
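
The "additive" treatment of overlapping samples and features can be pictured with the following minimal sketch. It is not Woltka's implementation; the function name `merge` and the sample-first dict layout are assumptions chosen for brevity.

```python
def merge(profiles):
    """Merge profiles, summing counts of shared sample/feature pairs.

    profiles: iterable of {sample: {feature: count}}   (assumed layout)
    """
    merged = {}
    for profile in profiles:
        for sample, counts in profile.items():
            row = merged.setdefault(sample, {})
            for feature, count in counts.items():
                row[feature] = row.get(feature, 0) + count
    return merged


run1 = {"S1": {"A": 1, "B": 2}}
run2 = {"S1": {"B": 3}, "S2": {"A": 4}}
combined = merge([run1, run2])
```

Here sample S1 appears in both runs, so its count for feature B is summed, while S2 and feature A of S1 are carried over unchanged.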
diff --git a/doc/output.md b/doc/output.md
@@ -134,10 +134,12 @@ woltka classify \
 
 In `outmap_dir`, there will be three subdirectories: `phylum`, `genus` and `species`, each holding three read map files: `S1.txt.xz`, `S2.txt.xz` and `S3.txt.xz`.
 
+
 ## Table utilities
 
 Woltka provides several utilities under the `tools` menu for table manipulation (both BIOM and TSV are supported and automatically recognized). Here are details:
 
-- [**collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
-- **filter**: Filter a profile by per-sample abundance.
-- **merge**: Merge multiple profiles into one profile.
+- [**Collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
+- [**Coverage**](coverage.md): Calculate per-sample coverage of feature groups, such as the completeness of metabolic pathways or core gene sets.
+- [**Filter**](filter.md): Filter features by per-sample abundance.
+- [**Merge**](merge.md): Merge multiple profiles into one.
diff --git a/woltka/cli.py b/woltka/cli.py
@@ -289,7 +289,7 @@ def collapse_cmd(ctx, **kwargs):
     help='Names of feature groups to append to the coverage table.')
 @click.pass_context
 def coverage_cmd(ctx, **kwargs):
-    """Calculate coverage of feature groups in a profile.
+    """Calculate per-sample coverage of feature groups.
     """
     coverage_wf(**kwargs)
 

diff --git a/woltka/q2/README.md b/woltka/q2/README.md
@@ -136,7 +136,7 @@ qiime diversity core-metrics-phylogenetic \
 Before moving to the next step (such as the command above), it is recommended to consider **filtering** the feature table by per-sample abundance. For example:
 
 ```bash
-qiime woltka filter \
+qiime woltka psfilter \
   --i-table table.qza \
   --min-percent 0.01 \
   --o-filtered-table filtered.qza

diff --git a/woltka/q2/plugin_setup.py b/woltka/q2/plugin_setup.py
@@ -201,7 +201,7 @@
     outputs=[('coverage_table', FeatureTable[Frequency])],
     output_descriptions={'coverage_table': 'Feature group coverage table.'},
     name='Group coverage calculator',
-    description='Calculate a feature table\'s coverage over feature groups.',
+    description='Calculate per-sample coverage of feature groups.',
     citations=[]
 )