added documentation
qiyunzhu committed Feb 7, 2021
1 parent 1334f3f commit d9e0703
Showing 10 changed files with 115 additions and 13 deletions.
8 changes: 6 additions & 2 deletions README.md
@@ -25,12 +25,12 @@ Woltka ships with a **QIIME 2 plugin**. [See here for instructions](woltka/q2).
- [Coordinates matching](doc/ordinal.md)
- [Stratification](doc/stratify.md)
- Profile tools
- [Collapsing](doc/collapse.md)
- [Collapse](doc/collapse.md), [Coverage](doc/coverage.md), [Filter](doc/filter.md), [Merge](doc/merge.md)
- Tutorials
- [Working with WoL](doc/wol.md)
- [gOTU analysis](doc/gotu.md)
- For users of
- [QIIME 2](woltka/q2), [Qiita](doc/app.md#qiita), [SHOGUN](doc/wol.md#sequence-alignment), [GTDB](doc/gtdb.md)
- [QIIME 2](woltka/q2), [Qiita](doc/app.md#qiita), [SHOGUN](doc/wol.md#sequence-alignment), [GTDB](doc/gtdb.md), [MetaCyc](doc/metacyc.md)
- References
- [Command-line interface](doc/cli.md)
- [Computational efficiency](doc/perform.md)
@@ -48,6 +48,10 @@ Woltka is a **classifier**. It serves as a middle layer between sequence alignme

Woltka processes **alignments** -- the mappings of query sequences against reference sequences (such as microbial genomes or genes), and infers the best placement of the queries in a hierarchical classification system. One query could have simultaneous matches in multiple references. Woltka finds the most suitable classification unit(s) to describe the query according to the criteria specified by the researcher. Woltka generates **profiles** (feature tables) -- the frequencies (counts) of classification units which describe the composition of samples.

### What else does Woltka do

Woltka provides several utilities for handling feature tables, including collapsing a table to higher-level features, calculating feature group coverage, filtering features based on per-sample abundance, and merging tables.

### What does Woltka not do

Woltka does NOT **align** sequences. You need to align your FastQ (or Fast5, etc.) files against a reference database (we recommend [WoL](https://biocore.github.io/wol/)) using an aligner of your choice (BLAST, Bowtie2, etc.). The resulting alignment files can be fed into Woltka.
20 changes: 19 additions & 1 deletion doc/cli.md
@@ -109,7 +109,9 @@ Option | Description

### Collapse

Collapse a profile based on feature mapping (supports **many-to-many** mapping).
Collapse a profile based on feature mapping (supports **many-to-many** mapping) (details).

* See [profile collapsing](collapse.md) for details.

Option | Description
--- | ---
@@ -118,3 +120,19 @@ Option | Description
`--output`, `-o` (required) | Path to output profile.
`--normalize`, `-z` | Count each target feature as 1 / _k_ (_k_ is the number of targets mapped to a source). Otherwise, count as one.
`--names`, `-n` | Path to mapping of target features to names. The names will be appended to the collapsed profile as a metadata column.


### Coverage

Calculate per-sample coverage of feature groups in a profile.

* See [feature group coverage](coverage.md) for details.

Option | Description
--- | ---
`--input`, `-i` (required) | Path to input profile.
`--map`, `-m` (required) | Path to mapping of source features to target features.
`--output`, `-o` (required) | Path to output profile.
`--threshold`, `-t` | Convert coverage to presence (1) / absence (0) data by this percentage threshold.
`--count`, `-c` | Record numbers of covered features instead of percentages (overrides threshold).
`--names`, `-n` | Path to mapping of feature groups to names. The names will be appended to the coverage table as a metadata column.
9 changes: 5 additions & 4 deletions doc/collapse.md
@@ -53,16 +53,17 @@ source4 <tab> target3
...
```

## Normalization
## Parameters

### Normalization

By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--normalize` or `-z` flag added to the command, each target will be counted 1 / _k_ times.

Whether to enable normalization depends on the nature and aim of your analysis. For example, if one gene is involved in two pathways (which isn't uncommon), should each pathway be counted once, or half a time?
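
The difference between the two counting modes can be sketched in a few lines of Python. This is an illustration only, not Woltka's implementation: the function `collapse`, the feature names, and the choice to drop unmapped sources are all assumptions made for this example.

```python
from collections import defaultdict


def collapse(profile, mapping, normalize=False):
    """Collapse one sample's source counts into target counts.

    profile: dict of source feature -> count.
    mapping: dict of source feature -> list of target features.
    Hypothetical helper for illustration; not part of Woltka's API.
    """
    result = defaultdict(float)
    for source, count in profile.items():
        targets = mapping.get(source, [])
        if not targets:
            continue  # assumption: unmapped sources are dropped
        # with normalization, each of the k targets receives 1 / k of the count
        weight = 1 / len(targets) if normalize else 1
        for target in targets:
            result[target] += count * weight
    return dict(result)


# geneA belongs to two pathways, geneB to one (hypothetical data)
profile = {'geneA': 10, 'geneB': 4}
mapping = {'geneA': ['pathway1', 'pathway2'], 'geneB': ['pathway2']}

plain = collapse(profile, mapping)
normed = collapse(profile, mapping, normalize=True)
```

Without normalization the ten counts of `geneA` are added in full to both pathways, so the collapsed table sums to more than the input. With normalization they are split evenly and the total is preserved.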

### Feature names

## Feature names

Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will not be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").
Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").


## Sample workflow
59 changes: 59 additions & 0 deletions doc/coverage.md
@@ -0,0 +1,59 @@
# Feature group coverage

The **coverage** command calculates the coverage -- percentage of features present in each sample over a pre-defined group of features -- of a profile.

```bash
woltka tools coverage -i input.biom -m mapping.txt -o output.biom
```

A typical use case is to assess the likelihood that **metabolic pathways** are present in each organism or community. Because a pathway consists of _multiple_ chemical **reactions** or functional **genes** connected to each other, the presence of some of them (even with high abundance) in the sample does not necessarily suggest that the entire pathway is viable. Only when all or a large proportion of them are found can we be more confident that the pathway is present.

In this example, the input profile ([sample](../woltka/tests/data/output/truth.metacyc.tsv)) is a table of **genes**:

Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4
--- | --- | --- | --- | ---
_plsC_ | 51 | 49 | 113 | 34
_fruK_ | 83 | 128 | 160 | 41
_panE_ | 0 | 53 | 0 | 39
_leuA_ | 111 | 262 | 232 | 77
... |

The mapping file ([sample](../woltka/tests/data/function/metacyc/pathway_mbrs.txt)) defines the member features (**genes**) of each feature group (**pathway**) (each line can have an arbitrary number of fields; the field delimiter is \<tab\>):

| | | | | | | |
|-|-|-|-|-|-|-|
| Asparagine biosynthesis | _asnB_ | _aspC_ |
| Biotin synthesis | _bioA_ | _bioB_ | _bioD_ | _bioF_ |
| NAD biosynthesis II | _hel_ | _nudC_ | _nadN_ | _pnuE_ | _nadR_ | _nadM_ |
| pyruvate decarboxylation | _aceE_ | _aceF_ | _lpd_ |
| ... |

The output file ([sample](../woltka/tests/data/output/truth.metacyc.coverage.tsv)) is a table of coverage values (percentages) per sample per feature group (**pathway**):

Feature ID | Sample 1 | Sample 2 | Sample 3 | Sample 4
--- | --- | --- | --- | ---
Biotin synthesis | 50.0 | 50.0 | 25.0 | 37.5
GDP-D-rhamnose biosynthesis | 20.0 | 80.0 | 20.0 | 80.0
L-glutamine degradation I | 100.0 | 100.0 | 50.0 | 0.0
Sucrose biosynthesis I | 20.0 | 20.0 | 20.0 | 20.0
... |
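
The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Woltka's code: the function `group_coverage`, the group names "group X" and "group Y", and the gene `geneZ` are hypothetical, while the four gene rows reuse the counts from the input table above.

```python
def group_coverage(profile, groups, n_samples):
    """Percentage of each group's members with a non-zero count, per sample.

    profile: dict of feature -> list of counts (one per sample).
    groups: dict of group -> list of member features.
    Hypothetical helper for illustration; not part of Woltka's API.
    """
    result = {}
    for group, members in groups.items():
        if not members:
            continue  # skip empty groups to avoid division by zero
        row = []
        for i in range(n_samples):
            # a member counts as present if its count in this sample is > 0
            present = sum(
                1 for m in members if m in profile and profile[m][i] > 0)
            row.append(100 * present / len(members))
        result[group] = row
    return result


profile = {
    'plsC': [51, 49, 113, 34],
    'fruK': [83, 128, 160, 41],
    'panE': [0, 53, 0, 39],
    'leuA': [111, 262, 232, 77],
}
groups = {
    'group X': ['panE', 'leuA'],
    'group Y': ['plsC', 'fruK', 'panE', 'geneZ'],  # geneZ is not in the profile
}

cov = group_coverage(profile, groups, n_samples=4)
```

Here "group X" is fully covered only in samples 2 and 4, where _panE_ is detected. "group Y" never reaches 100% because one of its members is absent from the profile altogether.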


## Parameters

### Presence / absence

With parameter `--threshold` or `-t` followed by a percentage (e.g., `80`), the output coverage table will display binary results, with "**1**" representing coverage above or equal to this threshold and "**0**" being coverage below this threshold.

### Feature count

With flag `--count` or `-c`, the program will report the number of member features of a group present in a sample, instead of the percentage. Note: This will override `--threshold`.
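
How these two options interact can be sketched for a single sample and a single feature group. The function `report` below is a hypothetical helper written for this illustration, not Woltka's implementation; it only mirrors the behavior described above (threshold comparison is "above or equal", and count takes precedence over threshold).

```python
def report(present, total, threshold=None, count=False):
    """Format one coverage value according to the chosen options.

    present: number of member features detected in the sample.
    total: number of member features in the group.
    Hypothetical helper for illustration; not part of Woltka's API.
    """
    if count:
        return present  # count overrides threshold
    percent = 100 * present / total
    if threshold is not None:
        # 1 if coverage is above or equal to the threshold, otherwise 0
        return int(percent >= threshold)
    return percent


default = report(3, 4)
below = report(3, 4, threshold=80)
at_threshold = report(4, 5, threshold=80)
counted = report(3, 4, threshold=80, count=True)
```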

### Feature group names

One can supply a mapping of feature groups to their names by `--names` or `-n`, and these names will be appended to the coverage table as a metadata column ("Name").


## Considerations

The coverage command will treat any feature count -- as low as **1** -- as evidence of the feature's presence. False positives may be introduced if the profile is noisy. One may consider **filtering** the profile prior to running this command. Woltka provides a per-sample feature abundance [filtering](filter.md) function, in addition to the multiple filtering functions implemented in the QIIME 2 plugin [feature-table](https://docs.qiime2.org/2020.11/plugins/available/feature-table/).
9 changes: 9 additions & 0 deletions doc/filter.md
@@ -0,0 +1,9 @@
# Per-sample filtering

The **filter** command filters each feature in each sample based on the absolute or relative abundance of that feature in that particular sample. For example, the following command will drop features that are less than 0.01% abundant in each sample:

```bash
woltka tools filter -i input.biom -o output.biom --min-percent 0.01
```

This function is especially useful in shotgun metagenomics, where very-low-abundance false positive assignments are prevalent and cause biases in downstream analyses ([Ye et al, 2019](https://www.cell.com/cell/fulltext/S0092-8674(19)30775-5)).
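
The idea of per-sample relative abundance filtering can be sketched as below. This is an illustration under assumptions rather than Woltka's code: the function `filter_by_percent` and the feature names are hypothetical, and it assumes the percentage is computed against the sample's total count.

```python
def filter_by_percent(sample, min_percent):
    """Drop features below a relative abundance cutoff within one sample.

    sample: dict of feature -> count.
    min_percent: minimum percentage of the sample total needed to keep a feature.
    Hypothetical helper for illustration; not part of Woltka's API.
    """
    total = sum(sample.values())
    if total == 0:
        return {}
    cutoff = total * min_percent / 100
    # keep features at or above the cutoff; drop those below it
    return {feature: n for feature, n in sample.items() if n >= cutoff}


sample = {'G1': 9989, 'G2': 10, 'G3': 1}  # total = 10000
kept = filter_by_percent(sample, min_percent=0.05)  # cutoff = 5 counts
```

Because the cutoff is derived from each sample's own total, the same feature may be retained in one sample and dropped in another.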
9 changes: 9 additions & 0 deletions doc/merge.md
@@ -0,0 +1,9 @@
# Merging profiles

The **merge** command merges two or more profiles into one, while treating overlapping samples and features in an additive way. This is useful when the analysis includes multiple sets of input files (e.g., multiple sequencing runs).

```bash
woltka tools merge -i input1.biom -i input2.biom -i input3.biom -o output.biom
```

The output file from the merge command is **identical** or nearly identical to the output file generated by merging sequence alignment files prior to running Woltka. Small errors (differing by a count of **1**) could be introduced during the normalization of _multiple assignments_ due to floating-point arithmetic, which is usually not a concern. In addition to sticking to one-to-one alignments, one can use classification parameters `--rank free`, `--uniq`, `--major`, or `--above` to prevent these small errors ([see details](classify.md#ambiguous-assignment)).
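
Additive merging can be sketched with nested dictionaries. This is a minimal illustration, not Woltka's implementation: the function `merge` and the sample and feature names are hypothetical, and profiles are represented as plain dictionaries instead of BIOM tables.

```python
from collections import Counter


def merge(*profiles):
    """Merge profiles, summing counts of overlapping samples and features.

    Each profile is a dict of sample -> dict of feature -> count.
    Hypothetical helper for illustration; not part of Woltka's API.
    """
    merged = {}
    for profile in profiles:
        for sample, counts in profile.items():
            # Counter.update adds counts instead of replacing them
            merged.setdefault(sample, Counter()).update(counts)
    return {sample: dict(counts) for sample, counts in merged.items()}


run1 = {'S1': {'G1': 5, 'G2': 3}}
run2 = {'S1': {'G1': 2}, 'S2': {'G3': 7}}

combined = merge(run1, run2)
```

Sample `S1` appears in both runs, so its count of `G1` is summed, while `S2` is carried over unchanged.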
8 changes: 5 additions & 3 deletions doc/output.md
@@ -134,10 +134,12 @@ woltka classify \

In `outmap_dir`, there will be three subdirectories: `phylum`, `genus` and `species`, each holding three read map files: `S1.txt.xz`, `S2.txt.xz` and `S3.txt.xz`.


## Table utilities

Woltka provides several utilities under the `tools` menu for table manipulation (both BIOM and TSV are supported and automatically recognized). Here are details:

- [**collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
- **filter**: Filter a profile by per-sample abundance.
- **merge**: Merge multiple profiles into one profile.
- [**Collapse**](collapse.md): Collapse a profile based on a source-to-target feature mapping; supporting many-to-many relationships.
- [**Coverage**](coverage.md): Calculate per-sample coverage of feature groups, such as the completeness of metabolic pathways or core gene sets.
- [**Filter**](filter.md): Filter features by per-sample abundance.
- [**Merge**](merge.md): Merge multiple profiles into one.
2 changes: 1 addition & 1 deletion woltka/cli.py
@@ -289,7 +289,7 @@ def collapse_cmd(ctx, **kwargs):
help='Names of feature groups to append to the coverage table.')
@click.pass_context
def coverage_cmd(ctx, **kwargs):
"""Calculate coverage of feature groups in a profile.
"""Calculate per-sample coverage of feature groups.
"""
coverage_wf(**kwargs)

2 changes: 1 addition & 1 deletion woltka/q2/README.md
@@ -136,7 +136,7 @@ qiime diversity core-metrics-phylogenetic \
Before moving to the next step (such as the command above), it is recommended to consider **filtering** the feature table by per-sample abundance. For example:

```bash
qiime woltka filter \
qiime woltka psfilter \
--i-table table.qza \
--min-percent 0.01 \
--o-filtered-table filtered.qza
2 changes: 1 addition & 1 deletion woltka/q2/plugin_setup.py
@@ -201,7 +201,7 @@
outputs=[('coverage_table', FeatureTable[Frequency])],
output_descriptions={'coverage_table': 'Feature group coverage table.'},
name='Group coverage calculator',
description='Calculate a feature table\'s coverage over feature groups.',
description='Calculate per-sample coverage of feature groups.',
citations=[]
)
