python: writing tutorial
bingmann committed Nov 5, 2019
1 parent ae08546 commit 170bb33
Showing 6 changed files with 164 additions and 37 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,7 +4,7 @@ cmake-build-debug/
build/
cmake-build-release
python/docs/_build
python/docs/_generate
python/docs/_generated
*.cobs_cache
*.cobs_classic
*.cobs_compact
24 changes: 24 additions & 0 deletions misc/mkdocs.sh
@@ -0,0 +1,24 @@
#!/bin/bash -x
################################################################################
# mkdocs.sh
#
# Script to build and install python module and then to rebuild docs
#
# All rights reserved. Published under the MIT License in the LICENSE file.
################################################################################

set -e

pushd build/python
make -j8
cp \
cobs_index.cpython-36m-x86_64-linux-gnu.so \
~/.local/lib64/python3.6/site-packages/
popd

pushd python/docs
rm -rf _build _generated
make html
popd

################################################################################
6 changes: 3 additions & 3 deletions python/docs/conf.py
@@ -146,7 +146,7 @@
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
#html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
@@ -164,7 +164,7 @@
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
html_sidebars = {
'**': ['globaltoc.html', 'relations.html', 'sourcelink.html', 'searchbox.html'] }
'**': ['globaltoc.html', 'relations.html', 'searchbox.html'] }

# Additional templates that should be rendered to pages, maps page names to
# template names.
@@ -180,7 +180,7 @@
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
html_show_sourcelink = False

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
8 changes: 7 additions & 1 deletion python/docs/index.rst
@@ -1,3 +1,4 @@
==========================================
COBS: A Compact Bit-Sliced Signature Index
==========================================

@@ -21,8 +22,13 @@ Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal.
In: *26th International Symposium on String Processing and Information Retrieval (SPIRE)*. pages 285-303. Springer. October 2019.
preprint arXiv:1905.09624.

:ref:`See the tutorial page<tutorial>` on how to use COBS in python scripts.

Table of Contents
=================

.. toctree::
:maxdepth: 2
:caption: Contents:

tutorial
cobs_index
118 changes: 118 additions & 0 deletions python/docs/tutorial.rst
@@ -0,0 +1,118 @@
.. -*- mode: rst; mode: flyspell; ispell-local-dictionary: "en_US"; coding: utf-8 -*-
.. _tutorial:
.. currentmodule:: cobs_index

==================================
Tutorial for COBS Python Interface
==================================

Installation
------------

Installing COBS with its Python interface is easy using pip. The `package name
on PyPI <https://pypi.org/project/cobs_index>`_ is ``cobs_index``, and you need CMake
and a recent C++11 compiler to build the C++ library source.

.. code-block:: bash

    $ pip install --user cobs_index

Document Lists
--------------

COBS can read and create an index from the following document types:

- FastA (``.fasta``, ``.fa``, ``.fasta.gz``, ``.fa.gz``)
- FastQ (``.fastq``, ``.fq``, ``.fastq.gz``, ``.fq.gz``)
- McCortex (``.ctx``, ``.cortex``)
- text files (``.txt``)
- MultiFastA (``.mfasta``)

The document types are identified by extension and compressed ``.gz`` files are
handled transparently. The set of k-mers extracted from each file type is
handled slightly differently: for FastA files each continuous subsequence is
broken into k-mers individually, while McCortex files explicitly list all
k-mers, and for text files the entire continuous file is broken into
k-mers. Each document creates one entry in the index, except for MultiFastA, where
each subsequence is considered an individual document.
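
The k-mer extraction described above can be illustrated with a small
pure-Python sketch. It does not use COBS and is not how the library is
implemented. The helper names ``kmers`` and ``fasta_kmers`` are hypothetical,
and the sketch assumes that a "continuous subsequence" is one FastA record and
that k-mers are taken as plain substrings (no canonicalization).

```python
def kmers(sequence, k=31):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fasta_kmers(lines, k=31):
    """Collect the set of k-mers from FastA-style lines (hypothetical helper).

    A header line starting with '>' begins a new sequence, so no k-mer
    spans two sequences.
    """
    result, current = set(), []
    for line in list(lines) + [">"]:  # the sentinel flushes the last sequence
        line = line.strip()
        if line.startswith(">"):
            result.update(kmers("".join(current), k))
            current = []
        else:
            current.append(line)
    return result
```

For example, ``fasta_kmers([">s1", "ACGT", ">s2", "TTTT"], k=3)`` contains
``"ACG"``, ``"CGT"`` and ``"TTT"``, but not ``"GTT"``, because that 3-mer would
cross the boundary between the two sequences.
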

COBS usually scans a directory and creates an index containing all documents it
finds. For more fine-grained control, document lists are represented using
:class:`DocumentList` objects. DocumentLists can be created empty or by scanning
a directory, files can be added, and they contain :class:`DocumentEntry` objects
which can be iterated over.

.. code-block:: python

    import cobs_index as cobs

    doclist1 = cobs.DocumentList("/path/to/documents")
    print("doclist1: ({} entries)".format(len(doclist1)))
    for i, d in enumerate(doclist1):
        print("doc[{}] name {} size {}".format(i, d.name, d.size))

    doclist2 = cobs.DocumentList()
    doclist2.add("/path/to/single/document.fa")
    doclist2.add_recursive("/path/to/documents", cobs.FileType.Fasta)
    print("doclist2: ({} entries)".format(len(doclist2)))
    for i, d in enumerate(doclist2):
        print("doc[{}] name {} size {}".format(i, d.name, d.size))

Index Construction
------------------

Compact indices are constructed using the functions :func:`compact_construct` or
:func:`compact_construct_list`. The first scans a directory for documents and
constructs an index from them, while the latter takes an explicit
:class:`DocumentList`. Note that the output index file *must* end with
``.cobs_compact``.

.. code-block:: python

    cobs.compact_construct("/path/to/documents", "my_index.cobs_compact")

Parameters for index construction may be passed using a
:class:`CompactIndexParameters` object. See the class documentation for a
complete list of parameters. The default parameters are a reasonable choice for
most DNA k-mer applications.

.. code-block:: python

    import cobs_index as cobs

    p = cobs.CompactIndexParameters()
    p.term_size = 31               # k-mer size
    p.clobber = True               # overwrite output and temporary files
    p.false_positive_rate = 0.4    # higher false positive rate -> smaller index
    cobs.compact_construct("/path/to/documents", "my_index.cobs_compact", index_params=p)

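
To give a feeling for why a higher false positive rate yields a smaller index,
the following sketch evaluates the textbook Bloom filter approximation
``p = (1 - exp(-h * n / m)) ** h``, solved for the number of bits ``m``. This
is an illustration only: the function name ``bloom_filter_bits`` and its
``num_hashes`` argument are hypothetical, and it is an assumption that the
sizes COBS computes follow this formula exactly.

```python
import math


def bloom_filter_bits(num_terms, false_positive_rate, num_hashes=1):
    """Approximate bits needed for num_terms elements at the given rate.

    Textbook approximation, not taken from the COBS sources.
    """
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    # p = (1 - exp(-h * n / m)) ** h  =>  m = -h * n / ln(1 - p ** (1 / h))
    per_hash = false_positive_rate ** (1.0 / num_hashes)
    return math.ceil(-num_hashes * num_terms / math.log(1.0 - per_hash))
```

Under this approximation, ``bloom_filter_bits(1000, 0.4)`` is smaller than
``bloom_filter_bits(1000, 0.1)``: accepting more false positives requires
fewer bits per document.
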
Besides compact indices, COBS also constructs and supports "classic"
indices. However, these are usually not used in practice and are thus not
discussed further here.

Querying an Index
-----------------

To query an index, first load it using a :class:`Search` object. The constructor
detects the type of index, reads the metadata, and opens the entire file using
``mmap``.

Querying is performed with the :meth:`Search.search` method. This method returns
**a list containing pairs**: ``(#occurrences, document name)``.

.. code-block:: python

    import cobs_index as cobs

    s = cobs.Search("out.cobs_compact")
    r = s.search("AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT")
    print(r)
    # output: [(20, 'sample1'), (16, 'sample2'), ...]

With the default search parameters **all document scores** are returned. For
large corpora, creating this Python list is a substantial overhead, so the
result set should be limited using either a) the ``threshold`` parameter or b) the
``num_results`` parameter. ``threshold`` sets the fraction of k-mers in the
query that a document must reach to be included in the result, while ``num_results``
simply limits the list size to a given number.
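
The effect of the two limits can be mimicked in plain Python on a result list
of ``(#occurrences, document name)`` pairs. This is only a sketch of the
semantics described above, not the library's implementation. The helper
``limit_results`` is hypothetical, and details such as rounding at the
threshold boundary or treating ``num_results=0`` as "no limit" are assumptions.

```python
def limit_results(results, query_length, term_size=31,
                  threshold=0.0, num_results=0):
    """Filter (count, name) pairs the way the two limits are described.

    Hypothetical helper for illustration; it does not call COBS.
    """
    # A query of length L contains L - k + 1 overlapping k-mers.
    num_kmers = max(query_length - term_size + 1, 0)
    min_count = threshold * num_kmers
    kept = [(count, name) for count, name in results if count >= min_count]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    if num_results > 0:  # assumption: 0 means "no limit"
        kept = kept[:num_results]
    return kept
```

With a 50-character query and 31-mers there are 20 k-mers, so
``threshold=0.5`` keeps only documents matching at least 10 of them, while
``num_results=1`` keeps just the best-scoring document.
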
43 changes: 11 additions & 32 deletions python/module.cpp
@@ -41,14 +41,6 @@ cobs::FileType StringToFileType(std::string& s) {

/******************************************************************************/

cobs::DocumentList doc_list(const std::string& path, std::string file_type)
{
cobs::DocumentList filelist(path, StringToFileType(file_type));
return filelist;
}

/******************************************************************************/

void classic_construct(
const std::string& input, const std::string& out_file,
const cobs::ClassicIndexParameters& index_params,
@@ -97,13 +89,13 @@ namespace py = pybind11;

PYBIND11_MODULE(cobs_index, m) {
m.doc() = R"pbdoc(
COBS Python Interface
---------------------
COBS Python API Reference
-------------------------
.. currentmodule:: cobs_index
.. rubric:: Classes and Types
.. autosummary::
:toctree: _generate
:toctree: _generated
FileType
DocumentEntry
@@ -115,9 +107,8 @@ PYBIND11_MODULE(cobs_index, m) {
.. rubric:: Methods
.. autosummary::
:toctree: _generate
:toctree: _generated
doc_list
classic_construct
classic_construct_list
compact_construct
@@ -133,7 +124,7 @@ PYBIND11_MODULE(cobs_index, m) {
py::arg("disable") = true);

/**************************************************************************/
// DocumentList and doc_list()
// DocumentList

using cobs::FileType;
py::enum_<FileType>(
@@ -176,9 +167,9 @@ PYBIND11_MODULE(cobs_index, m) {
using cobs::DocumentList;
py::class_<DocumentList>(
m, "DocumentList",
"List of DocumentEntry objects returned by doc_list()")
"List of DocumentEntry objects for indexing")
.def(py::init<>(),
"default constructor, construct empty list.")
"default constructor, constructs an empty list.")
.def(py::init<std::string, FileType>(),
"construct and add path recursively.",
py::arg("root"),
@@ -213,18 +204,6 @@ PYBIND11_MODULE(cobs_index, m) {
// essential: keep object alive while iterator exists
py::keep_alive<0, 1>());

m.def(
"doc_list", &doc_list, R"pbdoc(
Read a list of documents and returns them as a DocumentList containing DocumentEntry objects
:param str path: path to documents to list
:param str file_type: filter input documents by file type (any, text, cortex, fasta, etc), default: any
)pbdoc",
py::arg("path"),
py::arg("file_type") = "any");

/**************************************************************************/
// ClassicIndexParameters

@@ -286,7 +265,7 @@ Construct a COBS Classic Index from a path of input files.
)pbdoc",
py::arg("input"),
py::arg("out_file"),
py::arg("index_params"),
py::arg("index_params") = ClassicIndexParameters(),
py::arg("file_type") = "any",
py::arg("tmp_path") = "");

@@ -303,7 +282,7 @@ Construct a COBS Classic Index from a pre-populated DocumentList object.
)pbdoc",
py::arg("list"),
py::arg("out_file"),
py::arg("index_params"),
py::arg("index_params") = ClassicIndexParameters(),
py::arg("tmp_path") = "");

/**************************************************************************/
@@ -365,7 +344,7 @@ Construct a COBS Compact Index from a path of input files.
)pbdoc",
py::arg("input"),
py::arg("out_file"),
py::arg("index_params"),
py::arg("index_params") = CompactIndexParameters(),
py::arg("file_type") = "any",
py::arg("tmp_path") = "");

@@ -382,7 +361,7 @@ Construct a COBS Compact Index from a pre-populated DocumentList object.
)pbdoc",
py::arg("list"),
py::arg("out_file"),
py::arg("index_params"),
py::arg("index_params") = CompactIndexParameters(),
py::arg("tmp_path") = "");

/**************************************************************************/