Minor documentation updates ahead of 0.5.0 release (FunctionLab#171)

* add CLI script back
* fix URL on homepage
* bug in master for the dev version and setup.py mismatch
* update links in overview docs
* add RELEASE_NOTES file
* wording change
* test .rst bullet points change
rfriedman22 · Jun 7, 2021 · 6fdaa86
1 parent dba5855
commit 6fdaa86

Showing 8 changed files with 46 additions and 26 deletions.
diff --git a/README.md b/README.md
@@ -4,6 +4,8 @@
 
 Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.
 
+Please see our [release notes](./RELEASE_NOTES.md) for the latest updates to Selene.
+
 ## Installation
 
 We recommend using Selene with Python 3.6 or above. 

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -0,0 +1,21 @@
+# Release notes 
+
+This is a document describing new functionality, bug fixes, breaking changes, etc. associated with Selene version releases from v0.5.0 onwards. 
+
+## Version 0.5.0
+
+### New functionality
+- `sampler.MultiSampler`: `MultiSampler` accepts any Selene sampler for each of the train, validation, and test partitions where previously `MultiFileSampler` only accepted `FileSampler`s. We will deprecate `MultiFileSampler` in our next major release. 
+- `DataLoader`: Parallel data loading based on PyTorch's `DataLoader` class, which can be used with Selene's `MultiSampler` and `MultiFileSampler` classes (see: `sampler.SamplerDataLoader`, `sampler.H5DataLoader`).
+- To support parallelism via multiprocessing, the sampler that `SamplerDataLoader` uses needs to be picklable. To enable this, file-opening operations are deferred until a method that needs the file is called. There is no change to the API, and setting `init_unpicklable=True` in `__init__` for `Genome` and all `OnlineSampler` classes will fully reproduce the functionality in `selene_sdk<=0.4.8`.
+- `sampler.RandomPositionsSampler`: added support for `center_bin_to_predict` taking a list/tuple of two integers that specifies the region from which to query the targets. By default (`center_bin_to_predict=<int>`), targets are queried based on the center bin size, but start and end integers that are not at the center can be specified instead if desired.
+- `EvaluateModel`: accepts a list of metrics (by default computing ROC AUC and average precision) with which to evaluate the test dataset. 
+
+### Usage
+- **Command-line interface (CLI)**: You can now run the CLI directly with `python -m selene_sdk` (if you have cloned the repository, make sure you have locally installed `selene_sdk` via `python setup.py install`, or that `selene_sdk` is in the same directory as your script or has been added to `PYTHONPATH`). Developers can make a copy of the `selene_sdk/cli.py` script and use it the same way that `selene_cli.py` was used in earlier versions of Selene (`python -u cli.py <config-yml> [--lr]`).
+
+### Bug fixes
+- `EvaluateModel`: `use_features_ord` allows you to evaluate a trained model on only a subset of chromatin features (targets) predicted by the model. If you are using a `FileSampler` for your test dataset, you now have the option to pass in a subsetted matrix; this matrix must be ordered the same way as `features` (the original targets prediction ordering), not the same way as `use_features_ord`. The final model predictions and targets
+ (`test_predictions.npz` and `test_targets.npz`), however, will be output according to the `use_features_ord` list and ordering.
+- `MatFileSampler`: Previously the `MatFileSampler` reset the pointer to the start of the matrix too early (going back to the first sample before we had finished sampling the whole matrix). 
+- CLI learning rate: Edge cases (e.g. not specifying the learning rate via CLI or config) previously were not handled correctly and did not throw an informative error. 
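
The release note on picklable samplers describes deferring file opening until a method actually needs the file. A minimal sketch of that general pattern follows. `LazyFileReader` is a hypothetical class written for illustration and is not part of `selene_sdk`; its `init_unpicklable` flag only mirrors the name used in the note.

```python
import os
import pickle
import tempfile


class LazyFileReader:
    """Hypothetical reader (not a Selene class) that opens its file lazily."""

    def __init__(self, path, init_unpicklable=False):
        self.path = path
        self._handle = None  # no open handle yet, so the object can be pickled
        if init_unpicklable:
            # Open eagerly, imitating the older behavior described in the note.
            self._open()

    def _open(self):
        if self._handle is None:
            self._handle = open(self.path, "r")

    def read_line(self):
        self._open()  # the deferred open happens on first real use
        return self._handle.readline().rstrip("\n")

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None


def demo():
    """Return the line read by a pickled copy, and whether eager mode pickles."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("ACGT\n")
    try:
        reader = LazyFileReader(tmp.name)
        # Round-trips because no file handle is stored on the object yet.
        clone = pickle.loads(pickle.dumps(reader))
        line = clone.read_line()
        clone.close()

        eager = LazyFileReader(tmp.name, init_unpicklable=True)
        try:
            pickle.dumps(eager)
            eager_picklable = True
        except TypeError:
            # Open file objects cannot be pickled.
            eager_picklable = False
        finally:
            eager.close()
        return line, eager_picklable
    finally:
        os.remove(tmp.name)
```

Calling `demo()` exercises both modes: the lazily initialized reader survives a pickle round trip, which is what worker processes need, while the eagerly opened one does not.
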
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -10,7 +10,7 @@ Welcome! This is the documentation for Selene, a PyTorch-based deep learning lib
 The Github repository is located `here <https://github.com/FunctionLab/selene>`_.
 
 The documentation here corresponds to the latest version of Selene (i.e. up-to-date with `master`). 
-You can view the documentation for Selene `version 0.4.8 here`<http://selene.flatironinstitute.org/0.4.8/>`,
+You can view the documentation for Selene `version 0.4.8 here <http://selene.flatironinstitute.org/0.4.8/>`_,
 and we will add other older versions of the library docs to the website soon. 
 
 .. toctree::
@@ -40,6 +40,6 @@ and we will add other older versions of the library docs to the website soon.
 Indices and tables
 ==================
 
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
+- :ref:`genindex`
+- :ref:`modindex`
+- :ref:`search`
diff --git a/docs/source/overview/cli.rst b/docs/source/overview/cli.rst
@@ -26,12 +26,12 @@ Selene's CLI accepts configuration files in the `YAML <https://docs.ansible.com/
 We recommend you start off by using one of the `example configuration files <https://github.com/FunctionLab/selene/tree/master/config_examples>`_ provided in the repository as a template for your own configuration file:
 
 
-* `Training configuration <https://github.com/FunctionLab/selene/blob/master/config_examples/train.yml>`_
-* `Evaluate with test BED file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_bed.yml>`_
-* `Evaluate with test matrix file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_mat.yml>`_
-* `Get predictions from trained model <https://github.com/FunctionLab/selene/blob/master/config_examples/get_predictions.yml>`_
-* `\ *In silico* mutagenesis <https://github.com/FunctionLab/selene/blob/master/config_examples/in_silico_mutagenesis.yml>`_
-* `Variant effect prediction <https://github.com/FunctionLab/selene/blob/master/config_examples/variant_effect_prediction.yml>`_
+- `Training configuration <https://github.com/FunctionLab/selene/blob/master/config_examples/train.yml>`_
+- `Evaluate with test BED file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_bed.yml>`_
+- `Evaluate with test matrix file <https://github.com/FunctionLab/selene/blob/master/config_examples/evaluate_test_mat.yml>`_
+- `Get predictions from trained model <https://github.com/FunctionLab/selene/blob/master/config_examples/get_predictions.yml>`_
+- `\ *In silico* mutagenesis <https://github.com/FunctionLab/selene/blob/master/config_examples/in_silico_mutagenesis.yml>`_
+- `Variant effect prediction <https://github.com/FunctionLab/selene/blob/master/config_examples/variant_effect_prediction.yml>`_
 
 There are also various configuration files associated with the Jupyter notebook `tutorials <https://github.com/FunctionLab/selene/tree/master/tutorials>`_ and `manuscript <https://github.com/FunctionLab/selene/tree/master/manuscript>`_ case studies that you may use as a starting point.
 

diff --git a/docs/source/overview/overview.rst b/docs/source/overview/overview.rst
@@ -12,21 +12,21 @@ Sampling
 
 We start with the modules for sampling data because both training and evaluating a model in Selene will require a user to specify the kind of sampler they want to use. 
 
-*sequences* submodule (\ `API <http://selene.flatironinstitute.org/sequences.html>`_\ )
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+*sequences* submodule (\ `API <../sequences.html>`_\ )
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The *sequences* submodule defines the ``Sequence`` type, and includes implementations for several sub-classes.
 These sub-classes--\ ``Genome`` and ``Proteome``\ --represent different kinds of biological sequences (e.g. DNA, RNA, amino acid sequences), and implement the ``Sequence`` interface’s methods for reading the reference sequence from files (e.g. FASTA), querying subsequences of the reference sequence, and subsequently converting those queried subsequences into a numeric representation.
 Further, each sequence class specifies its own alphabet (e.g., nucleotides, amino acids) to represent query results as strings.
 
-*targets* submodule (\ `API <http://selene.flatironinstitute.org/targets.html>`_\ )
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+*targets* submodule (\ `API <../targets.html>`_\ )
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The *targets* submodule defines the ``Target`` class, which specifies the interface for classes to retrieve labels or “targets” for a given query sequence.
 At present, we supply a single implementation of this interface: ``GenomicFeatures``.
 This class takes a tabix-indexed file of intervals for each label we want our model to predict, and uses this file to identify the labels for a given sequence drawn from the reference.
 
-*samplers* submodule (\ `API <http://selene.flatironinstitute.org/samplers.html>`_\ )
+*samplers* submodule (\ `API <../samplers.html>`_\ )
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The *samplers* submodule provides methods and classes for randomly sampling and partitioning datasets for training and evaluation.
@@ -36,7 +36,7 @@ Further, a file of names must be provided for the features to be predicted.
 We provide several implementations adhering to the ``Sampler`` interface: the ``RandomPositionsSampler``\ , ``IntervalsSampler``\ , and ``MultiFileSampler``.
 
 ``MultiFileSampler`` draws samples from structured data files for each partition.
-There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``\ , respectively (see `API docs for file samplers <http://selene.flatironinstitute.org/samplers.file_samplers.html>`_\ ).
+There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``\ , respectively (see `API docs for file samplers <../samplers.file_samplers.html>`_\ ).
 It is worth noting that the .bed file used by ``BedFileSampler`` includes the coordinates of each sequence, and the indices corresponding to each feature for which said sequence is a positive example.
 We hope that users will request or contribute classes for other file samplers in the future.
 ``MultiFileSampler`` does not support saving the sampled data to a file, so calling the ``save_dataset_to_file`` method from this class will have no effect.
@@ -47,7 +47,7 @@ These samplers automatically partition said data according to user-specified par
 Since ``OnlineSampler``\ ’s samples are randomly generated, we allow the user to save the sampled data to file.
 This file can be subsequently loaded with the ``BedFileSampler``. They rely on classes from the *sequences* and *targets* submodules for retrieving each sequence and its targets in the proper matrix format. 
 
-Training a model (\ `API <http://selene.flatironinstitute.org/selene.html#trainmodel>`_\ )
+Training a model (\ `API <../selene.html#trainmodel>`_\ )
 ------------------------------------------------------------------------------------------
 
 The ``TrainModel`` class may be used for training and testing of sequence-based models, and provides the core functionality of the CLI’s train command.
@@ -58,14 +58,14 @@ The model’s loss, area under the receiver operating characteristic curve (AUC)
 The frequency of logging is provided by the user.
 At the end of evaluation, ``TrainModel`` logs the performance metrics for each feature predicted, and produces plots of the precision recall and receiver operating characteristic curves.
 
-Evaluating a model (\ `API <http://selene.flatironinstitute.org/selene.html#evaluatemodel>`_\ )
+Evaluating a model (\ `API <../selene.html#evaluatemodel>`_\ )
 -----------------------------------------------------------------------------------------------
 
 The ``EvaluateModel`` class is used to test the performance of a trained model. 
 ``EvaluateModel`` uses an instance of ``Sampler`` class or subclass to draw samples from a test set.
 After using the provided model to predict labels for said data, ``EvaluateModel`` logs the performance measures (as described in "Training a model") and generates figures and a performance breakdown by feature.
 
-Using a model to make predictions (\ `API <http://selene.flatironinstitute.org/predict.html>`_\ )
+Using a model to make predictions (\ `API <../predict.html>`_\ )
 -------------------------------------------------------------------------------------------------
 
 Selene’s ``predict`` submodule includes a number of methods and classes for making predictions with sequence-based models. 
@@ -74,14 +74,14 @@ It leverages a user-specified trained model to make predictions for sequences se
 In each case, the user can specify what ``AnalyzeSequences`` should save: raw predictions, difference scores, absolute difference scores, and/or logit scores.
 Note that the aforementioned “scores” can only be computed for *in silico* mutagenesis and variant effect prediction. 
 
-Visualizing model predictions (\ `API <http://selene.flatironinstitute.org/interpret.html>`_\ )
+Visualizing model predictions (\ `API <../interpret.html>`_\ )
 -----------------------------------------------------------------------------------------------
 
 The ``interpret`` submodule of ``selene_sdk`` provides methods for visualizing a sequence-based model’s predictions made with ``AnalyzeSequences``.
 For example, ``interpret`` includes methods for processing variant effect predictions made with ``AnalyzeSequences`` and subsequently visualizing them with a heatmap or sequence logo.
 The functionality included in the ``interpret`` submodule is not heavily incorporated into the CLI, but is instead intended for incorporation into user code.
 
-The utilities submodule (\ `API <http://selene.flatironinstitute.org/utils.html>`_\ )
+The utilities submodule (\ `API <../utils.html>`_\ )
 -------------------------------------------------------------------------------------
 
 Unlike the aforementioned submodules designed around individual concepts, the ``utils`` submodule is a catch-all submodule intended to prevent cluttering of the ``selene_sdk`` top-level namespace. 
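
The overview's description of the *sequences* submodule mentions converting queried subsequences into a numeric representation and back into strings over an alphabet. A minimal sketch of one such representation (one-hot encoding) follows. The function names, the alphabet order, and the all-zero treatment of unknown characters are assumptions made for illustration; they are not taken from Selene's `Genome` or `Proteome` classes.

```python
# Hypothetical alphabet order, chosen for this sketch only.
BASES = ("A", "C", "G", "T")


def one_hot_encode(sequence, alphabet=BASES):
    """Encode a string as a list of rows, one row per character."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for char in sequence.upper():
        row = [0.0] * len(alphabet)
        if char in index:
            row[index[char]] = 1.0
        # Characters outside the alphabet (e.g. N) stay all-zero here;
        # this is a simplification, not necessarily what Selene does.
        encoded.append(row)
    return encoded


def one_hot_decode(encoded, alphabet=BASES, unknown="N"):
    """Convert one-hot rows back into a string over the alphabet."""
    chars = []
    for row in encoded:
        if 1.0 in row:
            chars.append(alphabet[row.index(1.0)])
        else:
            chars.append(unknown)
    return "".join(chars)
```

For example, `one_hot_decode(one_hot_encode("acgn"))` gives back the upper-cased string with the unknown character preserved as `N`.
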

diff --git a/docs/source/tutorials/analyzing_mutations_with_trained_models.nblink b/docs/source/tutorials/analyzing_mutations_with_trained_models.nblink
diff --git a/selene_sdk/samplers/dataloader.py b/selene_sdk/samplers/dataloader.py
@@ -1,5 +1,5 @@
 """
-This module provides the `SamplerDataLoader` and `SamplerDataSet` classes,
+This module provides the `SamplerDataLoader` and `SamplerDataset` classes,
 which allow parallel sampling for any Sampler using
 torch DataLoader mechanism.
 """

diff --git a/setup.py b/setup.py
@@ -25,7 +25,7 @@
 cmdclass = {'build_ext': build_ext}
 
 setup(name="selene-sdk",
- version="0.4.8",
+ version="0.5.dev0",
  long_description=long_description,
  long_description_content_type='text/markdown',
  description=("framework for developing sequence-level "