Skip to content

A tool for an in-depth comparison and benchmarking of peptide encodings.

License

Notifications You must be signed in to change notification settings

spaenigs/peptidereactor

Repository files navigation

logo

A workflow for an in-depth comparison and benchmarking of peptide encodings. All computations are highly parallelized and work efficiently across multiple datasets and encodings. For a thorough introduction refer to Spänig et al. (2021). The visualizations can be interactively accessed at https://peptidereactor.mathematik.uni-marburg.de/.

Overview

image info

The emphasis is put on high-throughput processing of an arbitrary amount of input datasets (arrows), followed by the preprocessing, encoding, and postprocessing, generating the final output (top). The preprocessing includes sanitizing of the input sequences and the tertiary structure approximation, among others. Afterwards, the sequences as well as the accompanied structures are used for the encoding. The postprocessing involves machine learning and the visual preparation of the analyses, among others.

Installation

  1. Clone this repo:
    git clone [email protected]:spaenigs/peptidereactor.git.

  2. cd into the root directory (peptidereactor/)

  3. Install conda:
    peptidereactor/conda/install.sh.

  4. Create the conda environment:
    peptidereactor/conda/create_env.sh

  5. Install docker:

    • Ubuntu: ./peptidereactor/install_docker_io
    • Other distros: ./peptidereactor/install_docker_ce
  6. Build images:

    • ./peptidereactor/docker/build_image
    • ./peptidereactor-vis/docker/build_image

Execution

  1. Jobs to be executed:

    • ./main.py --quiet --dag | dot -Tsvg > dag.svg (DAG)
    • ./main.py --quiet -nr (list)
  2. Run the pipeline:
    ./main.py --quiet

  3. Results:

Custom nodes

The implementation of the PEPTIDE REACToR follows a modular design, such that meta nodes can be connected almost arbitrarily. Moreover, custom nodes can be easily added. Note that this tool is based on Snakemake, hence all conditions for a valid Snakemake-workflow must be also fulfilled.

Example

A node to conduct multiple sequence alignment. Since we might need such a general node multiple times, we add it to the utils category.

  1. mkdir nodes/utils/multiple_sequence_alignment

  2. touch nodes/utils/multiple_sequence_alignment/Snakefile

  3. Specify input and output via config dictionary, e.g., config["fastas_in"] and config["fastas_out"].

  4. Copy/paste into the Snakefile and adapt stub:

    # from ... import ...
    
    TOKEN = config["token"]  # access unique token
    
    rule all:
        input:
             config["fastas_out"]
    
    rule multiple_sequence_alignment:
        input:
             config["fastas_in"]
        output:
             config["fastas_out"]
        run:
             pass
  5. touch nodes/utils/multiple_sequence_alignment/__init__.py for the API and copy/paste the following:

     import secrets
     
     # rule name
     def _get_header(token):
         return f'''
     rule utils_multiple_sequence_alignment_{token}:'''
     
     ... 
    
     # specify input, output and path to the Snake- and configuration file.
     def _get_main(fastas_in, fastas_out):
         return f'''
         input:
              fastas_in={fastas_in}
         output:
              fastas_out={fastas_out}
         ...
     '''
     
     # specify input and output
     def rule(fastas_in, fastas_out, benchmark_dir=None):
         token = secrets.token_hex(4)
         rule = _get_header(token)
         if benchmark_dir is not None:
             benchmark_out = f"{benchmark_dir}utils_multiple_sequence_alignment_{token}.txt"
         ...

    Refer to an actual __init__.py for a complete example.

  6. Make the node visible by adding from . import multiple_sequence_alignment in nodes/utils/__init__.py.

  7. Import and use the node in main.py:

    import nodes.utils as utils
    
    w.add(utils.multiple_sequence_alignment.rule(
          fastas_in=["data/{dataset}/seqs_mapped.fasta", "data/{dataset}/seqs_sec.fasta",
                     "data/{dataset}/seqs_ter.fasta"],
          fastas_out=["data/{dataset}/seqs_msa.fasta", "data/{dataset}/seqs_msa_sec.fasta",
                      "data/{dataset}/seqs_msa_ter.fasta"],
          benchmark_dir=w.benchmark_dir))
  8. During implementation, it might be helpful to run the rule isolated:

    ./peptidereactor/run_pipeline -s nodes/utils/multiple_sequence_alignment/Snakefile \
                                  --config fasta_in=... fasta_out=... token=... \
                                  -nr

    or even access the Docker container interactively:

    docker run -it --entrypoint "/bin/bash" peptidereactor

Note that the PEPTIDE REACToR is still under development, thus changes are likely. However, the fundamental structure, highlighted above, will remain.

About

A tool for an in-depth comparison and benchmarking of peptide encodings.

Resources

License

Stars

Watchers

Forks