SMAPP

  1. You have experimental ribosome profiling data.
  2. You also have mass spectrometry data from the same samples.
  3. You are searching for novel (small) peptides.

=> Then you have come to the right place.

SMAPP (Small Peptide Pipeline) is a workflow that analyzes ribosome profiling data for evidence of small unannotated open reading frames (smORFs). If you also have matching mass spectrometry data, SMAPP uses it to validate the predicted smORFs. The workflow relies on publicly available bioinformatics tools and is developed in Nextflow, a widely used workflow management system in the bioinformatics community.

In the current SMAPP implementation, reads are first pre-processed and then filtered against an rRNA library. Quality control with state-of-the-art tools gives you meaningful initial insights into the quality and composition of your Ribo-seq library. After the ribosome-protected fragments (RPFs) have been mapped to your reference of choice, the aligned reads are quality-checked again and smORFs are predicted, relying mainly on the concept of periodicity. To analyze your mass spectra, you need a database to search against. The predicted peptides in FASTA format from the upstream steps provide a compact, targeted database; searching against a general resource such as UniProt, which contains far more proteins, increases the risk of false positives. Additional reports summarise the results of the individual steps and provide useful visualisations.

Note:
If you have only ribosome profiling data or only mass spectrometry data, you can still use the workflow.
Please skip to the part about profiles.
For a more detailed description of each step, please refer to the workflow documentation.

Requirements

The workflow has been tested on:

  • CentOS 7 & 7.9
  • macOS 12.3.1

NOTE: Currently, only macOS & Linux execution is supported.

Installation

1. Clone the repository

Go to the desired directory on your file system, then clone the repository and move into it with:

git clone https://github.com/noepozzan/small-peptide-pipeline
cd small-peptide-pipeline

2. Conda installation

Workflow dependencies can be conveniently installed with the Conda package manager. We recommend installing Miniconda for your system (Linux); be sure to select the Python 3 option. The workflow was built and tested with Miniconda 4.13.0; other versions are not guaranteed to work as expected.
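As an illustration only (the installer URL below assumes Linux on x86_64; pick the installer that matches your OS and CPU architecture), a Miniconda installation could look like this:

# Illustrative only: download and run the Miniconda installer for Linux x86_64.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh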

3. Dependencies installation

For improved reproducibility and reusability of the workflow, each individual step runs in its own Singularity or Docker container. As a consequence, the workflow itself has very few direct dependencies. Because a functional installation of Singularity or Docker requires root privileges, the installation instructions differ slightly depending on your system/setup:

Singularity and/or Docker installation

Please install Singularity or Docker separately, depending on your system; the installation requires elevated (root) privileges. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want to run the workflow on a high-performance computing (HPC) cluster.

NOTE: The workflow has been tested with the following versions:

  • Singularity v3.8.5-1.el7
  • Docker 20.10.17
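You can check which versions are installed on your system with the tools' standard version flags:

singularity --version
docker --version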

After the installation has completed, install the remaining dependencies with:

conda env create -f install/environment.yml

4. Activate environment

Activate the Conda environment with:

conda activate small_peptides
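Assuming the environment provides Nextflow (which the workflow needs), a quick sanity check that everything is in place looks like this:

conda env list      # the active environment is marked with an asterisk
nextflow -version   # prints the Nextflow version if it is on your PATH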

5. Before running this pipeline

1. This workflow relies on many external tools, which is why each of them is packaged into a container.

One of these tools is MSFragger.
Since MSFragger is free only for non-commercial use, you should run one of the following:

  • If you'll be working with Docker:
cd <main directory of this project>
source data/scripts/docker_envs.sh
  • Or, if you will be using Singularity:
cd <main directory of this project>
source data/scripts/singularity_envs.sh

This sets environment variables that allow you to pull the private MSFragger image from noepozzan's Docker Hub repository. (If you follow the link, you won't see the image, since it is private.)
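If you want to see which variables a given script sets before sourcing it, you can simply inspect it (this assumes the script uses plain export statements):

grep export data/scripts/docker_envs.sh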

2. The best way to work with Singularity & Nextflow and avoid errors is to pull the images ahead of time.

(This will take between 5 and 15 minutes.)

Attention: Only run this if you have Singularity installed.

cd <main directory of this project>
bash data/scripts/pull_containers.sh
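Under the hood, pulling a container with Singularity boils down to commands of the following form; the image URI below is purely illustrative, and the actual URIs used by the pipeline are the ones listed in data/scripts/pull_containers.sh:

# Illustrative only: convert a Docker image into a local Singularity image file (.sif).
singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0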

Extra installation steps (optional)

6. Non-essential dependencies installation

Most tests have additional dependencies. If you are planning to run tests, you will need to install these by executing the following command in your active Conda environment:

conda env update -f install/environment.dev.yml

7. Successful installation tests

1. Since even the test files for this pipeline are quite large, I provide a separate GitHub repository to pull them from.

If you do not have Git LFS installed, please install it first (a quick check is shown at the end of this step), then run the commands below:

cd <main directory of this project>
bash data/scripts/pull_test_data.sh

This script puts the test files in the right place for the tests to pass.
Note that these and other tests will only complete successfully if the additional dependencies are installed.
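If you are unsure whether Git LFS is set up on your machine, its standard commands will tell you (and enable the hooks if necessary):

git lfs --version   # shows the installed Git LFS version, or fails if it is missing
git lfs install     # sets up the Git LFS hooks for your user account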

2. Remember to activate the conda environment and give the tests enough time (between 2 and 5 minutes).

3. Execute one of the following commands to run the test workflow:

  • Test workflow with Docker:

    pytest --tag integration_test_docker --basetemp=${HOME}/test_runs
  • Test workflow with Singularity (locally):

    pytest --tag integration_test_singularity --basetemp=${HOME}/test_runs

Or on a Slurm-managed high-performance computing (HPC) cluster:

  • Test workflow with Singularity & Slurm:

    pytest --tag integration_test_slurm --basetemp=${HOME}/test_runs

Running the workflow on your own samples

If you want to run the workflow on your own files, the invocation is straightforward:

cd <project's main directory>
nextflow run main.nf -profile <profile of your choice>,<profile that fits your work environment>
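Incidentally, because this is a standard Nextflow pipeline, Nextflow's generic -resume flag (a Nextflow feature, not something specific to SMAPP) can be appended to pick up an interrupted run from its cached results:

nextflow run main.nf -profile <profile of your choice>,<profile that fits your work environment> -resume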

But before you start, you have to get the configuration right. As you can see above, this workflow needs two profiles:

  • <profile of your choice>: Where you provide the paths to the files and parameters for the tools included in the workflow.
    You find these files under conf/params/.
  • <profile that fits your work environment>: Where you detail the memory and the CPUs of your system/environment.
    You find these files under conf/envs/.

1. You have the choice of running the workflow in different configurations:

(substitute one of the options below for <profile of your choice> above)

  • full: to run the full pipeline (this is computationally quite heavy and should be done in a cluster environment)
  • test: to only run the test pipeline with small files
  • qc: to only run the quality control part of the pipeline
  • prepare: to prepare the reads
  • ribotish: to predict small peptides from your ribosome profiling data (if you don't have mass spec data)
  • proteomics: to search your mass spectra files (.raw, .mzML) against a database
  • fasta: to run the workflow if you already have preprocessed files (comes in handy sometimes)

IMPORTANT: The profile you choose must match the .config file you adapt. So, if you choose the profile full, you have to specify the paths to your files in the conf/params/full.config configuration file.
Use your editor of choice to populate these files with the correct paths to your own files. Each config file indicates the variables needed to run the workflow the way you want.

2. Have a look at the examples in the conf/ directory to see what these files should look like.

3. Pick one of the following choices for either local or cluster execution:

  • slurm: for cluster execution (needs Singularity installed)
  • slurm_offline: for cluster execution (needs Singularity installed; the safer way to run, so try this if the above fails)
  • singularity: for local execution (needs Singularity installed)
  • singularity_offline: for local execution (needs Singularity installed; the safer way to run, so try this if the above fails)
  • docker: for local execution (needs Docker installed and the daemon running)
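Putting the two lists together: to run, for example, the small test profile locally with Docker, the command would be:

nextflow run main.nf -profile test,docker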

NOTE: Depending on the configuration of your Slurm installation, you may need to adapt the files under the conf/envs/ directory as well as the arguments to the memory and cpus options in the *.config file of the respective profile. Consult the manual of your workload manager as well as the section of the Nextflow manual dealing with profiles.
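To see what resources your cluster actually offers before setting these values, standard Slurm commands help, for example:

sinfo -o "%P %c %m"   # partitions with CPUs per node and memory per node (in MB)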

4. Start your workflow run (finally):

Either run it in the foreground to view the output directly in your terminal:

nextflow run main.nf -profile <profile of your choice>,<profile that fits your work environment>

Or have the workflow run in the background (practical if you need to leave your computer while the pipeline is still running).
This option requires you to copy the exact nextflow command you intend to run into slurm.script, which you'll find in the project's main directory, and then submit it:

sbatch slurm.script
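Once submitted, you can follow the run with standard Slurm and Nextflow tooling, for example:

squeue -u $USER        # check whether your Slurm job is queued or running
tail -f .nextflow.log  # follow the Nextflow log in the directory you launched from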

If you have come this far, you are very welcome (& encouraged) to contribute to this project.

Please just open an issue or a pull request.