- You have experimental ribosome profiling data.
- You also have mass spectrometry data from the same samples.
- You are searching for novel (small) peptides.
SMAPP (Small Peptide Pipeline) is a workflow that allows you to analyze ribosome profiling data for the existence of small unannotated open reading frames (smORFs). If, in addition, you have matching mass spectrometry data, SMAPP uses it to validate your predicted smORFs. The workflow relies on publicly available bioinformatics tools and is developed in Nextflow, a workflow management system widely used in the bioinformatics community.
In the current SMAPP implementation, reads are first pre-processed and then filtered against an rRNA library. Quality control with state-of-the-art tools gives you meaningful initial insights into the quality and composition of your Ribo-seq library. After the ribosome-protected fragments (RPFs) have been mapped to your reference of choice, the aligned reads are quality-checked again and smORFs are predicted, relying mainly on the concept of periodicity. For the analysis of your mass spectra you need a database to search against. The predicted peptides from the upstream steps, provided in FASTA format, offer an optimal database for this purpose: searching against a general database such as UniProt, which contains far more proteins, would increase the risk of false positives. Additional reports summarise the results of the individual steps and provide useful visualisations.
Note:
If you have only ribosome profiling data or only mass spectrometry data, you can still use the workflow.
Please skip ahead to the section about profiles.
For a more detailed description of each step, please refer to the workflow documentation.
The workflow has been tested on:
- CentOS 7 & 7.9
- macOS 12.3.1
NOTE: Currently, only macOS & Linux execution is supported.
Go to the desired directory/folder on your file system, then clone/get the repository and move into the respective directory with:
git clone https://github.com/noepozzan/small-peptide-pipeline
cd small-peptide-pipeline
Workflow dependencies can be conveniently installed with the Conda package manager. We recommend that you install Miniconda for your system (Linux). Be sure to select the Python 3 option. The workflow was built and tested with Miniconda 4.13.0; other versions are not guaranteed to work as expected.
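For example (this is just one way to do it; adjust the installer name for your operating system and architecture), Miniconda can be installed on a Linux x86_64 machine with:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh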
For improved reproducibility and reusability of the workflow, each individual step runs in its own Singularity or Docker container. As a consequence, running this workflow has very few direct dependencies. Since a functional installation of Singularity or Docker requires root privileges, the installation instructions differ slightly depending on your system/setup:
Please install Singularity or Docker separately (with the required privileges), depending on your system. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want to run the workflow on a high-performance computing (HPC) cluster.
NOTE: The workflow has been tested with the following versions:
Singularity v3.8.5-1.el7
Docker 20.10.17
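You can check which versions are installed on your system (only the container engine you actually plan to use needs to be present) with:
singularity --version
docker --version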
After the installation has completed, install the remaining dependencies with:
conda env create -f install/environment.yml
Activate the Conda environment with:
conda activate small_peptides
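The workflow itself is written in Nextflow; assuming Nextflow is installed as part of this Conda environment (if it is not, install it separately), you can confirm that it is available with:
nextflow -version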
1. This workflow relies on many external tools (that is why the software is packaged into containers). One of these tools is MSFragger.
Since MSFragger is free only for non-commercial use, you need to run one of the following:
- If you'll be working with Docker:
cd <main directory of this project>
source data/scripts/docker_envs.sh
- Or, if you will be using Singularity:
cd <main directory of this project>
source data/scripts/singularity_envs.sh
This sets environment variables that allow you to pull the private MSFragger image from noepozzan's Docker Hub repository. (Even if you visit the repository page, you won't see the image, since it is private.)
2. The best way to work with Singularity & Nextflow and to avoid errors is to pull the container images ahead of time.
(This will take between 5 and 15 minutes.)
Attention: Only run this if you have Singularity installed.
cd <main directory of this project>
bash data/scripts/pull_containers.sh
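If you want to check what was downloaded, and assuming the script stores the images in Singularity's default cache (an assumption on my part; the exact location depends on the script), you can list the cached images with:
singularity cache list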
Most tests have additional dependencies. If you are planning to run tests, you will need to install these by executing the following command in your active Conda environment:
conda env update -f install/environment.dev.yml
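The tests below are run with pytest, so a quick sanity check that the development dependencies were picked up (assuming pytest is part of install/environment.dev.yml, which the test commands below suggest) is:
pytest --version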
1. Since even the test files for this pipeline are quite large, I provide a separate GitHub repository to pull them from.
If you do not have Git LFS installed, please install it first (one way to do this is sketched just below) and then run the commands shown here:
cd <main directory of this project>
bash data/scripts/pull_test_data.sh
This puts the test files in the right place for the tests to pass.
Note that for this and the other tests to complete successfully, you need to have the additional dependencies installed.
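If Git LFS is not yet available on your system, one way to install it, assuming you stick with the Conda setup described above (any other installation method works just as well), is:
conda install -c conda-forge git-lfs
git lfs install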
2. Remember to activate the conda environment and give the tests enough time (between 2 and 5 minutes).
- Test workflow with Docker:
pytest --tag integration_test_docker --basetemp=${HOME}/test_runs
- Test workflow with Singularity (locally):
pytest --tag integration_test_singularity --basetemp=${HOME}/test_runs
Or on a Slurm-managed high-performance computing (HPC) cluster:
- Test workflow with Singularity & Slurm:
pytest --tag integration_test_slurm --basetemp=${HOME}/test_runs
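Once a test run has finished, the working directories it created can be inspected under the --basetemp location given above, for example with:
ls ${HOME}/test_runs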
Running the workflow on your own files is straightforward:
cd <project's main directory>
nextflow run main.nf -profile <profile of your choice>,<profile that fits your work environment>
But before you start, you have to get the configuration right. As you can see above, this workflow needs 2 profiles:
- <profile of your choice>: where you provide the paths to the files and the parameters for the tools included in the workflow. You find these files under conf/params/.
- <profile that fits your work environment>: where you detail the memory and the CPUs of your system/environment. You find these files under conf/envs/.
(Substitute one of the options below for the <profile of your choice> above.)
- full: to run the full pipeline (this is computationally quite heavy and should be done in a cluster environment)
- test: to only run the test pipeline with small files
- qc: to only run the quality control part of the pipeline
- prepare: to prepare the reads
- ribotish: to predict small peptides from your ribosome profiling data (if you don't have mass spec data)
- proteomics: to search your mass spectra files (.raw, .mzML) against a database
- fasta: to run the workflow if you already have preprocessed files (comes in handy sometimes)
IMPORTANT: The profile you choose must match the .config file you adapt. So, if you choose the profile full, you have to specify the paths to your files in the conf/params/full.config configuration file.
Use your editor of choice to populate these files with the correct paths to your own files. Every config file indicates the variables necessary to run the workflow the way you want it to.
2. Have a look at the examples in the conf/ directory to see what the files should look like, specifically:
- full.config
- slurm.config
- For more details and explanations, refer to the pipeline documentation.
(Substitute one of the options below for the <profile that fits your work environment> above.)
- slurm: for cluster execution (needs Singularity installed)
- slurm_offline: for cluster execution (needs Singularity installed; this is the safer way to run, please try it if the above fails)
- singularity: for local execution (needs Singularity installed)
- singularity_offline: for local execution (needs Singularity installed; this is the safer way to run, please try it if the above fails)
- docker: for local execution (needs Docker installed and the daemon running)
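For example, to run the small test pipeline locally with Docker, the test parameter profile is combined with the docker environment profile:
cd <main directory of this project>
nextflow run main.nf -profile test,docker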
NOTE: Depending on the configuration of your Slurm installation, you may need to adapt the files under the conf/envs/ directory and the arguments to the options memory and cpus in the *.config file of the respective profile. Consult the manual of your workload manager as well as the section of the Nextflow manual dealing with profiles.
Either, to view the output directly in your terminal:
nextflow run main.nf -profile <profile of your choice>,<profile that fits your work environment>
Or, to have the workflow run in the background (practical if you need to leave your computer while the pipeline is still running): copy the exact nextflow command you intend to run into the slurm.script file, which you'll find in the project's main directory, and then submit it with:
sbatch slurm.script
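To follow the run afterwards, you can check the job in the Slurm queue and tail Nextflow's log file (.nextflow.log is Nextflow's default log file name in the launch directory; the exact location may differ on your setup):
squeue -u $USER
tail -f .nextflow.log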
For questions, problems, or contributions, please just open an issue or a pull request.