nf-exec

Introduction

Tools to reproduce the steps to run nf-core pipelines for bioinformatics analysis within Linux environments.

Installation

Conda

If Conda is not already available in your environment, run the silent Miniconda installation:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3

Docker

If Docker is not already available in your environment, install it as follows (the commands below target Ubuntu):

sudo apt-get update

sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

sudo docker run hello-world
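
After the installation steps, it can help to confirm that the required commands are actually on PATH before launching a pipeline. The following is a minimal sketch, not part of the repository: the function name missing_tools is hypothetical, and the tool list (conda, docker, nextflow) is an assumption based on the commands used in this README.

```shell
#!/usr/bin/env bash
# Print the name of every requested tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The tool list is an assumption; adjust it to the software you rely on.
missing=$(missing_tools conda docker nextflow)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "conda, docker and nextflow are available"
fi
```

The check only tests that each command can be found; it does not verify versions or that the Docker daemon is running.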

Pipelines

Running Sarek (nf-core)

Sarek is a workflow designed to detect variants in whole-genome or targeted sequencing data. Initially designed for human and mouse, it can work on any species with a reference genome.

Reference genomes

Creating the data structure:

mkdir -p data/reference data/out data/samples

Downloading commonly used bioinformatics reference genomes. In this case, GATK files for GRCh38 and GRCh37:

make gatk-grch38
make gatk-grch37

Usage

Replace <my_sample_id> and <data_dir> with your actual values and set the following shell variables (use an absolute path for <data_dir> so that symbolic links pointing into it resolve correctly):

SAMPLE_ID=<my_sample_id>
DATA_DIR=<data_dir>

REFERENCE_DIR=$DATA_DIR/reference
SAMPLE_DIR=$DATA_DIR/out/$SAMPLE_ID
WORK_DIR=$DATA_DIR/out/$SAMPLE_ID/work
RESULTS_DIR=$DATA_DIR/out/$SAMPLE_ID/results

Creating the directory structure with symbolic links (check this guide here in case you don't know how to deal with symbolic links):

mkdir -p "$SAMPLE_ID/samples" # creating the new directory (and its samples subdirectory) for the new upcoming sample
mkdir -p "$WORK_DIR" "$RESULTS_DIR" # making sure the link targets exist

# ${SAMPLE_ID}_1 needs braces; $SAMPLE_ID_1 would be read as a different (unset) variable
ln -s "$SAMPLE_DIR/${SAMPLE_ID}_1.fastq.gz" "$SAMPLE_ID/samples/sample1_1.fastq.gz" # sample1_1.fastq.gz
ln -s "$SAMPLE_DIR/${SAMPLE_ID}_2.fastq.gz" "$SAMPLE_ID/samples/sample1_2.fastq.gz" # sample1_2.fastq.gz
ln -s "$REFERENCE_DIR" "$SAMPLE_ID/reference" # genome reference directory
ln -s "$WORK_DIR" "$SAMPLE_ID/work" # work directory
ln -s "$RESULTS_DIR" "$SAMPLE_ID/results" # results directory
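
Because ln -s does not check that its target exists, a mistyped path produces a dangling link that only fails later, when the pipeline tries to read it. A minimal sketch for listing such links is shown below; the function name broken_links and the demonstration paths are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# List symbolic links under a directory whose targets do not exist.
broken_links() {
  local dir=$1 link
  # -type l selects symlinks; [ -e ] follows the link and fails if the target is missing.
  find "$dir" -type l | while IFS= read -r link; do
    [ -e "$link" ] || printf '%s\n' "$link"
  done
}

# Demonstration in a throwaway directory (all paths are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/data/out" "$demo/sample"
ln -s "$demo/data/out" "$demo/sample/results"    # target exists
ln -s "$demo/data/missing" "$demo/sample/work"   # target does not exist
broken_links "$demo/sample"                      # prints only the dangling "work" link
rm -rf "$demo"
```

Running broken_links "$SAMPLE_ID" after creating the links should print nothing if every target is in place.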

Run the pipeline:

bash run.sh

The run.sh script is a bash script that runs the pipeline and has the following content:

#!/bin/bash
nextflow \
	run nf-core/sarek -r 2.7.1 \
	-params-file "params.json" \
	-work-dir "work" \
	-profile "docker"

Sample directory structure:

$ nf-exec/sample_template/: tree -L 3
.
├── params.json # parameters for the pipeline
├── references -> data/references # reference genome directory
├── resources.json # resources for the pipeline (Slurm)
├── results -> data/out/SAMPLE_ID/results # results directory
├── run.sh # script to run the pipeline
├── run.slurm # script to run the pipeline (Slurm)
├── samples # directory with the fastq.gz files
│   ├── sample1.fastq.gz -> data/samples/SAMPLE_ID_1.fastq.gz # 5'->3' paired-end reads
│   └── sample2.fastq.gz -> data/samples/SAMPLE_ID_2.fastq.gz # 3'->5' paired-end reads
├── samples.tsv # table with the metadata for the fastq.gz files
└── work -> data/out/SAMPLE_ID/work # work directory

1 directory, 10 files
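
Before launching a run, the layout above can be checked with a few lines of shell. This is a sketch under assumptions: the function name check_sample_dir is hypothetical, and the list of required entries is a subset taken from the tree shown here, which you may want to extend or trim.

```shell
#!/usr/bin/env bash
# Report which expected entries are missing from a sample directory.
# Returns 0 when everything is present, 1 otherwise.
check_sample_dir() {
  local dir=$1 name status=0
  for name in params.json run.sh samples.tsv samples; do
    if [ ! -e "$dir/$name" ]; then
      echo "missing: $name"
      status=1
    fi
  done
  return "$status"
}

# Usage (the directory name is a placeholder):
check_sample_dir "sample_template" || echo "sample directory is incomplete"
```

The check uses -e, which follows symbolic links, so a dangling link is reported as missing as well.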

Running a different pipeline (from nf-core)

If you want to perform a different analysis, below is an example of how to run an RNA-seq pipeline:

Downloading genome references. In this case, NCBI Reference Genome for GRCh38:

make download-ncbi-grch38

Repeat steps 2 and 3 from the previous section, changing the IDs to those of the control and case samples.
Create a new samples.csv to replace the original samples.tsv.

SAMPLE_ID=<my_sample_id>
mkdir -p "$SAMPLE_ID" # creating the new directory for the new upcoming sample
touch "$SAMPLE_ID/samples.csv"

Edit this samples.csv file following this reference:

sample,fastq_1,fastq_2,strandedness
CONTROL,samples/<SAMPLE_ID_control>_1.fastq.gz,samples/<SAMPLE_ID_control>_2.fastq.gz,reverse
TREATMENT,samples/<SAMPLE_ID_treatment>_1.fastq.gz,samples/<SAMPLE_ID_treatment>_2.fastq.gz,reverse
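
A hand-edited samplesheet is easy to get wrong, for example by pasting the same file into both fastq columns. The sketch below checks the header, the number of fields per row, and that the two mates differ. The function name check_samplesheet and the demonstration file names are hypothetical, and the checks are deliberately basic; they are not a replacement for the pipeline's own validation.

```shell
#!/usr/bin/env bash
# Check a samplesheet: expected header, four fields per row, distinct mates.
# Prints one message per problem and returns non-zero if any was found.
check_samplesheet() {
  awk -F',' '
    NR == 1 {
      if ($0 != "sample,fastq_1,fastq_2,strandedness") {
        print "line 1: unexpected header"; bad = 1
      }
      next
    }
    NF != 4  { print "line " NR ": expected 4 fields, found " NF; bad = 1; next }
    $2 == $3 { print "line " NR ": fastq_1 and fastq_2 are the same file"; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Demonstration with a throwaway file (sample names and paths are illustrative).
sheet=$(mktemp)
printf '%s\n' \
  'sample,fastq_1,fastq_2,strandedness' \
  'CONTROL,samples/ctrl_1.fastq.gz,samples/ctrl_2.fastq.gz,reverse' \
  'TREATMENT,samples/trt_1.fastq.gz,samples/trt_1.fastq.gz,reverse' > "$sheet"
check_samplesheet "$sheet" || echo "samplesheet needs fixing"
rm -f "$sheet"
```

In the demonstration, the TREATMENT row is flagged because both columns point to the same file.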

Modify the <SAMPLE_ID>/run.sh file to run the RNA-seq pipeline:

#!/bin/bash
nextflow \
	run nf-core/rnaseq -r 3.6 \
	-params-file "params.json" \
	-work-dir "work" \
	-profile "docker"

Directory structure

$ nf-exec/: tree -L 3
.
├── data # data directory to store all the input and output files
│   ├── out # output directory to store all the output files (results and work)
│   ├── references # directory to store the reference genomes
│   └── samples # directory to store the samples raw data
├── environment.yml # conda environment specification for reproducing the pipeline
├── LICENSE # license file
├── Makefile # makefile to facilitate the pipeline management
├── README.md # readme file
├── sample_template # sample template directory
└── scripts # scripts directory
    └── utils.py # utility functions written in Python to post-process the output files

3 directories, 4 files

Author information and license

Fernando Pozo (@fpozoca)

Distributed under the GNU General Public License. See LICENSE for more information.
