RNA sequencing (RNA-seq) is a technique that uses next-generation sequencing (NGS) to examine the quantity and sequences of RNA in a sample. Over the past few years, it has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. It is rapidly replacing gene expression microarrays in many labs because it lets you quantify, discover, and profile RNAs.
Several tools and pipelines exist for RNA-seq data analysis, and different consortia and institutions follow different guidelines and standards. H3ABioNet has developed a standard operating procedure (SOP) and guidelines for RNA-seq data analysis, with recommendations for gene expression analysis in humans.
In this repo, we document an RNA-seq data analysis that follows the guidelines developed by H3ABioNet. The data used in this project is available here.
Phase I (Pre-processing analysis), Time: 1 week.
Tools: FastQC v0.11.9, Trimmomatic v0.39, Cutadapt v2.8
- Download raw reads
- Check quality of the raw reads
- Adapter removal and quality trimming
- Quality recheck
Phase II (Gene Expression Analysis), Time: 3 weeks.
Generate gene/transcript-level counts
Tools: kallisto v0.46.2, Salmon v0.12.0, HISAT2 v2.1.0, featureCounts v2.0.0
- Align reads to the reference genome
- Generate estimated counts using a pseudo-alignment approach
- Collect and tabulate alignment statistics
Phase III (R analysis), Time: 2 weeks.
Tools: DESeq2 v3.12, edgeR v3.12
- QC and outlier removal / batch detection
- Answer the general questions of the project
- Wrap-up
Report generation, Time: 1 week.
- Comparison of outputs from each tool in each processing step.
Create conda environment
$ conda create --name [environ_name]
Activate conda environment
$ conda activate [environ_name]
Install tools
$ conda install [toolname] -c bioconda
Tool name | Version | Use |
---|---|---|
FastQC | 0.11.9 | Check the quality of the reads |
Trimmomatic | 0.39 | Trim adapter remnants and low-quality reads |
Cutadapt | 2.8 | Trim adapter remnants and low-quality reads |
HISAT2 | 2.1.0 | Align reads to the reference genome |
featureCounts | 2.0.0 | Perform gene counts |
kallisto | 0.46.2 | Pseudo-alignment and gene counts |
Salmon | 0.12.0 | Pseudo-alignment and gene counts |
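The install steps and the versions in the table above can be combined into one version-pinned environment. The following is a minimal sketch, not the project's own setup script: the function name `create_env`, the `run` dry-run helper, and the environment name `rnaseq` are illustrative assumptions, and featureCounts is installed through the Bioconda `subread` package that ships it. Whether all of these pins resolve together depends on the state of the channels.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Create a conda environment with version-pinned packages.
# Usage: create_env ENV_NAME PACKAGE=VERSION...
create_env() (
  env_name=$1
  shift
  run conda create --yes --name "$env_name" -c conda-forge -c bioconda "$@"
)

# Example call (hypothetical environment name; versions from the table):
# create_env rnaseq fastqc=0.11.9 trimmomatic=0.39 cutadapt=2.8 \
#   hisat2=2.1.0 subread=2.0.0 kallisto=0.46.2 salmon=0.12.0
```

With `DRY_RUN` left at its default, the call only prints the `conda create` command so it can be reviewed; set `DRY_RUN=0` to execute it.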
R-Analysis
Package | Use |
---|---|
DESeq2 | To analyse count data and test for differential expression. |
rhdf5 | To read abundance.h5 file |
tximport | To import abundance.h5 file |
pheatmap | To draw clustered heatmaps |
RColorBrewer | Provides ready-to-use color palettes for creating heatmaps |
tximportData | Provides output of running kallisto |
Phase 1
- Download raw reads
- Quality check of the raw reads
- Adapter removal and quality trimming
- Quality recheck
Phase 2
- Alignment
- Transcript/gene counts
- Collect and tabulate statistics
Phase 3
- Statistical analysis:
  - QC check
  - Outlier removal and normalization
  - Differential expression
How to use the provided scripts for analysis
HISAT2 pipeline
- The documents are found here.
- First, put your raw reads and metadata in one directory. On an HPC cluster, make sure you `module load` all the tools required for this pipeline. Begin by checking the quality of your reads with FastQC, which tells you which reads need trimming. Those reads are then trimmed with Trimmomatic to remove low-quality reads and reads shorter than your preferred length. This was done using this script.
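The quality check, trimming, and recheck described above can be sketched as a single shell function. This is a hedged illustration rather than the repository's script: it assumes paired-end reads named `SAMPLE_1.fastq.gz` and `SAMPLE_2.fastq.gz`, an adapter FASTA file you supply, and illustrative Trimmomatic thresholds; `qc_and_trim` and `run` are hypothetical names.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# FastQC on raw reads, Trimmomatic in paired-end mode, FastQC on the survivors.
# Usage: qc_and_trim SAMPLE READS_DIR OUT_DIR ADAPTERS_FASTA
qc_and_trim() (
  sample=$1 reads=$2 out=$3 adapters=$4
  r1="$reads/${sample}_1.fastq.gz"
  r2="$reads/${sample}_2.fastq.gz"
  run mkdir -p "$out/fastqc_raw" "$out/trimmed" "$out/fastqc_trimmed"
  run fastqc -o "$out/fastqc_raw" "$r1" "$r2"
  # Outputs: paired (1P/2P) and unpaired (1U/2U) reads for each mate.
  # The clipping and window thresholds are assumptions; tune them from the FastQC report.
  run trimmomatic PE -phred33 "$r1" "$r2" \
    "$out/trimmed/${sample}_1P.fastq.gz" "$out/trimmed/${sample}_1U.fastq.gz" \
    "$out/trimmed/${sample}_2P.fastq.gz" "$out/trimmed/${sample}_2U.fastq.gz" \
    "ILLUMINACLIP:$adapters:2:30:10" SLIDINGWINDOW:4:20 "MINLEN:${MINLEN:-36}"
  run fastqc -o "$out/fastqc_trimmed" \
    "$out/trimmed/${sample}_1P.fastq.gz" "$out/trimmed/${sample}_2P.fastq.gz"
)

# Example call (hypothetical sample and paths):
# qc_and_trim sample1 raw_reads results adapters.fa
```

`MINLEN` controls the shortest read kept after trimming, which corresponds to the preferred read length mentioned above.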
Alignment of the reads requires a reference genome, from which an index is built. You can obtain the FASTA file that matches your reads with `wget`, use HISAT2 to build the index, and then align the reads. This was done using this script.
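As a rough sketch of the indexing and alignment step, the two functions below wrap `hisat2-build` and `hisat2`. They assume the reference FASTA has already been downloaded and decompressed, and that trimmed paired-end reads follow a `SAMPLE_1P.fastq.gz` / `SAMPLE_2P.fastq.gz` naming scheme; the function names, paths, and thread count are illustrative assumptions, not taken from the repository's script.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Build a HISAT2 index from an uncompressed genome FASTA file.
# Usage: build_index GENOME_FASTA INDEX_PREFIX
build_index() (
  run hisat2-build "$1" "$2"
)

# Align one paired-end sample and keep the alignment summary for later tabulation.
# Usage: align_sample SAMPLE INDEX_PREFIX TRIMMED_DIR OUT_DIR
align_sample() (
  sample=$1 index=$2 reads=$3 out=$4
  run hisat2 -p "${THREADS:-4}" -x "$index" \
    -1 "$reads/${sample}_1P.fastq.gz" -2 "$reads/${sample}_2P.fastq.gz" \
    -S "$out/${sample}.sam" --summary-file "$out/${sample}.summary.txt"
)

# Example calls (hypothetical paths):
# build_index genome.fa index/grch38
# align_sample sample1 index/grch38 results/trimmed results/aligned
```

Writing the summary to a per-sample file makes it easier to collect alignment statistics across samples afterwards.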
HISAT2 does not produce counts itself, so they are obtained with a separate tool. In this pipeline, we used featureCounts to count the reads that aligned to the reference genome. The counts were done using this script.
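A counting step along these lines might look like the sketch below. It assumes paired-end alignments and a GTF annotation matching the reference genome, which featureCounts needs in order to assign reads to genes; the function name `count_reads` and the file names are hypothetical.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Count fragments per gene across one or more alignment files.
# Usage: count_reads ANNOTATION_GTF OUT_FILE ALIGNMENT...
count_reads() (
  gtf=$1 out=$2
  shift 2
  # -p: paired-end input; -t/-g: count over exons, summarised by gene_id.
  run featureCounts -p -T "${THREADS:-4}" -t exon -g gene_id \
    -a "$gtf" -o "$out" "$@"
)

# Example call (hypothetical file names):
# count_reads annotation.gtf results/counts.txt results/aligned/*.sam
```

Passing all samples in one call produces a single count matrix with one column per alignment file, which is the shape DESeq2 expects as input.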
- The counts obtained from featureCounts were used for statistical analysis in R using DESeq2. The statistical analyses are contained in the DESeq2 Rmd.
Salmon Pipeline
The scripts are found here.
Data
Place the raw reads from the sequencer, the metadata (downloaded from here), the reference genome (downloaded here), and Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz in the same directory as the scripts.
Phase I (Pre-processing)
The quality control check was done using the fastqc_quality_check.sh script.
Data cleaning involves removing adapter remnants, short reads, and low-quality bases. Cutadapt was the preferred trimming tool, and the script cutadapt.sh was used.
A quality recheck after trimming is necessary to examine how well the data was cleaned; this was done using the fastqc_quality_recheck.sh script.
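For illustration, the Cutadapt trimming step could be written roughly as follows. This is a sketch under stated assumptions, not the contents of cutadapt.sh: it assumes paired-end reads named `SAMPLE_1.fastq.gz` / `SAMPLE_2.fastq.gz`, uses the common Illumina adapter prefix `AGATCGGAAGAGC` as a default, and picks a quality cutoff of 20 and a minimum length of 25 as placeholders.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Remove adapters, low-quality 3' ends, and reads that end up too short.
# Usage: trim_pair SAMPLE READS_DIR OUT_DIR
trim_pair() (
  sample=$1 reads=$2 out=$3
  # Assumed default adapter; replace it with the adapter used in your library prep.
  adapter=${ADAPTER:-AGATCGGAAGAGC}
  run cutadapt -a "$adapter" -A "$adapter" \
    -q "${MIN_QUAL:-20}" -m "${MIN_LEN:-25}" \
    -o "$out/${sample}_1.trimmed.fastq.gz" -p "$out/${sample}_2.trimmed.fastq.gz" \
    "$reads/${sample}_1.fastq.gz" "$reads/${sample}_2.fastq.gz"
)

# Example call (hypothetical sample and paths):
# trim_pair sample1 raw_reads results/trimmed
```

The three options map onto the three cleaning goals: `-a`/`-A` remove adapter remnants from each mate, `-q` trims low-quality bases, and `-m` discards short reads.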
Phase II (Gene Expression Analysis)
This phase involves aligning the reads, generating gene counts, and tabulating the statistics; the script salmon.sh was used.
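A Salmon run generally has two parts: building an index and quantifying each sample against it. The sketch below is an assumption-laden illustration rather than a copy of salmon.sh: it assumes a transcript FASTA file as the index input, paired-end trimmed reads with the naming shown, and automatic library type detection (`-l A`); all function and file names are hypothetical.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Build a Salmon index from a transcript FASTA file.
# Usage: salmon_index TRANSCRIPTS_FASTA INDEX_DIR
salmon_index() (
  run salmon index -t "$1" -i "$2"
)

# Quantify one paired-end sample; results go to OUT_DIR/SAMPLE.
# Usage: salmon_quant SAMPLE INDEX_DIR TRIMMED_DIR OUT_DIR
salmon_quant() (
  sample=$1 index=$2 reads=$3 out=$4
  run salmon quant -i "$index" -l A -p "${THREADS:-4}" \
    -1 "$reads/${sample}_1.trimmed.fastq.gz" \
    -2 "$reads/${sample}_2.trimmed.fastq.gz" \
    -o "$out/$sample"
)

# Example calls (hypothetical paths):
# salmon_index transcripts.fa salmon_index_dir
# salmon_quant sample1 salmon_index_dir results/trimmed results/salmon
```

Each sample directory then holds the estimated counts together with Salmon's log files, which can be used when tabulating mapping statistics.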
Phase III (Statistical analysis/Differential Expression)
The edgeR package was used for normalization, statistical analysis, and visualization of the gene counts, using the EdgeR_Analysis_script.Rmd script. The generated HTML document for the edgeR analysis can be accessed here.
Kallisto Pipeline
The scripts can be accessed from this repo or here.
Data
The code for downloading the required data is included in the scripts. Because the data is large, the download takes time. If you have already downloaded the data for one of the pipelines above, you can comment out the download commands in the scripts; if you wish to obtain the raw data separately, use the links below.
Raw reads and metadata can be downloaded from here if you did not download them for the pipelines above. The reference genome can also be downloaded from here.
Phase I (Pre-processing)
The quality control check was done using FastQC, which informed the data cleaning parameters in the next step. Data cleaning involves removing adapter remnants, short reads, and low-quality bases. Trimmomatic was the preferred trimming tool in this pipeline.
A quality recheck after trimming is necessary to examine how well the data was cleaned. This was also done using FastQC.
The combined script for FastQC and Trimmomatic, with details, is fastqc-trimmomatic.sh.
Phase II (Gene Expression Analysis)
This phase involves aligning the reads, generating gene counts, and tabulating the statistics; the script kallisto.sh was used to achieve this.
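The kallisto step follows the same index-then-quantify pattern. The following is a minimal sketch, assuming a transcript FASTA file, paired-end trimmed reads named `SAMPLE_1P.fastq.gz` / `SAMPLE_2P.fastq.gz`, and 100 bootstrap samples; these choices and the function names are illustrative and may differ from kallisto.sh.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Build a kallisto index from a transcript FASTA file.
# Usage: kallisto_index TRANSCRIPTS_FASTA INDEX_FILE
kallisto_index() (
  run kallisto index -i "$2" "$1"
)

# Pseudo-align one paired-end sample; results go to OUT_DIR/SAMPLE.
# Usage: kallisto_quant SAMPLE INDEX_FILE TRIMMED_DIR OUT_DIR
kallisto_quant() (
  sample=$1 index=$2 reads=$3 out=$4
  run kallisto quant -i "$index" -o "$out/$sample" \
    -b "${BOOTSTRAPS:-100}" -t "${THREADS:-4}" \
    "$reads/${sample}_1P.fastq.gz" "$reads/${sample}_2P.fastq.gz"
)

# Example calls (hypothetical paths):
# kallisto_index transcripts.fa transcripts.idx
# kallisto_quant sample1 transcripts.idx results/trimmed results/kallisto
```

Each output directory contains an `abundance.h5` file, which is the file that rhdf5 and tximport read in the R analysis.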
Phase III (Statistical analysis/Differential Expression)
The DESeq2 package was used for normalization, statistical analysis, and visualization of the gene counts, using the kallisto_Deseq_analysis.Rmd script.
The generated HTML document for the DESeq2 analysis can be accessed here.
Conclusion
The final analysis report is available here