A continually expanding collection of RNA-seq tools

eshwari-ravi/RNA-seq_notes

MIT License PR's Welcome

RNA-seq related tools and genomics data analysis resources. Please contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.

Table of contents

Pipelines

Preprocessing

  • Check strandedness of RNA-Seq fastq files

  • Illumina Instrument Type from fastq

  • adapters - Adapter sequences for trimming, by Stephen Turner

  • bamcov - Quickly calculate and visualize sequence coverage in alignment files in command line

  • bamtocov - coverage extraction from BAM/CRAM files to wig format

  • bamcount - BigWig and BAM utilities, coverage, by Ben Langmead

  • bigwig-nim - single static binary with a liberal license that converts BED to bigWig (and back) and gets fast coverage stats from bigWig, by Brent Pedersen, Twitter

  • cgpBigWig - Package of C scripts for generation of BigWig coverage files.

  • covviz - calculate and view coverage based variation

  • indexcov - Quickly estimate coverage from a whole-genome bam or cram index

  • fastq-pair - Match up paired end fastq files quickly and efficiently

  • FastUniq - an ultrafast de novo duplicates removal tool for paired short DNA sequences.

  • faster - A (very) fast program for getting statistics about a fastq file, written in Rust. Get the read length, GC content, mean Phred scores, trim front and tail, regex search. Compiled binaries are available

  • rasusa - Randomly subsample sequencing reads to a specified coverage, single- and paired end reads
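The idea of subsampling reads to a target coverage (as in the rasusa entry above) can be sketched in a few lines. This is a minimal illustration of the arithmetic only, not rasusa's implementation: it assumes the target is `coverage * genome_size` bases and that reads are visited in random order until that target is reached. The function name `subsample_to_coverage` and its parameters are hypothetical.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, coverage, seed=0):
    """Pick read indices at random until their bases reach coverage * genome_size.

    read_lengths: list of read lengths in bp (one entry per read).
    Returns (sorted kept indices, total bases kept).
    Hypothetical helper for illustration; not part of any tool's API.
    """
    target_bases = genome_size * coverage
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # visit reads in a reproducible random order
    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept), total


# 1000 reads of 100 bp, a 5 kb genome, and a 10x target need 50,000 bases.
kept, total = subsample_to_coverage([100] * 1000, genome_size=5_000, coverage=10)
```

For paired-end data the same logic would apply to read pairs rather than single reads, so that mates are kept or dropped together.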

Aligners

  • Chromap - ultra-fast aligner (>10X faster) for ChIP-seq, Hi-C, scATAC-seq. Based on the minimizer sketch. Memory depends only on genome index size, ~20Gb for human.
    Paper Zhang, Haowen, Li Song, Xiaotao Wang, Haoyu Cheng, Chenfei Wang, Clifford A. Meyer, Tao Liu, et al. “Fast Alignment and Preprocessing of Chromatin Profiles with Chromap.” Nature Communications, 12 November 2021, https://doi.org/10.1038/s41467-021-26865-w
  • SNAP - paired-read short-read (70-300bp) aligner based on fuzzy set intersection. 2-5x faster than BWA-mem2, Bowtie2. When used with Haplotype Caller from the Genome Analysis Toolkit, SNAP produces better concordance with known-truth sets than other aligners for most of the genome-in-a-bottle and Illumina Platinum genomes. Additional features: accepts SAM and BAM, outputs a sorted, duplicate-marked and indexed file. Binaries for Windows, Mac, Linux. Tweet.
    Paper Bolosky, William J., Arun Subramaniyan, Matei Zaharia, Ravi Pandya, Taylor Sittler, and David Patterson. “Fuzzy Set Intersection Based Paired-End Short-Read Alignment.” Preprint. Bioinformatics, November 23, 2021. https://doi.org/10.1101/2021.11.23.469039.

Long-read

  • Minimap2 - aligner for long- (SMRT, ONT technologies, over 1kb) and short- (over 100bp, paired-end supported) reads. Split-read alignment, gap cost for long insertions and deletions, reduces spurious alignment. 3-4 times faster than short-read aligners (C and Python implementation), over 30 times faster than long-read aligners (BLASR, BWA-MEM, GraphMap, minialign, NGMLR). Presets of parameters.
    Paper Li, Heng. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Edited by Inanc Birol. Bioinformatics 34, no. 18 (September 15, 2018): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.
  • lorax - A long-read analysis toolbox for cancer genomics applications. Requires matched tumor-normal data sequenced using long-reads.

  • NGMLR - long-read mapper designed to align PacBio or Oxford Nanopore (standard and ultra-long) to a reference genome with a focus on reads that span structural variations

  • Sniffles - structural variation caller using third generation sequencing (PacBio or Oxford Nanopore).

Analysis

Quality control

  • Qualimap - Qualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts. Supported types of experiments include: Whole-genome sequencing, Whole-exome sequencing, RNA-seq (special mode available), ChIP-seq

  • fastp - fast C++ parallelized tool for FASTQ quality control, adapter trimming, quality filtering, pruning, polyX (polyG) trimming, works with single- and paired-end data. GitHub

  • MultiQC - Summarization and visualization of QC results for multiple samples in one report. Recognizes multiple QC tools

  • ngsReports - An R Package for managing FastQC reports and other NGS related log files. bioRxiv

  • sickle - A windowed adaptive trimming tool for FASTQ files using quality. Post-adapter trimming step

  • fastqc - an R package for quality control (QC) of short read fastq files, analog of the original FASTQC

  • FastQt - FastQC port to Qt5: A quality control tool for high throughput sequence data

  • fastqcheck - Generate statistics on and validate fastq files
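Windowed quality trimming, as mentioned for sickle above, can be illustrated with a simplified sketch. It slides a fixed-size window along the read and cuts at the start of the first window whose mean Phred score falls below a threshold. This is a simplification for illustration and does not reproduce sickle's exact algorithm or defaults; the function names, the window size of 4, and the threshold of 20 are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]


def window_trim(seq, qual, window=4, min_mean_q=20, offset=33):
    """Trim the 3' end at the first window whose mean quality is below min_mean_q.

    Hypothetical helper: a simplified 3'-only trim, not sickle's implementation.
    Returns the trimmed (sequence, quality) pair.
    """
    scores = phred_scores(qual, offset)
    window = min(window, len(scores))  # shrink the window for very short reads
    if window == 0:
        return seq, qual
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
trimmed_seq, trimmed_qual = window_trim("ACGTACGTAC", "IIIIII####")
```

Because the cut is placed at the start of the failing window, one or two good bases next to the low-quality tail may be removed as well; real trimmers refine the cut position within the window.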

Imputation

  • fancyimpute - Multivariate imputation and matrix completion algorithms implemented in Python. Algorithms: SimpleFill, KNN, SoftImpute, IterativeImputer, IterativeSVD, MatrixFactorization, NuclearNormMinimization, BiScaler.

  • softImpute - R package for Matrix Completion via Iterative Soft-Thresholded SVD, by Trevor Hastie and Rahul Mazumder.

Batch effect

  • ComBat-seq - batch effect correction for RNA-seq data using negative binomial regression. Maintains count nature of RNA-seq data. Tested on simulated data (polyester package), and experimental data. Achieves the highest true positive rate. Code to reproduce paper

  • combat.py - python / numpy / pandas / patsy version of ComBat for removing batch effects

Clustering

  • clust - Python package for identification of consistently coexpressed clusters of genes, within and across datasets. Consensus clustering principle. Approximately 50% of genes do not cluster well and thus shouldn't be considered. Compared with seven tools (cross-clustering, k-means, SOMs, MCL, HC, Click, WGCNA) using seven different cluster validation metrics. Outperforms all, produces more focused and significant functional enrichment results.

Timecourse

  • LPWC - Lag Penalized Weighted Correlation, a similarity measure to group pairs of time series that are not perfectly synchronized. Review of previous approaches (hierarchical clustering, partition-based, Bayesian models). Correlation-based, with the lag penalty for shift, two options to select it. Best for 5 or more time points; for shorter time courses, either use a high penalty or another tool such as STEM. Tested on simulated data (ImpulseDE data) using the adjusted Rand index

  • Time course gene expression analysis review. Biological scenarios requiring a time-course, analytical approaches, Table 1 - software for time course analysis (EDGE, BETR, clustering tools, network analysis).

  • DREM 2.0 - time course analysis of gene expression data. Detects patterns of gene expression changes and the corresponding transcription factors driving them, motif discovery using protein-DNA (ChIP-seq, ChIP-chip, computational) data, differential motif analysis (DECOD method). Hidden Markov Model-based algorithm. Java tool, GUI and command line interface.

Differential expression

  • Glimma 2.0 - R package for interactive visualization of DESeq2, edgeR, Limma objects. MDS plot, MA plot, Volcano plot. D3, htmlwidgets, plotly, dygraphs. Plots are embeddable in RMarkdown. Export data as CSV, PNG, SVG. Bioconductor.
    Paper Kariyawasam, Hasaru, Shian Su, Oliver Voogd, Matthew E Ritchie, and Charity W Law. “Dashboard-Style Interactive Plots for RNA-Seq Analysis Are R Markdown Ready with Glimma 2.0.” NAR Genomics and Bioinformatics 3, no. 4 (October 4, 2021): lqab116. https://doi.org/10.1093/nargab/lqab116.

Allele-specific expression

  • SNPsplit - allele-specific splitting of SAM/BAM alignments using known SNP genotypes. Based on alignment to SNP-masked genomes (N). Perl, three scripts (SNPsplit, SNPsplit_genome_preparation, tag2sort). Input - SAM/BAM files, an annotation file containing the positions of all SNPs in the genome. Performs 1) read tagging and 2) read sorting. Classifies into Allele 1-specific, Allele 2-specific, Unassigned, Conflicting. Works with DNA-seq, RNA-seq, Hi-C, Bisulfite-seq data, single- and paired end, aligned by various aligners. GitHub.
    Paper Krueger, Felix, and Simon R. Andrews. “SNPsplit: Allele-Specific Splitting of Alignments between Genomes with Known SNP Genotypes.” F1000Research 5 (July 2016): 1479. https://doi.org/10.12688/f1000research.9037.2.
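The read-tagging step described for SNPsplit can be sketched as a small classifier. This is an illustration of the four categories only, under strong assumptions: an ungapped alignment, 0-based reference coordinates, and a SNP table given as a Python dictionary. The function `classify_read` and the data layout are hypothetical and are not SNPsplit's interface.

```python
def classify_read(read_start, read_seq, snps):
    """Assign a read to an allele based on the bases it carries at known SNP positions.

    read_start: 0-based reference position of the first read base (ungapped alignment assumed).
    snps: dict mapping reference position -> (allele1_base, allele2_base).
    Hypothetical helper for illustration only.
    """
    allele1_hits = allele2_hits = 0
    for offset, base in enumerate(read_seq):
        alleles = snps.get(read_start + offset)
        if alleles is None:
            continue  # not a known SNP position
        if base == alleles[0]:
            allele1_hits += 1
        elif base == alleles[1]:
            allele2_hits += 1
    if allele1_hits and allele2_hits:
        return "Conflicting"
    if allele1_hits:
        return "Allele 1-specific"
    if allele2_hits:
        return "Allele 2-specific"
    return "Unassigned"  # no informative SNP covered


snps = {5: ("A", "G"), 8: ("C", "T")}
label = classify_read(3, "TTATTCAA", snps)
```

Reads that cover no SNP, or carry a base matching neither allele, end up as "Unassigned" in this sketch.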

Functional enrichment

  • eVITTA - web tool for interactive gene expression and functional enrichment (GSEA, overrepresentation) analyses. Three modules: 1) easyGEO, retrieval and analysis of GEO datasets; 2) easyGSEA, GSEA or overrepresentation enrichment analyses; 3) easyVizR, comparison among experimental groups (overlap of gene lists and enrichment results). Input: DESeq2/edgeR output, (ranked) gene lists. Output: interactive barplots, heatmaps, volcano plots (plotly), rank-rank hypergeometric overlap (RRHO) plot, networks, enrichment tables. Figure 1, Table 1 - overview of analyses, inputs, outputs. GitHub.
    Paper Cheng, Xuanjin, Junran Yan, Yongxing Liu, Jiahe Wang, and Stefan Taubert. “EVITTA: A Web-Based Visualization and Inference Toolbox for Transcriptome Analysis.” Nucleic Acids Research 49, no. W1 (July 2, 2021): W207–15. https://doi.org/10.1093/nar/gkab366.

Transcription regulators

  • ChEA3 - predicting regulatory TFs for sets of user-provided genes. Improved backend reference gene set data (six datasets), ranking of the most significantly enriched TFs. Benchmarking against several other TF prioritization tools (overviewed in intro). Docker, API web-interface and downloadable data

  • RABIT - find TFs regulating a list of genes. Integrated ChIP-seq and gene expression data, regression framework. Tested in experimental KO data, tumor-profiling cohorts.

  • RcisTarget - finding enriched motifs in cis-regulatory regions in a gene list

Non-canonical RNAs

Alternative splicing

miRNAs

  • CancerMIRNome - web server for exploratory miRNA analysis in TCGA cancers and circulating microRNA studies. Query individual miRNAs, cancers. Differential expression, ROC for predicting tumor-normal distinction, survival plots, miRNA-target correlation, functional enrichment of targets. GitHub

  • PharmacomiR - miRNA-drug associations analysis

  • microRNAome - read counts for microRNAs across tissues, cell-types, and cancer cell-lines, SummarizedExperiment R package

  • miRNAmeConverter - Convert miRNA Names to Different miRBase Versions

  • MIENTURNET - web tool for miRNA-target enrichment analysis, prioritization, network visualization, functional enrichment for microRNA target genes.

  • miRDB - database for miRNA target prediction and functional annotations. The targets were predicted by MirTarget from RNA-seq and CLIP-seq data. Five species: human, mouse, rat, dog and chicken. Custom target prediction. Cell line-specific. Integrative analysis of target prediction and Gene Ontology data.

  • TAM2 - miRNA enrichment analysis. Manually curated and established miRNA sets. Single list analysis, up vs downregulated. Complementary tools - miSEA, miEAA

    • Li, Jianwei, Xiaofen Han, Yanping Wan, Shan Zhang, Yingshu Zhao, Rui Fan, Qinghua Cui, and Yuan Zhou. “TAM 2.0: Tool for MicroRNA Set Analysis.” Nucleic Acids Research 46, no. W1 (July 2, 2018)
  • miRsponge - identification and analysis of miRNA sponge interaction networks and modules. Seven methods for miRNA sponge interaction detection (miRHomology, pc, sppc, hermes, ppc, muTaME, and cernia), and integrative method, description of each method. Four module detection methods (FN, MCL, LINKCOMM, MCODE), description of each. Enrichment analyses - disease (DO, DisGeNet, Network of Cancer Genes), functions (GO, KEGG, REACTOME). Survival analysis.

  • MirGeneDB - standardized microRNA database, 1288 microRNA families across 45 species. Downloadable, FASTA, GFF, BED files. Nomenclature refs 19, 20.

  • miRPathDB - miRNA-pathway association database, human, mouse

lncRNAs

  • LncSEA - long non-coding RNA database and enrichment analysis. Covers over 50K lncRNAs, contains reference sets in 18 categories (Accessible chromatin, enhancer, super enhancer, transcription factor, survival, Drug, Disease, Cancer hallmark, subcellular location etc., Supplementary Table 2 and 3 - data sources) and 66 subcategories (based on specific attributes, overlap/proximal/closest, cancer subtypes, etc.), Table 1. Hypergeometric enrichment, Jaccard, Simpson overlaps, Correction for multiple testing (BH, Bonferroni). ID conversion. Supplementary Material 2 - details of categories. Previous tools: Co-LncRNA, Lnc-GFP, FARNA, LnCompare (Supplementary Table 1 - Comparison of LncSEA with other databases and tools). Supplementary material. GitHub.
    Paper Chen, Jiaxin, Jian Zhang, Yu Gao, Yanyu Li, Chenchen Feng, Chao Song, Ziyu Ning, et al. “LncSEA: A Platform for Long Non-Coding RNA Related Sets and Enrichment Analysis,” Nucleic Acids Research, 8 January 2021. https://doi.org/10.1093/nar/gkaa806
  • lncRNAKB - database of long noncoding RNAs. lncRNAs are typically less conserved, expressed low on average and highly tissue-specific. Combines six resources (CHESS, LNCipedia, NONCODE, FANTOM, MiTranscriptome, BIGTranscriptome). Information about tissue-specific expression, eQTL, WGCNA co-expression to predict functions in a tissue-specific manner, random forest prediction of protein-coding score. Data: GTF gene annotation, tissue-specific expression (TPM, counts, eQTL). RNA-seq blog post

  • UClncR - detecting and quantifying expression of unknown and known lncRNAs. Works for unstranded and stranded RNA-seq. Incorporates StringTie, Sebnif for novel lncRNA detection, iSeeRNA for assessing noncoding potential. Annotates lncRNAs by the nearby protein-coding genes. Tested on real data using Gencode annotations with parts of lncRNA annotations removed.
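The set-overlap statistics listed in the LncSEA entry above (hypergeometric enrichment, Jaccard and Simpson overlaps, BH correction) can be sketched with the standard library alone. This is a generic illustration of those statistics, not LncSEA's code; the function names and the choice of a one-sided upper-tail test are assumptions, and both sets are assumed to be non-empty.

```python
from math import comb


def hypergeom_pvalue(k, population, successes, draws):
    """Upper-tail probability P(X >= k) for a hypergeometric distribution."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return tail / total


def enrichment(query, reference_set, background_size):
    """Overlap statistics between a query gene list and one reference set."""
    query, reference_set = set(query), set(reference_set)
    overlap = len(query & reference_set)
    return {
        "overlap": overlap,
        "p_value": hypergeom_pvalue(overlap, background_size,
                                    len(reference_set), len(query)),
        "jaccard": overlap / len(query | reference_set),
        "simpson": overlap / min(len(query), len(reference_set)),
    }


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


result = enrichment({"a", "b", "c", "d"}, {"c", "d", "e"}, background_size=10)
```

In practice one would compute `enrichment` for every reference set and pass the collected p-values to `benjamini_hochberg` once.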

circRNAs

  • DCC Python scripts and CircTest R visualization package - circular RNA detection. DCC uses STAR output (chimeric.out.junction) and detects back-splice junctions, filters, integrated replicate data. A much higher precision than competitors (CIRI, KNIFE), similar sensitivity. Tests for host gene-independence of circRNA expression across different experimental conditions.
    Paper Cheng, Jun, Franziska Metge, and Christoph Dieterich. "Specific identification and quantification of circular RNAs from sequencing data." Bioinformatics 32, no. 7 (2016): 1094-1096. https://doi.org/10.1093/bioinformatics/btv656
  • CIRCpedia database of circRNAs from human, mouse, and some model organisms. Ribo-, poly(A)-, RNase R methods for enriching for circRNAs. CIRCexplorer2 for the analysis of such experiments
    • Zhang et al., “Diverse Alternative Back-Splicing and Alternative Splicing Landscape of Circular RNAs.”

Gene fusion

  • MINTIE - identifying novel, rare transcriptional variants in cancer RNA-seq data. Detects fusions, transcribed structural variants (>=7bp), novel splice variants (flanked by >=20bp), complex variants (Figure 2). Filters, annotates, and prioritizes variants. Case(s) vs. control(s) analysis (single case vs. N controls). Four steps: transcriptome assembly of the case sample (SOAPdenovo-Trans), pseudo-alignment of cases and controls to an index composed of the assembled and reference transcripts (CHESS, Salmon), differential expression to identify upregulated novel features, and annotation of novel transcripts. Outperforms eight other variant detection methods on simulated and experimental datasets.
    Paper Cmero, Marek, Breon Schmidt, Ian J. Majewski, Paul G. Ekert, Alicia Oshlack, and Nadia M. Davidson. “MINTIE: Identifying Novel Structural and Splice Variants in Transcriptomes Using RNA-Seq Data.” Genome Biology 22, no. 1 (December 2021): 296. https://doi.org/10.1186/s13059-021-02507-8.
  • MetaFusion - gene fusion caller by filtering and aggregating calls from multiple (7 by default) fusion callers (included in Docker/Singularity images, orchestrated by GenPipes). Results are summarized into new Common Fusion Format. Includes FusionAnnotator tool. Documentation.
    Paper Apostolides, Michael, Yue Jiang, Mia Husić, Robert Siddaway, Cynthia Hawkins, Andrei L Turinsky, Michael Brudno, and Arun K Ramani. “MetaFusion: A High-Confidence Metacaller for Filtering and Prioritizing RNA-Seq Gene Fusion Candidates.” Edited by Janet Kelso. Bioinformatics 37, no. 19 (October 11, 2021): 3144–51. https://doi.org/10.1093/bioinformatics/btab249.
  • CICERO - gene fusion detection, local assembly-based, uses longer (>75bp) reads. Prioritizes candidates. Outperforms ChimeraScan, deFuse, FusionCatcher, Arriba on TCGA brain tumor data. FusionEditor imports CICERO's output for visualization. Imports paired-end FASTQs or aligned BAMs. Supports hg19 only. Web, GitHub

  • annoFuse - an R package for standardization, filtering and annotation of fusion calls detected by STAR-Fusion and Arriba, two best methods for fusion detection. Visualization options. Applied to OpenPBTA data.

  • ChimerDB is a comprehensive database of fusion genes encompassing analysis of deep sequencing data and manual curations. In this update, the database coverage was enhanced considerably by adding two new modules of TCGA RNA-Seq analysis and PubMed abstract mining

  • TUMOR FUSION GENE DATA PORTAL - Landscape of cancer-associated fusions using the Pipeline for RNA sequencing Data Analysis.

  • FusionScan – prediction of fusion genes from RNA-Seq data. RNA-seq blog post, GitHub

  • Arriba - Fast and accurate gene fusion detection from RNA-Seq data

  • FuSeq - fast fusion detection. Compared with FusionMap, TRUP, TopHat-Fusion, JAFFA, SOAPfuse

  • GeneFuse - Gene fusion detection and visualization

  • EricScript is a computational framework for the discovery of gene fusions in paired end RNA-seq data

Isoforms

CNVs and Structural variations

  • SuperFreq - CNV analysis from exome data adapted for RNA-seq data. Based on log fold-change variance estimation with the neighbour correction. R package, input - BAM files (reference normal needed), variant calls from samtools or other tools, output - visualization of CNAs, other variant-related plots

  • CaSpER - identification of CNVs from RNA-seq data, bulk and single-cell (full-transcript only, like SMART-seq). Uses a multi-scale smoothed global gene expression profile and a B-allele frequency (BAF) signal profile, detects concordant shifts in signal using a 5-state HMM (homozygous deletion, heterozygous deletion, neutral, one-copy-amplification, high-copy-amplification). Reconstructs subclonal CNV architecture for scRNA-seq data. Tested on GBM scRNA-seq, TCGA, other. Compared with HoneyBADGER. R code and tutorials

  • CNAPE - CNV detection from RNA-seq data. Regularized logistic regression (Lasso), trained on TCGA samples. Prediction accuracy >80%. R implementation

  • CNVkit-RNA - CNV estimation from RNA-seq data. Improved moving average approach, corrects for GC content, gene expression level, gene length, correlation of gene expression and CNV (estimated from TCGA). Docs, Video tutorial

  • InferCNV - Inferring copy number alterations from tumor single cell RNA-Seq data. R package. GitHub wiki. Part of Trinity Cancer Transcriptome Analysis Toolkit

  • SQUID - transcriptomic structural variation caller. Genome segment graph, then rearrange segments so that as many read alignments as possible are concordant with the rearranged sequence. Compared with MUMmer3, DELLY2, LUMPY in simulated settings, and with SOAPfuse, deFuse, FusionCatcher, JAFFA, INTEGRATE tools using real data

  • transindel - Indel caller for DNA-seq or RNA-seq

Networks

  • ANANSE (ANalysis Algorithm for Networks Specified by Enhancers) - gene regulatory network inference using TF binding profiles. Missing TF binding profiles predicted from cis-regulatory enhancer activity (H3K27ac, ATAC-seq, EP300), TF motif scores, average ChIP-seq signal of REMAP peaks in enhancers (logistic regression). Influence score - how well the expression differences between two cell types can be explained by a TF. Python implementation. Jupyter notebooks. ANANSE-inferred Tissue-specific networks, cell type-specific networks, GRNBoost2-inferred tissue-specific networks. Tweet.
    Paper Xu, Quan, Georgios Georgiou, Siebren Frölich, Maarten van der Sande, Gert Jan C Veenstra, Huiqing Zhou, and Simon J van Heeringen. “ANANSE: An Enhancer Network-Based Computational Approach for Predicting Key Transcription Factors in Cell Fate Determination.” Nucleic Acids Research 49, no. 14 (August 20, 2021): 7966–85. https://doi.org/10.1093/nar/gkab598.
  • MODifieR - R package wrapping 9 gene module inference methods from transcriptomics networks (WGCNA, DIAMOnD, DiffCoEx, MCODE, MODA, Module Discoverer, Clique-Sum, Correlation-Clique). Some methods include differential expression analysis. Consensus module detection. Docker. Vignette.
    Paper Weerd, Hendrik A de, Tejaswi V S Badam, David Martínez-Enguita, Julia Åkesson, Daniel Muthas, Mika Gustafsson, and Zelmina Lubovac-Pilav. “MODifieR: An Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks.” Edited by Lenore Cowen. Bioinformatics 36, no. 12 (June 1, 2020): 3918–19. https://doi.org/10.1093/bioinformatics/btaa235.
  • corto - R package for correlation-based gene network and master regulator analysis. Can correct for CNVs. Uses RNA-seq or ATAC-seq data. Benchmarked against ARACNE-AP, minet, RTN. GitHub.
    Paper Mercatelli, Daniele, Gonzalo Lopez-Garcia, and Federico M Giorgi. “Corto: A Lightweight R Package for Gene Network Inference and Master Regulator Analysis.” Bioinformatics 36, no. 12 (June 15, 2020). https://doi.org/10.1093/bioinformatics/btaa223.
  • GENIE3 - random forest regression detection of gene modules. Input - expression matrix, output - gene x gene square co-regulation matrix

  • PANDA networks - Tissue-Specific Gene Regulatory Networks constructed using PANDA

  • SCENIC networks - Tissue-specific networks inferred from single cell data using SCENIC

Transcription regulators

Integrative

  • Review of tools and methods for the integrative analysis of multiple omics data, cancer-oriented. Table 1 - multi-omics data repositories (TCGA, CPTAC, ICGC, CCLE, METABRIC, TARGET, Omics Discovery Index). Three broad areas of multi-omics analysis: 1. Disease subtyping and classification based on multi-omics profiles; 2. Prediction of biomarkers for various applications including diagnostics and driver genes for diseases; 3. Deriving insights into disease biology. Table 2 - software categorized by use case (PARADIGM, iClusterPlus, PSDF, BCC, MDI, SNF, PFA, PINSPlus, NEMO, mixOmics, moCluster, MCIA, JIVE, MFA, sMBPLS, T-SVD, Joint NMF). Brief description of each tool, links, exemplary publications. Table 3 - visualization portals (cBioPortal, Firebrowse, UCSC Xena, LinkedOmics, 3Omics, NetGestalt, OASIS, Paintomics, MethHC). Description of each, data types, analysis examples.

  • DIABLO - multi-omics analysis method. Overview of previous methods (SNF, Bayesian Consensus Clustering, NMF, JIVE, sGCCA, MOFA, others). Method extends sGCCA multivariate dimensionality reduction that uses SVD and selects co-expressed (correlated) variables from several omics datasets. Methods, model, iterative solution. Design matrix specifies which omics datasets are connected. Variable selection for biomarkers identification. Visualization options. Part of mixOmics R package, Documentation

    • Singh, Amrit, Casey P Shannon, Benoît Gautier, Florian Rohart, Michaël Vacher, Scott J Tebbutt, and Kim-Anh Lê Cao. “DIABLO: An Integrative Approach for Identifying Key Molecular Drivers from Multi-Omics Assays.” Edited by Inanc Birol. Bioinformatics 35, no. 17 (September 1, 2019): 3055–62. https://doi.org/10.1093/bioinformatics/bty1054.
  • MANCIE - matrix analysis and normalization by concordant information enhancement. Bias correction and data integration of distinct genomic profiles on the same samples. Match matrices by rows, run correlation for each row, replace the associated row with modified values using a PCA procedure, Methods. Tested on integration of DHS and gene expression data, TCGA and METABRIC data. R package

  • JIVE - Joint and Individual Variation Explained. Decomposition of multiple omics datasets (X_i) into three terms: a low-rank (constrained) matrix capturing joint variation (J), structured variation individual to each dataset (A_i), and residual noise (R). Each dataset is row-centered and scaled by its total variation. Main constraint: the rows of the joint and individual matrices should be orthogonal. Estimate matrices by iteratively minimizing ||R||^2 (R=X-J-A). Relationship to PCA, CCA, PLS. Illustrated on TCGA GBM gene expression, methylation, and miRNA data, with interpretation. Matlab code, r.jive package

  • List of software packages for multi-omics analysis, by Mike Love. Slides for the talk "Assessing consistency of unsupervised multi-omics methods".
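The JIVE entry above describes its model in words; written out, it might look as follows. The notation is partly assumed here: k is the number of datasets, X, J, A, and R denote the row-wise stacking of the per-dataset blocks, and r and r_i are the ranks of the joint and individual terms.

```latex
\begin{aligned}
X_i &= J_i + A_i + R_i, \qquad i = 1, \dots, k, \\
\operatorname{rank}(J) &= r, \qquad \operatorname{rank}(A_i) = r_i, \qquad J A_i^{\top} = 0, \\
(\hat{J}, \hat{A}) &= \arg\min_{J,\,A} \lVert R \rVert^2, \qquad R = X - J - A.
\end{aligned}
```

The orthogonality condition in the second line is what keeps variation shared across datasets from also being counted as individual variation.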

Classification

Visualization

  • chromoMap - an R package/function for visualizing BED-like data across chromosomes. Static and interactive (Shiny embeddable) plots, segment-, point-, barplot-, scatterplot visualization. Filters to color visualization by criteria. Colors, spacing, height/width - all customizable. Supports multiple organisms. Documentation, tutorial

  • genomation - a toolkit for annotation and visualization of genomic data, R package

  • karyoploteR - An R/Bioconductor package to plot arbitrary data along the genome

  • pcaExplorer - Interactive Visualization of RNA-seq Data Using a Principal Components Approach, R package

  • WIlsON - Web-based Interactive Omics VisualizatioN, accepts text files and SummarizedExperiment datasets.

Data

  • ARCHS4 - Massive Mining of Publicly Available RNA-seq Data from Human and Mouse

  • cBioPortalData - cBioPortal data as MultiAssayExperiment objects, by Waldron Lab

  • curatedTCGAData - Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment objects, by Waldron Lab

  • gtexRNA - R package for retrieval of tissue-specific expression data from GTEx. By Sigve Nakken, website

  • GTEx Visualizations - web-based visualization tools for exploring tissue-specific gene expression and regulation

  • PINS - A novel method for data integration and disease subtyping

  • refine.bio - harmonized microarray and RNA-seq data for various organisms and conditions

  • recount2 - an R workflow to work with recount2 data

  • GREIN - re-analysis of RNA-seq datasets from GEO. Download processed data, visualization, power analysis, differential expression, functional enrichment analysis, connectivity analysis with LINCS L1000 data. GitHub, Docker image

  • DEE2 - Digital Expression Explorer - gene- and transcript-level processed data from multiple organisms, amenable to downstream analysis in R etc. getDEE2 R package to get the data

  • GEMMA - curated transcriptomic database, >10,000 studies, ~34% are brain-related. Query genes, phenotypes, experiments, search for coexpression, differential expression. Processing methods, batch correction. Online access, API, R package. GitHub

Genes

  • Enrichr - enrichment analysis, gene search, term search. Libraries for various signatures are available for download

  • CellMarker - Cell markers of different cell types from different tissues in human and mouse.

  • An updated list of 1439 DNA-binding transcription factors from a re-annotation study of transcription factors in Gene Ontology annotations

  • List of gene lists for genomic analyses - GitHub repo with tab-separated annotated lists

  • CREEDS - database of manually (and automatically) extracted gene signatures. Single gene perturbations, disease signatures, single drug perturbations. Batch effect correction, when necessary. Overall, good agreement with MSigDb C2. Characteristic Direction (CD) method to detect differential genes. API access in R

  • DIOPT - finding ortholog genes among human, mouse, zebrafish, C. elegans, Drosophila, S. cerevisiae. Integration with human GWAS allows searching for orthologs for diseases and traits. Batch conversion, filtering. DIOPT-DIST - DIOPT Diseases and Traits.

    Paper Hu, Yanhui, Ian Flockhart, Arunachalam Vinayagam, Clemens Bergwitz, Bonnie Berger, Norbert Perrimon, and Stephanie E Mohr. “An Integrative Approach to Ortholog Prediction for Disease-Focused and Other Functional Studies.” BMC Bioinformatics 12, no. 1 (December 2011): 357. https://doi.org/10.1186/1471-2105-12-357.

Misc

  • RRHO - Rank–rank Hypergeometric Overlap between two gene lists ranked by the degree of differential expression (e.g., signed -log10 p-value). 2D analog of GSEA. Identifies and visualizes areas of significant overlap by determining the degree of statistical enrichment using the hypergeometric distribution while sliding across all possible thresholds through the two ranked lists. Multiple testing correction (FWER, Benjamini-Yekutieli). Website and R/Bioconductor RRHO package.
    Paper Plaisier, Seema B, Richard Taschereau, Justin A Wong, and Thomas G Graeber. “Rank–Rank Hypergeometric Overlap: Identification of Statistically Significant Overlap between Gene-Expression Signatures.” Nucleic Acids Research 38, no. 17 (2010): e169. https://doi.org/10.1093/nar/gkq636
  • gtftk - A python package and a set of shell commands to handle GTF files. Subcommands for editing GTF files, getting information and summary statistics, selecting by various criteria, converting BED to gtf and other formats, annotating by closest genes and more, getting sequences, coordinates of specific gene elements, coverage profile and other bigWig operations.

  • Recommended Coverage and Read Depth for NGS Applications, by GenoHub, and their NGS Handbook

  • BioJupies - analysis of GEO/GTEx data or your own gene expression table/FASTQ in autogenerated Jupyter notebook. Rich set of tools for EDA (PCA, Clustergrammer, Library size analysis), Differential expression analysis (Volcano, MA plots), Enrichment analysis (Enrichr, GO, Pathway, TF, Kinase, miRNA enrichments), L1000 signatures. Best suited for two-group analysis. Includes Methods for the selected tools

  • HGNChelper - Handy Functions for Working with HGNC Gene Symbols and Affymetrix Probeset Identifiers

  • tximport - importing transcript abundance datasets from Salmon, Sailfish, kallisto, RSEM, and differential analysis

  • rpkmforgenes - a Python script for calculating gene expression for RNA-Seq data

  • Python interface to access reference genome features from Ensembl, e.g., genes, transcripts, and exons

  • TPMcalculator - converts gene counts to TPM using transcript information from a GTF file. TPM vs. FPKM correlation for validation. C/C++ command line tool, Docker image, CWL workflow

  • Multi-omics madness picture, Tweet, download
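The counts-to-TPM conversion mentioned for TPMcalculator above follows a standard formula: divide each gene's count by its length in kilobases, then rescale so the values sum to one million. The sketch below shows only that arithmetic. It assumes gene lengths are already known (TPMcalculator derives them from a GTF file, which is not reproduced here), and the function name `counts_to_tpm` is hypothetical.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts: dict mapping gene -> read count.
    lengths_bp: dict mapping gene -> gene (or effective) length in base pairs.
    Hypothetical helper for illustration only.
    """
    # Reads per kilobase: normalizes for gene length.
    rate = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Per-million scaling factor: normalizes for sequencing depth.
    scale = sum(rate.values()) / 1e6
    return {gene: (r / scale if scale else 0.0) for gene, r in rate.items()}


tpm = counts_to_tpm(
    {"geneA": 100, "geneB": 200, "geneC": 0},
    {"geneA": 1000, "geneB": 4000, "geneC": 500},
)
```

Here geneB has twice the reads of geneA but four times the length, so its TPM is half that of geneA. Unlike FPKM, TPM values always sum to the same total within a sample.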

Multi-omics madness
