- So you want to be a computational biologist?
- Scientific computing: Code alert Nature News.
- Practical Computing for Biologists. One of the first books that got me started in coding.
- ModernDive An Introduction to Statistical and Data Sciences via R
- The Biologist’s Guide to Computing A book written by @tjelvar_olsson
- An Introduction To Applied Bioinformatics Interactive lessons in bioinformatics
- The Biostar Handbook: A Beginner's Guide to Bioinformatics I am honored to be a co-author of this book. The ChIP-seq section is going to be released by mid-2017.
- Beginner's Handbook to Next Generation Sequencing Everything you need to know about starting a sequencing project
- A New Online Computational Biology Curriculum PLOS Computational Biology paper.
- PH525x series - Biomedical Data Science The best course to get you started with genomics using R. I took the same course three times to get a deep understanding of the concepts and R commands.
- Expanding the computational toolbox for mining cancer genomes Nature Review.
- some repos from command line to rstats and github
- 2016 review Coming of age: ten years of next-generation sequencing technologies
- Cancer genomics — from bench to bedside: review papers from Nature
- applied computational genomics by Aaron Quinlan, the creator of bedtools and many other cool tools.
- BMMB 852: Applied Bioinformatics (Fall, 2016) by Istvan Albert, the creator of biostars.
- JHU EN.600.649: Computational Genomics: Applied Comparative Genomics by Michael Schatz.
If you are from fields outside of biology, places to get you started:
- Tales from the Genome A course by Udacity and 23andMe.
- The Biology of Cancer A classic textbook by Robert A. Weinberg. A must read for all cancer biologists.
- Molecular Biology of the Cell A textbook
- Learn Genetics from University of Utah learning center.
- seeing theory The goal of the project is to make statistics more accessible to a wider range of students through interactive visualizations.
- Points of Significance: Interpreting P values
- statistics for biologists
- A Bioinformatician's UNIX Toolbox from Heng Li
- command line bootcamp teaches you Unix commands step by step
- Unix in your browser. Maybe useful for teaching bash?
- Use the Unofficial Bash Strict Mode (Unless You Looove Debugging)
- Defensive BASH Programming A very good read for bash programming.
- process substitution: Using Named Pipes and Process Substitution in Bioinformatics; Handy Bash feature: Process Substitution
- Commonly used commands for PBS scheduler: Monitoring and Managing Your Job
- test your unix skills at cmd challenge
- people say awk is not part of bioinformatics :) Still very useful for parsing plain text files. Steve's Awk Academy
- intro-bioinformatics: Website and slides for intro to bioinformatics class at Fred Hutch
- tmate: Instant terminal sharing
- tmux is a terminal multiplexer similar to screen but has more features.
  - tmux cheatsheet
  - tmux config
  - tmux install without root
Theory and quick reference
There are 3 file descriptors, stdin, stdout and stderr (std=standard).
Basically you can:
- redirect stdout to a file
- redirect stderr to a file
- redirect stdout to stderr
- redirect stderr to stdout
- redirect stderr and stdout to a file
- redirect stderr and stdout to stdout
- redirect stderr and stdout to stderr

1 'represents' stdout and 2 stderr. A little note for seeing these things: with the less command you can view both stdout (which will remain in the buffer) and stderr, which will be printed on the screen but erased as you try to 'browse' the buffer.
- stdout 2 file
This will cause the output of a program to be written to a file.
ls -l > ls-l.txt
Here, a file called 'ls-l.txt' will be created and it will contain what you would see on the screen if you type the command 'ls -l' and execute it.
- stderr 2 file
This will cause the stderr output of a program to be written to a file.
grep da * 2> grep-errors.txt
Here, a file called 'grep-errors.txt' will be created and it will contain the stderr portion of the output of the 'grep da *' command.
- stdout 2 stderr
This will cause the stdout output of a program to be written to the same file descriptor as stderr.
grep da * 1>&2
Here, the stdout portion of the command is sent to stderr; you may notice that in different ways.
- stderr 2 stdout
This will cause the stderr output of a program to be written to the same file descriptor as stdout.
grep da * 2>&1
Here, the stderr portion of the command is sent to stdout. If you pipe to less, you'll see that lines that normally 'disappear' (as they are written to stderr) are now kept (because they're on stdout).
- stderr and stdout 2 file
This will send all output of a program to a file. This is sometimes suitable for cron entries, if you want a command to run in absolute silence.
rm -f $(find / -name core) &> /dev/null
This (thinking of the cron entry) will delete every file called 'core' in any directory. Notice that you should be pretty sure of what a command is doing if you are going to discard its output.
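The redirections above can be tried together in one short session. This is only a minimal sketch, assuming a POSIX shell and that mktemp is available; the function name emit and the file names out.txt, err.txt and both.txt are made up for illustration.

```shell
# Work in a scratch directory so nothing real gets overwritten.
cd "$(mktemp -d)"

# Hypothetical helper: writes one line to stdout and one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Send stdout and stderr to separate files.
emit > out.txt 2> err.txt

# Send both streams to one file. Order matters: redirect stdout first,
# then point stderr (2) at wherever stdout (1) now goes.
emit > both.txt 2>&1

# Discard everything; a portable spelling of bash's '&> /dev/null'.
emit > /dev/null 2>&1
```

After running this, out.txt should hold only the stdout line, err.txt only the stderr line, and both.txt both of them.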
- change permissions of files
Each digit is for one of: user, group and other.
chmod 754 myfile
This means the user has read, write and execute permission; members of the same group have read and execute permission but no write permission; everyone else has only read permission.
4 stands for "read",
2 stands for "write",
1 stands for "execute", and
0 stands for "no permission."
So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
The numbers are sometimes hard to remember, so one can use letters instead: u, g, and o stand for "user", "group", and "other"; "r", "w", and "x" stand for "read", "write", and "execute", respectively.
chmod u+x myfile
chmod g+r myfile
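To check that the numeric and the letter forms really mean the same thing, something like the following can be run. It is a small sketch, assuming a POSIX shell with mktemp; myfile here is a throwaway file in a scratch directory, and the variable names numeric and symbolic are just for illustration.

```shell
# Work in a scratch directory so no real file is touched.
cd "$(mktemp -d)"
touch myfile

# 7 = 4+2+1 (rwx) for user, 5 = 4+0+1 (r-x) for group, 4 = 4+0+0 (r--) for other.
chmod 754 myfile
# Characters 2-10 of 'ls -l' are the nine permission flags.
numeric=$(ls -l myfile | cut -c2-10)
echo "$numeric"

# The same permissions built up with letters, starting from no permissions.
chmod 000 myfile
chmod u+rwx,g+rx,o+r myfile
symbolic=$(ls -l myfile | cut -c2-10)
echo "$symbolic"
```

Both echo lines should print rwxr-xr--.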
- scary-excel-stories
- convert xlsx to csv: xlsx2csv
- csvkit
- GNU datamash
- tabtk Toolkit for processing TAB-delimited format from Heng Li, the author of Samtools, BWA and many others.
- Another cross-platform, efficient, practical and pretty CSV/TSV toolkit in Golang
- visidata A console spreadsheet tool for discovering and arranging data
It is really important to name your files correctly! see a ppt by Jenny Bryan.
Three principles for (file) names:
- Machine readable (do not put special characters or spaces in the name)
- Human readable (easy to figure out what the heck something is based on its name; add a slug)
- Plays well with default ordering:
  - Put something numeric first
  - Use the ISO 8601 standard for dates (YYYY-MM-DD)
  - Left pad other numbers with zeros
Good naming of your files can help you extract metadata from the file name
- dirdf Create tidy data frames of file metadata from directory and file names.
> dir("examples/dataset_1/")
[1] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"
[2] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv"
[3] "2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"
[4] "2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv"
[5] "2016-04-01_BRAFWTNEG_FFPEDNA-CRC-1-41_E12.csv"
> library("dirdf")
> dirdf("examples/dataset_1/", template="date_assay_experiment_well.ext")
date assay experiment well ext pathname
1 2013-06-26 BRAFWTNEG Plasmid-Cellline-100 A01 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv
2 2013-06-26 BRAFWTNEG Plasmid-Cellline-100 A02 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv
3 2014-02-26 BRAFWTNEG FFPEDNA-CRC-1-41 D08 csv 2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv
4 2014-03-05 BRAFWTNEG FFPEDNA-CRC-REPEAT H03 csv 2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv
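The same idea works without R. Below is a rough sketch in plain POSIX shell of pulling the fields out of one of the file names listed above; it assumes the date_assay_experiment_well.ext layout, and the variable names are just for illustration.

```shell
# Example name taken from the listing above: date_assay_experiment_well.ext
file="2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"

# Strip the extension, then split the rest on underscores.
ext=${file##*.}
base=${file%.*}
IFS=_ read -r date assay experiment well <<EOF
$base
EOF

echo "$date $assay $experiment $well $ext"
```

Because the fields are separated by underscores and the hyphens stay inside a field, the date and the experiment name come out intact.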
Using these tools will greatly improve your working efficiency and get rid of most of your for loops.
- xargs
- GNU parallel. One of my posts is here
- gargs by Brent Pedersen. Written in Go.
- future: Unified Parallel and Distributed Processing in R for Everyone
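As a quick illustration of swapping a for loop for xargs, here is a minimal sketch. The sample names are hypothetical and echo stands in for whatever command you would really run on each one.

```shell
# Hypothetical sample names; in practice these would be file names.
samples="sampleA sampleB sampleC"

# The for-loop version: one item at a time.
loop_out=$(for s in $samples; do echo "processed $s"; done)

# The xargs version: -n 1 passes one item per command invocation.
# Adding -P 4 (supported by GNU and BSD xargs, not required by POSIX) would
# run up to four invocations at once, at the cost of unordered output.
xargs_out=$(printf '%s\n' $samples | xargs -n 1 echo processed)

echo "$xargs_out"
```

Without -P the two versions produce the same lines in the same order, so it is easy to check the xargs form before turning on parallel execution.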
- Essence of linear algebra
- statistics for biologists A collection of Nature articles on statistics in biology.
- biobroom:Turn Bioconductor objects into tidy data frames
- readr
- tidyr
- purrr tutorial by Jenny Bryan. Functional programming in R.
- janitor simple tools for data cleaning in R.
- dplyr
- replyr An R package for fluid use of dplyr.
- Introduction of Parameterized dplyr expression using replyr
- wrapr wraps R functions for debugging and better standard evaluation; see the Let function. Blog post: wrapr: for sweet R code
- csv fingerprint
- ggplot2
- ggplot2 tips
- A List of ggplot2 extensions
- nice ggplot themes
- colourpicker A colour picker tool for Shiny and for selecting colours in plots (in R). R blogger post
- ggforce: facet_zoom() to zoom in part of the figure! and many more.
- ggedit – interactive ggplot aesthetic and theme editor.
- trelliscopejs is an R package that brings faceted visualizations to life while plugging in to common analytical workflows like ggplot2 or the “tidyverse”.
- Plotting background data for groups with ggplot2
- Ordering categories within ggplot2 facets
- plotly for R
- rematch2: Tidy output from regular expression matches
- Make waffle (square pie) charts in R
- Bring the power of R to the command line: littler, and Rio, a wrapper by Jeroen Janssens, the author of Data Science at the Command Line
- htmlwidgets for R including d3heatmap for interactive heatmaps.
- focus() on correlations of some variables with many others
- Explore correlations in R with corrr
- Unit test in R
- sinaplot: an enhanced chart for simple and truthful representation of single observations over multiple classes. ggforce has geom_sina for the same purpose.
- ComplexHeatmap
- superheat Another heatmap package worth learning besides ComplexHeatmap. Not as flexible as ComplexHeatmap, but can be handy when the function you want has been implemented.
- iheatmapr is an R package for building complex, interactive heatmaps using modular building blocks.
- heatmap:gapmap
- dendsort:Modular Leaf Ordering Methods for Dendrogram Nodes
- dendextend
- Interactive Heat Maps for R Using plotly
- Multiple plots on a page
- ggExtra
- cowplot -- An add-on to the ggplot2 plotting package
- ggplot2 - Easy way to mix multiple graphs on the same page - R software and data visualization
- Extract Tables from PDFs
- Alternative to Venn diagrams! UpSetR
- hierarchicalSets
- Intervene is a tool for intersection and visualization of multiple gene or genomic region sets.
- In-depth introduction to machine learning in 15 hours of expert videos
- Data Analysis and Visualization Using R: This is a course that combines video, HTML and interactive elements to teach the statistical programming language R.
- These are the course notes for the Monash Bioinformatics Platform’s “R More” course
- gitbook: Getting used to R, RStudio, and R Markdown
- Efficient R programming
- R for Data Science by Garrett Grolemund and Hadley Wickham
- Lightning Fast Serialization of Data Frames for R: faster than data.table and feather.
- RPubs post: Handling large data sets in R
- R package primer: a minimal tutorial
- Write your own R package
- R packages a book by Hadley Wickham.
- Developing R packages from Jeff Leek.
- docopt.R tutorial
- python version
- Generate a CLI tool from a Python module/function
- Introducing Python Fire, a library for automatically generating command line interfaces
- Nature Methods Points of View: data visualization
- A tutorial for the free Inkscape cross-platform vector graphics editor
- gimp for bit-map based figures.
- 30 Python Language Features and Tricks You May Not Know About
- intermediatePython
- The Hitchhiker’s Guide to Python!
- Python 3 for Scientists
- Python FAQ: Why should I use Python 3?
- gitbook: Computational and Inferential Thinking; The Foundations of Data Science
- A collection of python courses online
- tpot:A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.
- Easy to use Python API wrapper to plot charts with matplotlib, plotly, bokeh and more: chartpy creates a simple easy to use API to plot in a number of great Python chart libraries like plotly (via cufflinks), bokeh and matplotlib, with a unified interface. You simply need to change a single keyword to change which chart engine to use, rather than having to learn the low level details of each library.
- Top 8 resources for learning data analysis with pandas
- Jupyter Notebooks for the Python Data Science Handbook
There are many online web-based tools for visualization of (cancer) genomic data. I put my collection here. I use R for visualization. See a nice post using Python by Radhouane Aniba: Genomic Data Visualization in Python
- UCSC cancer genome browser It has many datasets built in, including TCGA data, and can be very handy for both bench scientists and bioinformaticians.
- UCSC Xena. A new tool developed by the UCSC team as well. Potentially very useful, but it needs more tutorials to follow.
- UCSC genome browser. One of the most famous genome browsers and my favorite. Every person studying genetics, genomics and molecular biology needs to know how to use it. Tutorials from OpenHelix.
- Epiviz 3 is an interactive visualization tool for functional genomics data. It supports genome navigation like other genome browsers, but allows multiple visualizations of data within genomic regions using scatterplots, heatmaps and other user-supplied visualizations.
- Mutation Annotation & Genome Interpretation TCGA: MAGI
- GeneProteinViz (GPViz) is a versatile Java-based software for dynamic gene-centered visualization of genomic regions and/or variants.
- ProteinPaint: Web Application for Visualizing Genomic Data The software developed for this project highlights critical attributes about the mutations, including the form of protein variant (e.g. the new amino acid as a result of missense mutation), the name of sample from which the mutation was identified, and whether the mutation is somatic or germline.
- DisGeNET is a discovery platform integrating information on gene-disease associations (GDAs) from several public data sources and the literature
- Cancer3D is a database that unites information on somatic missense mutations from TCGA and CCLE, allowing users to explore two different cancer-related problems at the same time: drug sensitivity/biomarker identification and prediction of cancer drivers
- clinical interpretations of variants in cancer
- R Wrapper for DGIdb Drug-gene interaction database.
- BioGRID The Biological General Repository for Interaction Datasets
- The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands
- Public data and open source tools for multi-assay genomic investigation of disease
- cancer cell metabolism genes
- oncogenes and tumor suppressors biostar post and TSgene
- DriverDB: A database for cancer driver gene/mutation
- Interaction of genes: GENEMANIA
- DATA DISCOVERY PLATFORM: Designed for researchers who use, share and collaborate on human genomic data
- zenodo: research shared
- dataMed biomedical and healthCAre Data Discovery Index Ecosystem.
- Repositive Discover a better way of searching for genomic data.
- The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. A copy of TCGA and TARGET data? Data Release Notes
- OASIS genomics from Pfizer. processed data from TCGA, CCLE, GTEx.
- TCGA alternative splicing
- ISOexpresso: a web-based platform for isoform-level expression analysis in human cancer
- omics database The Omics Discovery Index (OmicsDI) provides dataset discovery across a heterogeneous, distributed group of Transcriptomics, Genomics, Proteomics and Metabolomics data resources spanning eight repositories in three continents and six organisations, including both open and controlled access data resources. The resource provides a short description of every dataset: accession, description, sample/data protocols, biological evidences, publication, etc. Based on these metadata, OmicsDI provides extensive search capabilities, as well as identification of related datasets by metadata and data content where possible. In particular, OmicsDI identifies groups of related, multi-omics datasets across repositories by shared identifiers.
- MAGI Mutation Annotation & Genome Interpretation for TCGA data.
- How to successfully apply for access to dbGaP
- AnnotationHub bioconductor package for TCGA and epigenome roadmap, ENCODE project.
- TCGAbiolinks bioconductor package.
- GenomicDataCommons bioc package to access GDC.
- RTCGA bioconductor
- f1000 workflow paper TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages
- paper Data mining The Cancer Genome Atlas in the era of precision cancer medicine
- CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms.
- Ferret, a User-Friendly Java Tool to Extract Data from the 1000 Genomes Project
- EGA:European Genome-phenome Archive
- survival curves for TCGA data: a simple web tool
- AACR Project GENIE data guide
- High-dimensional genomic data bias correction and data integration using MANCIE: corrects batch effects for data from different sequencing methods (RNAseq vs ChIPseq).
- PH525x series - Biomedical Data Science. Learn R and bioconductor.
- PCA, MDS, k-means, Hierarchical clustering and heatmap. I wrote it.
- A tale of two heatmaps. I wrote it.
- Heatmap demystified. I wrote it.
- Cluster Analysis in R - Unsupervised machine learning very practical intro on STHDA website.
- I wrote on PCA, and heatmaps on Rpub
- A must read for clustering analysis of high-dimensional biological data: Avoiding common pitfalls when clustering biological data
- How does gene expression clustering work? A must read for clustering.
See https://t.co/yxCb85ctL1: "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters" @mikelove @AndrewLBeam (Rileen Sinha, @RileenSinha, August 25, 2016)
paper: Outlier Preservation by Dimensionality Reduction Techniques
- Rtsne R package for T-SNE
- rtsne An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding). There was a bug in rtsne: https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32
- PHATE dimensionality reduction method paper: http://biorxiv.org/content/early/2017/03/24/120378
- Survival analysis of TCGA patients integrating gene expression (RNASeq) data
- Tutorial: Machine Learning For Cancer Classification. It has four parts.
- The Open Source Data Science Masters
- Path to a free self-taught education in Data Science!
- Path to a free self-taught education in Bioinformatics!
- Udacity
- Coursera
- edx
- git intro by github
- learn git branching
- A Git Workflow Walkthrough Series
- paper:A Quick Introduction to Version Control with Git and GitHub
- paper:Ten Simple Rules for Taking Advantage of Git and GitHub
- software carpentry git novice lesson
- git best practice
- git-hub cheatsheet
- oh shit git! Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible. Git documentation has this chicken and egg problem where you can't search for how to get yourself out of a mess, unless you already know the name of the thing you need to know about in order to fix your problem.
- How to undo (almost) anything with Git
Automation wins in the long run.
STEP 6 is usually missing!
The pic was downloaded from http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method
- Awesome youtube video for reproducible workflow
- Reproducibility starts at home A series of blog posts by Jon Zelner.
- The hard road to reproducibility commentary in Science Magazine.
- Five selfish reasons to work reproducibly Genome Biology paper.
- Make lessons from software carpentry
- biomake GNU-Make-like utility for managing builds and complex workflows.
- STAT545 Automating data analysis pipelines
- Existing Workflow systems
- Workflow management software for pipeline development in NGS
- pipelines
- biostar post:Job Manager to parallelize otherwise consecutive bash scripts
- paper:A review of bioinformatic pipeline frameworks
- initial steps toward reproducible research
- JupyterLab: the next generation of the Jupyter Notebook
- R notebook
- BEAKER THE DATA SCIENTIST'S LABORATORY
- nteract notebook (https://nteract.io/)
- A video by Dr. Keith A. Baggerly from MD Anderson: The Importance of Reproducible Research in High-Throughput Biology. Very interesting, and Keith is really a fun guy!
- paper: Ten Simple Rules for Reproducible Computational Research
- open-research
- Good Enough Practices in Scientific Computing We present a set of computing tools and techniques that every researcher can and should adopt. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past six years, and from a variety of other guides. Unlike some other guides, our recommendations are aimed specifically at people who are new to research computing. Well worth reading!
- A Quick Guide to Organizing Computational Biology Projects A must read for computational biologists!
- Ten Simple Rules for Digital Data Storage
I am using snakemake and so far I am very happy with it!
- Have you ever had a problem reusing one of your own published figures because of the journal's copyright? Here is the solution, from @LorenaABarba
As an early adopter of the Figshare repository, I came up with a strategy that serves both our open-science and our reproducibility goals, and also helps with this problem: for the main results in any new paper, we would share the data, plotting script and figure under a CC-BY license, by first uploading them to Figshare.
- Survival plots have never been so informative: survminer package
- posts for survival analysis:
  - Survival Analysis - 1 KM estimator
  - Survival Analysis - 2 Cox's proportional hazards model
  - Overall Survival Curves for TCGA and Tothill by RD Status
  - Survival analysis of TCGA patients integrating gene expression (RNASeq) data
- survminer
- slack: A messaging app for teams.
- Ryver.
- Trello lets you work more collaboratively and get more done.
- densityCut: an efficient and versatile topological approach for automatic clustering of biological data
- Interactive visualisation and fast computation of the solution path: convex bi-clustering by Genevera Allen cvxbiclustr and the clustRviz package coming.
- CRISPR GENOME EDITING MADE EASY
- CRISPR design from Japan
- CRISPResso: Analysis of CRISPR-Cas9 genome editing outcomes from deep sequencing data
- CRISPR-DO: A whole genome CRISPR designer and optimizer in human and mouse
- CCTop - CRISPR/Cas9 target online predictor
- DESKGEN
- Genome-wide Unbiased Identifications of DSBs Evaluated by Sequencing (GUIDE-seq) is a novel method the Joung lab has developed to identify the off-target sites of CRISPR-Cas RNA-guided Nucleases
- WTSI Genome Editing (WGE) is a website that provides tools to aid with genome editing of human and mouse genomes