GitHub - glianeuronc123/getting-started-with-genomics-tools-and-resources: Unix, R and python tools for genomics

General

So you want to be a computational biologist?
Scientific computing: Code alert Nature News.
Practical computing for biologist. One of my first books to get me started in coding.
ModernDive An Introduction to Statistical and Data Sciences via R
The Biologist’s Guide to Computing A book written by @tjelvar_olsson
An Introduction To Applied Bioinformatics Interactive lessons in bioinformatics
The Biostar Handbook: A Beginner's Guide to Bioinformatics I am honored to be a co-author of this book. The ChIP-seq section is going to be released by mid-2017.
Beginner's Handbook to Next Generation Sequencing Everything you need to know about starting a sequencing project
A New Online Computational Biology Curriculum PLOS genetics paper.
PH525x series - Biomedical Data Science The best course to get you started with genomics using R. I have taken the same course three times to get a deep understanding of the concepts and R commands.
Expanding the computational toolbox for mining cancer genomes Nature Review.
some repos from command line to rstats and github
2016 review Coming of age: ten years of next-generation sequencing technologies
Cancer genomics — from bench to bedside: review papers from Nature

courses

applied computational genomics by Aaron Quinlan, the creator of bedtools and many other cool tools.
BMMB 852: Applied Bioinformatics (Fall, 2016) by Istvan Albert, the creator of biostars.
JHU EN.600.649: Computational Genomics: Applied Comparative Genomics by Michael Schatz.

Some biology

If you are from fields outside of biology, places to get you started:

Tales from the Genome A course by Udacity and 23andMe.
The Biology of Cancer A classic text book by Robert A. Weinberg. A must read for all cancer biologists.
Molecular Biology of the Cell A text book
Learn Genetics from University of Utah learning center.

Some statistics

seeing theory The goal of the project is to make statistics more accessible to a wider range of students through interactive visualizations.
Points of Significance: Interpreting P values
statistics for biologists

Linux commands

A Bioinformatician's UNIX Toolbox from Heng Li
Linux command line exercises for NGS data processing
command line bootcamp teaches you unix commands step by step
Unix in your browser. Maybe useful for teaching bash?
A Book for Anyone to Get Started with Unix
bash one-liners for bioinformatics
some of my bash one-liner collections
Use the Unofficial Bash Strict Mode (Unless You Looove Debugging)
Defensive BASH Programming very good read for bash programming.
Better Bash Scripting in 15 Minutes
bash pitfalls
Advancing in the Bash Shell
Bash tips
Bash by example
process substitution: Using Named Pipes and Process Substitution in Bioinformatics Handy Bash feature: Process Substitution
NGS Advanced Beginner/Intermediate Shell
Commonly used commands for PBS scheduler: Monitoring and Managing Your Job
test your unix skills at cmd challenge
People say awk is not part of bioinformatics :) Still very useful for parsing plain text files. Steve's Awk Academy
intro-bioinformatics: Website and slides for intro to bioinformatics class at Fred Hutch
tmate:Instant terminal sharing
tmux is a terminal multiplexer similar to screen but with more features. tmux cheatsheet
tmux config
tmux install without root
All about redirection

Theory and quick reference

There are 3 file descriptors, stdin, stdout and stderr (std=standard).

Basically you can:

redirect stdout to a file
redirect stderr to a file
redirect stdout to stderr
redirect stderr to stdout
redirect stderr and stdout to a file
redirect stderr and stdout to stdout
redirect stderr and stdout to stderr

1 'represents' stdout and 2 stderr.

A little note for seeing these things: with the less command you can view both stdout (which will remain on the buffer) and the stderr that will be printed on the screen, but erased as you try to 'browse' the buffer.

stdout 2 file

This will cause the output of a program to be written to a file.

     ls -l > ls-l.txt

Here, a file called 'ls-l.txt' will be created and it will contain what you would see on the screen if you type the command 'ls -l' and execute it.

stderr 2 file

This will cause the stderr output of a program to be written to a file.

     grep da * 2> grep-errors.txt

Here, a file called 'grep-errors.txt' will be created and it will contain the stderr portion of the output of the 'grep da *' command.

stdout 2 stderr

This will cause the stdout output of a program to be written to the same file descriptor as stderr.

     grep da * 1>&2

Here, the stdout portion of the command is sent to stderr; you may notice that in different ways.

stderr 2 stdout

This will cause the stderr output of a program to be written to the same file descriptor as stdout.

     grep da * 2>&1

Here, the stderr portion of the command is sent to stdout. If you pipe to less, you'll see that lines that normally 'disappear' (as they are written to stderr) are being kept now (because they're on stdout).

stderr and stdout 2 file

This will place every output of a program to a file. This is suitable sometimes for cron entries, if you want a command to pass in absolute silence.

     rm -f $(find / -name core) &> /dev/null

This (thinking of the cron entry) will delete every file called 'core' in any directory. Notice that you should be pretty sure of what a command is doing if you are going to wipe its output.
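The redirections above can be tried side by side with a small sketch. The `emit` function and the file names are hypothetical, made up for illustration; any command that writes to both streams would do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes one line to stdout and one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

workdir=$(mktemp -d)

# stdout and stderr to separate files
emit > "$workdir/out.txt" 2> "$workdir/err.txt"

# Both streams into one file. Order matters: point stdout at the file
# first, then make stderr a copy of stdout.
# In bash, "emit &> file" is a shorthand for the same thing.
emit > "$workdir/both.txt" 2>&1

# A pipe only carries stdout, so merge stderr into it before the pipe
# if the next command should see it too.
emit 2>&1 | grep -c .          # counts both lines
emit 2>/dev/null | grep -c .   # stderr discarded, counts one line
```

Writing `2>&1 > file` (the other order) would leave stderr on the terminal, because stderr is copied from stdout before stdout is pointed at the file.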

Change permissions of files
Each digit is for: user, group and other.

chmod 754 myfile: this means the user has read, write and execute permission; members of the same group have read and execute permission but no write permission; everyone else has only read permission.

4 stands for "read",
2 stands for "write",
1 stands for "execute", and
0 stands for "no permission."
So 7 is the combination of permissions 4+2+1 (read, write, and execute), 5 is 4+0+1 (read, no write, and execute), and 4 is 4+0+0 (read, no write, and no execute).

The numbers are sometimes hard to remember, so one can use letters instead: u, g, and o stand for "user", "group", and "other"; "r", "w", and "x" stand for "read", "write", and "execute", respectively.

chmod u+x myfile
chmod g+r myfile
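To see how the numeric and letter forms map onto each other, here is a minimal sketch on a scratch file. The `mode` helper and the temporary path are assumptions for illustration; it simply prints the first ten characters of `ls -l`, which hold the file type and the permission bits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the type and permission string of a file.
mode() {
    ls -l "$1" | cut -c1-10
}

workdir=$(mktemp -d)
f="$workdir/myfile"
touch "$f"

chmod 754 "$f"            # 7 = rwx (user), 5 = r-x (group), 4 = r-- (other)
mode "$f"                 # -rwxr-xr--

chmod u=rw,g=r,o= "$f"    # letter form of 640: set each class explicitly
mode "$f"                 # -rw-r-----

chmod u+x "$f"            # add execute for the user, leave the rest alone
mode "$f"                 # -rwxr-----
```

The `=` form replaces the permissions of a class, while `+` and `-` only add or remove the named bits.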

Do not give me excel files!

scary-excel-stories
convert xlsx to csv: xlsx2csv
csvkit
GNU datamash
tabtk Toolkit for processing TAB-delimited format from Heng Li, the author of Samtools, BWA and many others.
Another cross-platform, efficient, practical and pretty CSV/TSV toolkit in Golang
visidata A console spreadsheet tool for discovering and arranging data

How to name files

It is really important to name your files correctly! See a ppt by Jenny Bryan.

Three principles for (file) names:

Machine readable (do not put special characters and space in the name)
Human readable (Easy to figure out what the heck something is, based on its name, add slug)
Plays well with default ordering:

Put something numeric first
Use the ISO 8601 standard for dates (YYYY-MM-DD)
Left pad other numbers with zeros
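A short sketch of the ordering rules in practice. The naming scheme (date, assay, zero-padded sample number) and the values are hypothetical, chosen only to illustrate the three rules.

```shell
#!/usr/bin/env bash
# Hypothetical naming scheme: date_assay_sample-NNN.csv
assay="RNAseq"            # illustrative value
run_date="2016-04-01"     # ISO 8601; for today's date use: date +%Y-%m-%d

for i in 1 2 10; do
    # %03d left-pads the sample number with zeros: 001, 002, 010
    printf '%s_%s_sample-%03d.csv\n' "$run_date" "$assay" "$i"
done

# Why padding matters: a plain sort compares text, not numbers.
printf '%s\n' s1 s2 s10 | sort          # order: s1, s10, s2
printf '%s\n' s001 s002 s010 | sort     # order: s001, s002, s010
```

Because the date comes first and is written year-month-day, the default listing order is also the chronological order.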

Good naming of your files can help you to extract meta data from the file name

dirdf Create tidy data frames of file metadata from directory and file names.

> dir("examples/dataset_1/")
[1] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"
[2] "2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv"
[3] "2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv"
[4] "2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv"
[5] "2016-04-01_BRAFWTNEG_FFPEDNA-CRC-1-41_E12.csv"

> library("dirdf")
> dirdf("examples/dataset_1/", template="date_assay_experiment_well.ext")
        date     assay           experiment well ext                                          pathname
1 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A01 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv
2 2013-06-26 BRAFWTNEG Plasmid-Cellline-100  A02 csv 2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A02.csv
3 2014-02-26 BRAFWTNEG     FFPEDNA-CRC-1-41  D08 csv     2014-02-26_BRAFWTNEG_FFPEDNA-CRC-1-41_D08.csv
4 2014-03-05 BRAFWTNEG   FFPEDNA-CRC-REPEAT  H03 csv   2014-03-05_BRAFWTNEG_FFPEDNA-CRC-REPEAT_H03.csv
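The same idea works in plain shell when R is not at hand. This is a rough sketch, assuming the `date_assay_experiment_well.ext` template shown above and that no field contains an underscore; the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
name="2013-06-26_BRAFWTNEG_Plasmid-Cellline-100_A01.csv"

ext="${name##*.}"     # text after the last dot
stem="${name%.*}"     # file name without the extension

# Split the stem on "_" into the four fields of the template.
IFS=_ read -r file_date assay experiment well <<EOF
$stem
EOF

# Print the fields as one tab-separated row.
printf '%s\t%s\t%s\t%s\t%s\n' "$file_date" "$assay" "$experiment" "$well" "$ext"
```

Hyphens inside a field survive the split because only the underscore is used as the separator, which is why the two characters should be kept for different jobs when naming files.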

parallelization

Using these tools will greatly improve your working efficiency and get rid of most of your for loops.

xargs
GNU parallel. One of my posts here
gxargs by Brent Pedersen. Written in GO.
future: Unified Parallel and Distributed Processing in R for Everyone
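As a minimal sketch of replacing a for loop with xargs, the snippet below compresses a few scratch files two at a time. The sample names are hypothetical, and the `-0` and `-P` options are assumed to be available (they are in both GNU and BSD xargs, but are not part of POSIX).

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
for s in sampleA sampleB sampleC; do      # hypothetical sample names
    echo "data for $s" > "$workdir/$s.txt"
done

# Serial version, one file after another:
#   for f in "$workdir"/*.txt; do gzip "$f"; done

# Parallel version: one gzip per file (-n 1), at most two at once (-P 2).
# NUL-separated names (-0) keep paths with spaces intact.
printf '%s\0' "$workdir"/*.txt | xargs -0 -n 1 -P 2 gzip

ls "$workdir"
```

With GNU parallel installed, `parallel gzip ::: "$workdir"/*.txt` should do the same job, using one job per CPU core by default.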

Statistics

Essence of linear algebra
statistics for biologists A collection of Nature articles on statistics in biology.

packages for data wrangling, tidying and visualizing.

biobroom:Turn Bioconductor objects into tidy data frames
readr
tidyr
purrr tutorial by jenny bryan. functional programming in R.
janitor simple tools for data cleaning in R.
dplyr
replyr An R package for fluid use of dplyr.
Introduction of Parameterized dplyr expression using replyr
wrapr wraps R functions debugging and better standard evaluation. Let function. blog post wrapr: for sweet R code
csv fingerprint
ggplot2
ggplot2 tips
A List of ggplot2 extensions
nice ggplot themes
colourpicker A colour picker tool for Shiny and for selecting colours in plots (in R). R blogger post
ggforce: facet_zoom() to zoom in part of the figure! and many more.
ggedit – interactive ggplot aesthetic and theme editor.
trelliscopejs is an R package that brings faceted visualizations to life while plugging in to common analytical workflows like ggplot2 or the “tidyverse”.
Plotting background data for groups with ggplot2
Ordering categories within ggplot2 facets
plotly for R
rematch2 Tidy output from regular expression matches
Make waffle (square pie) charts in R
Bring the power of R to the command line: littler Rio A wrapper by Jeroen Janssens, the author of data science at the command line
htmlwidgets for R including d3heatmap for interactive heatmaps.
focus() on correlations of some variables with many others
Explore correlations in R with corrr
Unit test in R
sinaplot: an enhanced chart for simple and truthful representation of single observations over multiple classes. ggforce has geom_sina for the same purpose.
ComplexHeatmap
superheat Another heatmap package worth learning besides ComplexHeatmap. Not as flexible as ComplexHeatmap, but can be handy when the function you want has been implemented.
iheatmapr is an R package for building complex, interactive heatmaps using modular building blocks.
heatmap:gapmap
dendsort:Modular Leaf Ordering Methods for Dendrogram Nodes
dendextend
Interactive Heat Maps for R Using plotly
Multiple plots on a page
ggExtra
cowplot -- An add-on to the ggplot2 plotting package
ggplot2 - Easy way to mix multiple graphs on the same page - R software and data visualization
Extract Tables from PDFs
Alternative to venndiagram! upSetR
hierarchicalSets
Intervene is a tool for intersection and visualization of multiple gene or genomic region sets.
In-depth introduction to machine learning in 15 hours of expert videos
Data Analysis and Visualization Using R This is a course that combines video, HTML and interactive elements to teach the statistical programming language R.
These are the course notes for the Monash Bioinformatics Platform’s “R More” course
gitbook: Getting used to R, RStudio, and R Markdown
Efficient R programming
R for Data Science by Garrett Grolemund and Hadley Wickham

Handling big data in R

Lightning Fast Serialization of Data Frames for R faster than data.table, feather.
Rpub post: Handling large data sets in R

Write your own R package

handling arguments at the command line

Genomics-visualization-tools

There are many online web-based tools for visualization of (cancer) genomic data. I put my collections here. I use R for visualization. See a nice post using python by Radhouane Aniba: Genomic Data Visualization in Python

UCSC cancer genome browser It has many data sets built in, including TCGA data, and can be very handy for both bench scientists and bioinformaticians.
UCSC Xena. A new tool developed by the UCSC team as well. Potentially very useful, but it needs more tutorials to follow.
UCSC genome browser. One of the most famous genome browsers and my favorite. Every person studying genetics, genomics and molecular biology needs to know how to use it. Tutorials from OpenHelix.
Epiviz 3 is an interactive visualization tool for functional genomics data. It supports genome navigation like other genome browsers, but allows multiple visualizations of data within genomic regions using scatterplots, heatmaps and other user-supplied visualizations.
Mutation Annotation & Genome Interpretation TCGA: MAGI
GeneProteinViz (GPViz) is a versatile Java-based software for dynamic gene-centered visualization of genomic regions and/or variants.
ProteinPaint: Web Application for Visualizing Genomic Data The software developed for this project highlights critical attributes about the mutations, including the form of protein variant (e.g. the new amino acid as a result of missense mutation), the name of sample from which the mutation was identified, whether the mutation is somatic or germline,

Databases

DisGeNET is a discovery platform integrating information on gene-disease associations (GDAs) from several public data sources and the literature
Cancer3D is a database that unites information on somatic missense mutations from TCGA and CCLE, allowing users to explore two different cancer-related problems at the same time: drug sensitivity/biomarker identification and prediction of cancer drivers
clinical interpretations of variants in cancer
R Wrapper for DGIdb Drug-gene interaction database.
BioGrid Welcome to the Biological General Repository for Interaction Datasets
The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands
Public data and open source tools for multi-assay genomic investigation of disease
cancer cell metabolism genes
oncogenes and tumor suppressors biostar post and TSgene
DriverDB: A database for cancer driver gene/mutation
Interaction of genes: GENEMANIA
DATA DISCOVERY PLATFORM:Designed for researchers who use, share and collaborate on human genomic data
zenodo: research shared
dataMed biomedical and healthCAre Data Discovery Index Ecosystem.
repostive Discover a better way of searching for genomic data.
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. A copy of TCGA and TARGET data? Data Release Notes
OASIS genomics from Pfizer. processed data from TCGA, CCLE, GTEx.
TCGA alternative splicing
ISOexpresso: a web-based platform for isoform-level expression analysis in human cancer
omics database The Omics Discovery Index (OmicsDI) provides dataset discovery across a heterogeneous, distributed group of Transcriptomics, Genomics, Proteomics and Metabolomics data resources spanning eight repositories in three continents and six organisations, including both open and controlled access data resources. The resource provides a short description of every dataset: accession, description, sample/data protocols, biological evidences, publication, etc. Based on these metadata, OmicsDI provides extensive search capabilities, as well as identification of related datasets by metadata and data content where possible. In particular, OmicsDI identifies groups of related, multi-omics datasets across repositories by shared identifiers.
MAGI Mutation Annotation &Genome Interpretation for TCGA data.
How to successfully apply for access to dbGaP

Large data consortium data mining

AnnotationHub bioconductor package for TCGA and epigenome roadmap, ENCODE project.
TCGAbiolinks bioconductor package.
GenomicDataCommons bioc package to access GDC.
RTCGA bioconductor
f1000 workflow paper TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages
paper Data mining The Cancer Genome Atlas in the era of precision cancer medicine
CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms.
Ferret, a User-Friendly Java Tool to Extract Data from the 1000 Genomes Project
EGA:European Genome-phenome Archive
survival curves for TCGA data: a simple web tool
AACR Project GENIE data guide

Integrative analysis

High-dimensional genomic data bias correction and data integration using MANCIE correct batch effects for data from different sequencing methods. (RNAseq vs ChIPseq)

Tutorials

PH525x series - Biomedical Data Science. Learn R and bioconductor.
PCA, MDS, k-means, Hierarchical clustering and heatmap. I wrote it.
A tale of two heatmaps. I wrote it.
Heatmap demystified. I wrote it.
Cluster Analysis in R - Unsupervised machine learning very practical intro on STHDA website.
I wrote on PCA, and heatmaps on Rpub
A must read for clustering analysis of high-dimensional biological data: Avoiding common pitfalls when clustering biological data
How does gene expression clustering work? A must read for clustering.

See https://t.co/yxCb85ctL1: "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters" @mikelove @AndrewLBeam
— Rileen Sinha (@RileenSinha) August 25, 2016

paper: Outlier Preservation by Dimensionality Reduction Techniques


How to Use t-SNE Effectively
Rtsne R package for T-SNE
rtsne An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding) a bug was in rtsne: https://gist.github.com/mikelove/74bbf5c41010ae1dc94281cface90d32
PHATE dimensionality reduction method paper: http://biorxiv.org/content/early/2017/03/24/120378
Survival analysis of TCGA patients integrating gene expression (RNASeq) data
Tutorial: Machine Learning For Cancer Classification. It has four parts.
Learning bash scripting for beginners
Bedtools tutorial
Gemini explores your vcf, and slides.
GNU parallel
A Tutorial on Principal Component Analysis
StatQuest: PCA clearly explained
Computing Workflows for Biologists: A Roadmap
Best Practices for Scientific Computing
Google's R Style Guide

MOOC(Massive Open Online Courses)

git and version control

git intro by github
learn git branching
A Git Workflow Walkthrough Series
paper:A Quick Introduction to Version Control with Git and GitHub
paper:Ten Simple Rules for Taking Advantage of Git and GitHub
software carpentry git novice lesson
git best practise
git-hub cheatsheet
oh shit git! Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible. Git documentation has this chicken and egg problem where you can't search for how to get yourself out of a mess, unless you already know the name of the thing you need to know about in order to fix your problem.
How to undo (almost) anything with Git

Automate your workflow, open science and reproducible research

Automation wins in the long run.

STEP 6 is usually missing!

The pic was downloaded from http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method

Awesome youtube video for reproducible workflow
Reproducibility starts at home A series of blog posts by Jon Zelner.
The hard road to reproducibility commentary in Science Magazine.
Five selfish reasons to work reproducibly Genome Biology paper.
Make lessons from software carpentry
biomake GNU-Make-like utility for managing builds and complex workflows.
STAT545 Automating data analysis pipelines
Existing Workflow systems
Workflow management software for pipeline development in NGS
pipelines
biostar post:Job Manager to parallelize otherwise consecutive bash scripts
paper:A review of bioinformatic pipeline frameworks
initial steps toward reproducible research
JupyterLab: the next generation of the Jupyter Notebook
R notebook
BEAKER THE DATA SCIENTIST'S LABORATORY
nteract notebook (https://nteract.io/)
A video by Dr.Keith A. Baggerly from MD Anderson The Importance of Reproducible Research in High-Throughput Biology very interesting, and Keith is really a fun guy!
paper: Ten Simple Rules for Reproducible Computational Research
open-research
Good Enough Practices in Scientific Computing We present a set of computing tools and techniques that every researcher can and should adopt. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past six years, and from a variety of other guides. Unlike some other guides, our recommendations are aimed specifically at people who are new to research computing. Well worth reading!
A Quick Guide to Organizing Computational Biology Projects A must read for computational biologists!
Ten Simple Rules for Digital Data Storage

I am using snakemake and so far am very happy with it!

Have you ever had a problem reusing one of your own published figures because of the journal's copyright? Here is the solution, from @LorenaABarba:

As an early adopter of the Figshare repository, I came up with a strategy that serves both our open-science and our reproducibility goals, and also helps with this problem: for the main results in any new paper, we would share the data, plotting script and figure under a CC-BY license, by first uploading them to Figshare.

Survival curve

Survival plots have never been so informative: survminer package
posts for survival analysis:
Survival Analysis - 1 KM estimator
Survival Analysis - 2 Cox's proportional hazards model
Overall Survival Curves for TCGA and Tothill by RD Status
Survival analysis of TCGA patients integrating gene expression (RNASeq) data
survminer

Organize research for a group

slack:A messaging app for teams.
Ryver.
Trello lets you work more collaboratively and get more done.

Clustering

densityCut: an efficient and versatile topological approach for automatic clustering of biological data
Interactive visualisation and fast computation of the solution path: convex bi-clustering by Genevera Allen cvxbiclustr and the clustRviz package coming.

CRISPR related

CRISPR GENOME EDITING MADE EASY
CRISPR design from Japan
CRISPResso:Analysis of CRISPR-Cas9 genome editing outcomes from deep sequencing data
CRISPR-DO: A whole genome CRISPR designer and optimizer in human and mouse
CCTop - CRISPR/Cas9 target online predictor
DESKGEN
Genome-wide Unbiased Identifications of DSBs Evaluated by Sequencing (GUIDE-seq) is a novel method the Joung lab has developed to identify the off-target sites of CRISPR-Cas RNA-guided Nucleases
WTSI Genome Editing (WGE) is a website that provides tools to aid with genome editing of human and mouse genomes
