diff --git a/biocintro_5x/bioc1_annoOverview.Rmd b/biocintro_5x/bioc1_annoOverview.Rmd index 66392b2e..93fd25bf 100644 --- a/biocintro_5x/bioc1_annoOverview.Rmd +++ b/biocintro_5x/bioc1_annoOverview.Rmd @@ -15,7 +15,10 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr::: ``` ```{r setup,echo=FALSE,results="hide"} +suppressWarnings({ +suppressMessages({ suppressPackageStartupMessages({ +library(BiocStyle) library(AnnotationDbi) library(BSgenome.Hsapiens.NCBI.GRCh38) library(BSgenome.Hsapiens.UCSC.hg19) @@ -29,10 +32,17 @@ library(KEGGREST) library(GO.db) library(org.Hs.eg.db) library(DBI) +library(ensembldb) +library(rols) +library(EnsDb.Hsapiens.v75) +library(GSEABase) +library(ph525x) +}) +}) }) ``` -# Basic annotation resources and their discovery +## Basic annotation resources and their discovery In this document we will review Bioconductor's facilities for handling and annotating genomic sequence. We'll look at @@ -44,29 +54,34 @@ Bioconductor is to make it easy to incorporate information on genome structure and function into statistical analysis procedures. -## A simple hierarchy of annotation concepts + + +### A hierarchy of annotation concepts Bioconductor includes many different types of genomic annotation. We can think of these annotation resources in a hierarchical structure. -- At the base is the reference genomic sequence for an organism. +- At the base is the _reference genomic sequence_ for an organism. This is always arranged into chromosomes, specified by linear sequences of nucleotides. - Above this is the organization of chromosomal sequence into -regions of interest. The most prominent regions of interest are +_regions of interest_. The most prominent regions of interest are genes, but other structures like SNPs or CpG sites are annotated as well. Genes have internal structure, with parts that are transcribed and parts that are not, and "gene models" define the ways in which these structures are labeled and laid out in genomic coordinates. -- Above this is the organization of genes or gene products into -groups with shared structural or functional properties. Examples +- Above this is the organization of regions (most often +genes or gene products) into +_groups with shared structural or functional properties_. Examples include pathways, groups of genes found together in cells, or identified as cooperating in biological processes. -## Discovering available reference genomes + + +### Discovering available reference genomes Bioconductor's collection of annotation packages brings all elements of this hierarchy into a programmable environment. @@ -82,7 +97,7 @@ length(ag) head(ag) ``` -## Reference build versions are important +### Reference build versions are important The reference build for an organism is created de novo and then refined as algorithms and sequenced data improve. @@ -108,7 +123,9 @@ this shortly. Software for sequence comparison can check for compatible tags on the sequences being compared, and thereby help to ensure meaningful results. -# A reference genomic sequence for H. sapiens + + +## A reference genomic sequence for H. sapiens The reference sequence for *Homo sapiens* is acquired by installing and attaching @@ -130,10 +147,16 @@ We acquire a chromosome's sequence using the `$` operator. Hsapiens$chr17 ``` -# The transcripts and genes for a reference sequence + + +## The transcripts and genes for a reference sequence + +### UCSC annotation The `TxDb` family of packages and data objects manages -information on transcripts and gene models. +information on transcripts and gene models. We consider +those derived from annotation tables prepared for the +UCSC genome browser. ```{r gettx} library(TxDb.Hsapiens.UCSC.hg19.knownGene) @@ -149,7 +172,62 @@ ghs = genes(txdb) ghs ``` -# Your data will be someone else's annotation: import/export +Filtering is permitted, with suitable identifiers. +Here we select all exons identified for two +different genes, identified by their Entrez Gene ids: + +```{r doexo} +exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"), + filter=list(gene_id=c(100, 101))) +``` + +### ENSEMBL annotation + +From the [Ensembl home page](http://www.ensembl.org/index.html): +"Ensembl creates, integrates and distributes reference datasets and +analysis tools that enable genomics". This project is lodged +at the [European Molecular Biology Lab](https://www.ebi.ac.uk/), +which has been supportive of general interoperation of +annotation resources with +Bioconductor. + +The `r Biocpkg("ensembldb")` package includes a vignette +with the following commentary: + +The ensembldb package provides functions to create and use +transcript centric annotation databases/packages. The annotation for the +databases are +directly fetched from Ensembl 1 using their Perl +API. The functionality and data is similar to +that of the TxDb packages from the GenomicFeatures +package, but, in addition to retrieve all gene/transcript models +and annotations from the database, the +ensembldb package provides also a filter framework allowing +to retrieve annotations for specific entries like +genes encoded on a chromosome region or transcript +models of lincRNA genes. From version 1.7 on, +EnsDb databases created by the ensembldb package contain +also protein annotation data +(see [Section 11](http://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#org35014ed) for +the database layout and an +overview of available attributes/columns). For more information +on the use of the protein annotations refer to the proteins vignette. + +```{r lkensss} +library(ensembldb) +library(EnsDb.Hsapiens.v75) +names(listTables(EnsDb.Hsapiens.v75)) +``` + +As an illustration: +```{r lktxed} +edb = EnsDb.Hsapiens.v75 # abbreviate +txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"), + columns = c("protein_id", "uniprot_id", "tx_biotype")) +txs +``` + +## Your data will be someone else's annotation: import/export The ENCODE project is a good example of the idea that today's experiment is tomorrow's annotation. You should think of your @@ -160,25 +238,8 @@ answer it with an appropriate, properly executed protocol. ENCODE is noteworthy for linking the protocols to the data very explicitly.) -What we have to watch out for is the idea that annotation is somehow -permanently correct, isolated from the cacophony of research progress -at the boundaries of knowledge. We have seen that even the -reference sequences of human chromosomes are subject to revision. -We have treated, in our use of the ERBS package, results of experiments -that we don't know too much about, as defining ER binding sites for -potential biological interpretation. The uncertainty, variable -quality of peak identification, has not -been explicitly reckoned but it should be. - -Bioconductor has taken pains to acknowledge many facets of this situation. -We maintain archives of prior versions of software -and annotation so that past work can be checked -or revised. We update central annotation resources twice a year so that -there is stability for ongoing work as well as access to new knowledge. -And we have made it simple to import and to create representations -of experimental and annotation data. -As an example, return to the ER binding data. These were published +As an example, we consider estrogen receptor (ER) binding data, published by ENCODE as narrowPeak files. This is ascii text at its base, so can be imported as a set of textual lines with no difficulty. If there is sufficient regularity to the record fields, @@ -207,8 +268,8 @@ of adding the genome metadata to protect against illegitimate combination with data recorded in an incompatible coordinate system. For communicating with other scientists or systems we have two -main options. First, we can save the GRanges as an "RData" object, -easily transmitted to another R user for immediate use. Second, +main options. We can save the GRanges as an "RData" object, +easily transmitted to another R user for immediate use. Or we can export in another standard format. For example, if we are interested only in interval addresses and the binding scores, it is sufficient to save in "bed" format. @@ -222,8 +283,28 @@ We have carried out a "round trip" of importing, modeling, and exporting experimental data that can be integrated with other data to advance biological understanding. -To conclude this group, I mention a newly developed package -called AnnotationHub that can be used to obtain GRanges or other +What we have to watch out for is the idea that annotation is somehow +permanently correct, isolated from the cacophony of research progress +at the boundaries of knowledge. We have seen that even the +reference sequences of human chromosomes are subject to revision. +We have treated, in our use of the ERBS package, results of experiments +that we don't know too much about, as defining ER binding sites for +potential biological interpretation. The uncertainty, variable +quality of peak identification, has not +been explicitly reckoned but it should be. + +Bioconductor has taken pains to acknowledge many facets of this situation. +We maintain archives of prior versions of software +and annotation so that past work can be checked +or revised. We update central annotation resources twice a year so that +there is stability for ongoing work as well as access to new knowledge. +And we have made it simple to import and to create representations +of experimental and annotation data. + +## AnnotationHub + + +The `r Biocpkg("AnnotationHub")` package can be used to obtain GRanges or other suitably designed containers for institutionally curated annotation. Here we will show that there are a number of experimental data objects related to the HepG2 cell line available through AnnotationHub. @@ -235,9 +316,16 @@ ah query(ah, "HepG2") ``` -Advanced users will profit from getting acquainted with this package. +The `query` method can take a vector of filtering strings. +To limit response to annotation resources +addressing the histone H4K5, simply add that tag: + +```{r doonek} +query(ah, c("HepG2", "H4K5")) +``` + -# The NCBI Entrez Gene annotation maps +## The NCBI Entrez Gene annotation maps Packages named org.*.eg.db collect information at the gene level with links to location, protein product identifiers, KEGG pathway and @@ -253,9 +341,9 @@ head(select(org.Hs.eg.db, keys="ORMDL3", keytype="SYMBOL", columns="PMID")) ``` -# Resources for gene sets and pathways +## Resources for gene sets and pathways -## Gene Ontology +### Gene Ontology [Gene Ontology](http://www.geneontology.org) (GO) is a widely used structured vocabulary that organizes terms relevant to @@ -317,7 +405,7 @@ the two terms that are regarded as parents in this database scheme. The entire database schema can be viewed with `GO_dbschema()`. -## KEGG: Kyoto Encyclopedia of Genes and Genomes +### KEGG: Kyoto Encyclopedia of Genes and Genomes The KEGG annotation system has been available in Bioconductor since the latter's inception, but licensing of the database @@ -361,9 +449,52 @@ brpng = keggGet("hsa05212", "image") grid.raster(brpng) ``` +### Additional ontologies + +The +`r Biocpkg("rols")` package interfaces to the EMBL-EBI +[Ontology Lookup Service](https://www.ebi.ac.uk/ols/index). +```{r dools, cache=TRUE} +library(rols) +oo = Ontologies() +oo +oo[[1]] +``` + +To control the amount of network traffic involved in +query retrieval, there are stages of search. +```{r doglioon,cache=TRUE} +glis = OlsSearch("glioblastoma") +glis +res = olsSearch(glis) +dim(res) +resdf = as(res, "data.frame") # get content +resdf[1:4,1:4] +resdf[1,5] # full description for one instance +``` + + +The `r CRANpkg("ontologyIndex")` supports import of ontologies +in the Open Biological Ontologies (OBO) format, and includes +very efficient facilities for querying and visualizing +ontological systems. + +### General gene set management + +The `r Biocpkg("GSEABase")` package has excellent +infrastructure for managing gene sets and collections +thereof. We illustrate by importing a glioblastoma-related +gene set from [MSigDb](http://software.broadinstitute.org/gsea/msigdb/search.jsp). + +```{r dogsea} +library(GSEABase) +glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x")) +glioG +head(geneIds(glioG[[1]])) +``` -# A unified, self-describing approach +## A unified, self-describing approach for model organisms The OrganismDb packages simplify access to annotation. Queries that succeed against TxDb, and org.[Nn].eg.db @@ -378,7 +509,7 @@ keytypes(Homo.sapiens) columns(Homo.sapiens) ``` -# Summary +## Summary We have covered a lot of material, from the nucleotide to the pathway level. The Annotation "view" at bioconductor.org