updated annoOv

DominiqueLoyer · Oct 21, 2017 · 41af849 · 41af849
1 parent 1ef7401
commit 41af849
Showing 1 changed file with 172 additions and 41 deletions.
diff --git a/biocintro_5x/bioc1_annoOverview.Rmd b/biocintro_5x/bioc1_annoOverview.Rmd
@@ -15,7 +15,10 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::
 ```
 
 ```{r setup,echo=FALSE,results="hide"}
+suppressWarnings({
+suppressMessages({
 suppressPackageStartupMessages({
+library(BiocStyle)
 library(AnnotationDbi)
 library(BSgenome.Hsapiens.NCBI.GRCh38)
 library(BSgenome.Hsapiens.UCSC.hg19)
@@ -29,10 +32,17 @@ library(KEGGREST)
 library(GO.db)
 library(org.Hs.eg.db)
 library(DBI)
+library(ensembldb)
+library(rols)
+library(EnsDb.Hsapiens.v75)
+library(GSEABase)
+library(ph525x)
+})
+})
 })
 ```
 
-# Basic annotation resources and their discovery
+## Basic annotation resources and their discovery
 
 In this document we will review Bioconductor's facilities for
 handling and annotating genomic sequence.  We'll look at
@@ -44,29 +54,34 @@ Bioconductor is to make it easy to incorporate
 information on genome structure and function 
 into statistical analysis procedures.
 
-## A simple hierarchy of annotation concepts
+<a name="threelev"></a>
+
+### A hierarchy of annotation concepts
 
 Bioconductor includes many different types of genomic annotation.
 We can think of these annotation resources in a hierarchical structure.
 
-- At the base is the reference genomic sequence for an organism.
+- At the base is the _reference genomic sequence_ for an organism.
 This is always arranged into chromosomes, specified by linear
 sequences of nucleotides.
 
 - Above this is the organization of chromosomal sequence into
-regions of interest.  The most prominent regions of interest are
+_regions of interest_.  The most prominent regions of interest are
 genes, but other structures like SNPs or CpG sites are
 annotated as well.  Genes have internal structure,
 with parts that are transcribed and parts that are not,
 and "gene models" define the ways in which
 these structures are labeled and laid out in genomic coordinates.
 
-- Above this is the organization of genes or gene products into
-groups with shared structural or functional properties.  Examples
+- Above this is the organization of regions (most often
+genes or gene products) into
+_groups with shared structural or functional properties_.  Examples
 include pathways, groups of genes found together in cells, or
 identified as cooperating in biological processes.
 
-## Discovering available reference genomes
+<a names="findingref"></a>
+
+### Discovering available reference genomes
 
 Bioconductor's collection of annotation packages brings
 all elements of this hierarchy into a programmable environment.
@@ -82,7 +97,7 @@ length(ag)
 head(ag)
 ```
 
-## Reference build versions are important
+### Reference build versions are important
 
 The reference build for an organism is created de novo
 and then refined as algorithms and sequenced data improve.
@@ -108,7 +123,9 @@ this shortly.  Software for sequence comparison can check
 for compatible tags on the sequences
 being compared, and thereby help to ensure meaningful results.
 
-# A reference genomic sequence for H. sapiens
+<a name="hsap"></a>
+
+## A reference genomic sequence for H. sapiens
 
 The reference sequence for *Homo sapiens* is acquired by installing
 and attaching
@@ -130,10 +147,16 @@ We acquire a chromosome's sequence using the `$` operator.
 Hsapiens$chr17
 ```
 
-# The transcripts and genes for a reference sequence
+<a name="txUCSCnENSEMBLE"></a>
+
+## The transcripts and genes for a reference sequence
+
+### UCSC annotation
 
 The `TxDb` family of packages and data objects manages
-information on transcripts and gene models.
+information on transcripts and gene models.  We consider
+those derived from annotation tables prepared for the
+UCSC genome browser.
 
 ```{r gettx}
 library(TxDb.Hsapiens.UCSC.hg19.knownGene)
@@ -149,7 +172,62 @@ ghs = genes(txdb)
 ghs
 ```
 
-# Your data will be someone else's annotation: import/export
+Filtering is permitted, with suitable identifiers.
+Here we select all exons identified for two
+different genes, identified by their Entrez Gene ids:
+
+```{r doexo}
+exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
+                  filter=list(gene_id=c(100, 101)))
+```
+
+### ENSEMBL annotation
+
+From the [Ensembl home page](http://www.ensembl.org/index.html):
+"Ensembl creates, integrates and distributes reference datasets and 
+analysis tools that enable genomics".  This project is lodged
+at the [European Molecular Biology Lab](https://www.ebi.ac.uk/),
+which has been supportive of general interoperation of
+annotation resources with 
+Bioconductor.
+
+The `r Biocpkg("ensembldb")` package includes a vignette
+with the following commentary:
+
+The ensembldb package provides functions to create and use 
+transcript centric annotation databases/packages. The annotation for the 
+databases are 
+directly fetched from Ensembl 1 using their Perl 
+API. The functionality and data is similar to 
+that of the TxDb packages from the  GenomicFeatures 
+package, but, in addition to retrieve all gene/transcript models 
+and annotations from the database, the 
+ensembldb package provides also a filter framework allowing 
+to retrieve annotations for specific entries like 
+genes encoded on a chromosome region or transcript 
+models of lincRNA genes. From version 1.7 on, 
+EnsDb databases created by the ensembldb package contain 
+also protein annotation data 
+(see [Section 11](http://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#org35014ed) for 
+the database layout and an 
+overview of available attributes/columns). For more information 
+on the use of the protein annotations refer to the proteins vignette.
+
+```{r lkensss}
+library(ensembldb)
+library(EnsDb.Hsapiens.v75)
+names(listTables(EnsDb.Hsapiens.v75))
+```
+
+As an illustration:
+```{r lktxed}
+edb = EnsDb.Hsapiens.v75  # abbreviate
+txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"),
+                   columns = c("protein_id", "uniprot_id", "tx_biotype"))
+txs
+```
+
+## Your data will be someone else's annotation: import/export
 
 The ENCODE project is a good example of the idea that today's
 experiment is tomorrow's annotation.  You should think of your
@@ -160,25 +238,8 @@ answer it with an appropriate, properly executed protocol.
 ENCODE is noteworthy for linking the protocols to the data
 very explicitly.)
 
-What we have to watch out for is the idea that annotation is somehow
-permanently correct, isolated from the cacophony of research progress
-at the boundaries of knowledge.  We have seen that even the
-reference sequences of human chromosomes are subject to revision.
-We have treated, in our use of the ERBS package, results of experiments
-that we don't know too much about, as defining ER binding sites for
-potential biological interpretation.  The uncertainty, variable
-quality of peak identification, has not 
-been explicitly reckoned but it should be.
-
-Bioconductor has taken pains to acknowledge many facets of this situation.
-We maintain archives of prior versions of software
-and annotation so that past work can be checked
-or revised.  We update central annotation resources twice a year so that
-there is stability for ongoing work as well as access to new knowledge.
-And we have made it simple to import and to create representations
-of experimental and annotation data.
 
-As an example, return to the ER binding data.  These were published
+As an example, we consider estrogen receptor (ER) binding data, published
 by ENCODE as narrowPeak files.  This is ascii text at its base, so
 can be imported as a set of textual lines with no difficulty.
 If there is sufficient regularity to the record fields, 
@@ -207,8 +268,8 @@ of adding the genome metadata to protect against illegitimate
 combination with data recorded in an incompatible coordinate system.
 
 For communicating with other scientists or systems we have two
-main options.  First, we can save the GRanges as an "RData" object,
-easily transmitted to another R user for immediate use.  Second,
+main options.  We can save the GRanges as an "RData" object,
+easily transmitted to another R user for immediate use.  Or
 we can export in another standard format.  For example, if we
 are interested only in interval addresses and the binding scores,
 it is sufficient to save in "bed" format.
@@ -222,8 +283,28 @@ We have carried out a "round trip" of importing, modeling, and exporting
 experimental data that can be integrated with other data to advance
 biological understanding.
 
-To conclude this group, I mention a newly developed package
-called AnnotationHub that can be used to obtain GRanges or other
+What we have to watch out for is the idea that annotation is somehow
+permanently correct, isolated from the cacophony of research progress
+at the boundaries of knowledge.  We have seen that even the
+reference sequences of human chromosomes are subject to revision.
+We have treated, in our use of the ERBS package, results of experiments
+that we don't know too much about, as defining ER binding sites for
+potential biological interpretation.  The uncertainty, variable
+quality of peak identification, has not 
+been explicitly reckoned but it should be.
+
+Bioconductor has taken pains to acknowledge many facets of this situation.
+We maintain archives of prior versions of software
+and annotation so that past work can be checked
+or revised.  We update central annotation resources twice a year so that
+there is stability for ongoing work as well as access to new knowledge.
+And we have made it simple to import and to create representations
+of experimental and annotation data.
+
+## AnnotationHub
+
+
+The `r Biocpkg("AnnotationHub")` package can be used to obtain GRanges or other
 suitably designed containers for institutionally curated annotation.
 Here we will show that there are a number of experimental data
 objects related to the HepG2 cell line available through AnnotationHub.
@@ -235,9 +316,16 @@ ah
 query(ah, "HepG2")
 ```
 
-Advanced users will profit from getting acquainted with this package.
+The `query` method can take a vector of filtering strings.
+To limit response to annotation resources
+addressing the histone H4K5, simply add that tag:
+
+```{r doonek}
+query(ah, c("HepG2", "H4K5"))
+```
+
 
-# The NCBI Entrez Gene annotation maps
+## The NCBI Entrez Gene annotation maps
 
 Packages named org.*.eg.db collect information at the gene level
 with links to location, protein product identifiers, KEGG pathway and
@@ -253,9 +341,9 @@ head(select(org.Hs.eg.db, keys="ORMDL3", keytype="SYMBOL",
    columns="PMID"))
 ```
 
-# Resources for gene sets and pathways
+## Resources for gene sets and pathways
 
-## Gene Ontology
+### Gene Ontology
 
 [Gene Ontology](http://www.geneontology.org) (GO) is
 a widely used structured vocabulary that organizes terms relevant to
@@ -317,7 +405,7 @@ the two terms that are regarded as parents in this database scheme.
 
 The entire database schema can be viewed with `GO_dbschema()`.
 
-## KEGG: Kyoto Encyclopedia of Genes and Genomes
+### KEGG: Kyoto Encyclopedia of Genes and Genomes
 
 The KEGG annotation system has been available in Bioconductor 
 since the latter's inception, but licensing of the database
@@ -361,9 +449,52 @@ brpng = keggGet("hsa05212", "image")
 grid.raster(brpng)
 ```
 
+### Additional ontologies
+
+The
+`r Biocpkg("rols")` package interfaces to the EMBL-EBI
+[Ontology Lookup Service](https://www.ebi.ac.uk/ols/index).
+```{r dools, cache=TRUE}
+library(rols)
+oo = Ontologies()
+oo
+oo[[1]]
+```
+
+To control the amount of network traffic involved in
+query retrieval, there are stages of search.
+```{r doglioon,cache=TRUE}
+glis = OlsSearch("glioblastoma")
+glis
+res = olsSearch(glis)
+dim(res)
+resdf = as(res, "data.frame") # get content
+resdf[1:4,1:4]
+resdf[1,5]  # full description for one instance
+```
+
+
+The `r CRANpkg("ontologyIndex")` supports import of ontologies
+in the Open Biological Ontologies (OBO) format, and includes
+very efficient facilities for querying and visualizing
+ontological systems.  
+
+### General gene set management
+
+The `r Biocpkg("GSEABase")` package has excellent
+infrastructure for managing gene sets and collections
+thereof.  We illustrate by importing a glioblastoma-related
+gene set from [MSigDb](http://software.broadinstitute.org/gsea/msigdb/search.jsp).
+
+```{r dogsea}
+library(GSEABase)
+glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x"))
+glioG
+head(geneIds(glioG[[1]]))
+```
 
 
-# A unified, self-describing approach
+## A unified, self-describing approach for model organisms
 
 The OrganismDb packages simplify access to annotation.
 Queries that succeed against TxDb, and org.[Nn].eg.db
@@ -378,7 +509,7 @@ keytypes(Homo.sapiens)
 columns(Homo.sapiens)
 ```
 
-# Summary
+## Summary
 
 We have covered a lot of material, from the nucleotide to the
 pathway level.  The Annotation "view" at bioconductor.org