Skip to content

Commit

Permalink
updated annoOv
Browse files Browse the repository at this point in the history
  • Loading branch information
vjcitn committed Oct 21, 2017
1 parent 1ef7401 commit 41af849
Showing 1 changed file with 172 additions and 41 deletions.
213 changes: 172 additions & 41 deletions biocintro_5x/bioc1_annoOverview.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,10 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::
```

```{r setup,echo=FALSE,results="hide"}
suppressWarnings({
suppressMessages({
suppressPackageStartupMessages({
library(BiocStyle)
library(AnnotationDbi)
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(BSgenome.Hsapiens.UCSC.hg19)
Expand All @@ -29,10 +32,17 @@ library(KEGGREST)
library(GO.db)
library(org.Hs.eg.db)
library(DBI)
library(ensembldb)
library(rols)
library(EnsDb.Hsapiens.v75)
library(GSEABase)
library(ph525x)
})
})
})
```

# Basic annotation resources and their discovery
## Basic annotation resources and their discovery

In this document we will review Bioconductor's facilities for
handling and annotating genomic sequence. We'll look at
Expand All @@ -44,29 +54,34 @@ Bioconductor is to make it easy to incorporate
information on genome structure and function
into statistical analysis procedures.

## A simple hierarchy of annotation concepts
<a name="threelev"></a>

### A hierarchy of annotation concepts

Bioconductor includes many different types of genomic annotation.
We can think of these annotation resources in a hierarchical structure.

- At the base is the reference genomic sequence for an organism.
- At the base is the _reference genomic sequence_ for an organism.
This is always arranged into chromosomes, specified by linear
sequences of nucleotides.

- Above this is the organization of chromosomal sequence into
regions of interest. The most prominent regions of interest are
_regions of interest_. The most prominent regions of interest are
genes, but other structures like SNPs or CpG sites are
annotated as well. Genes have internal structure,
with parts that are transcribed and parts that are not,
and "gene models" define the ways in which
these structures are labeled and laid out in genomic coordinates.

- Above this is the organization of genes or gene products into
groups with shared structural or functional properties. Examples
- Above this is the organization of regions (most often
genes or gene products) into
_groups with shared structural or functional properties_. Examples
include pathways, groups of genes found together in cells, or
identified as cooperating in biological processes.

## Discovering available reference genomes
<a names="findingref"></a>

### Discovering available reference genomes

Bioconductor's collection of annotation packages brings
all elements of this hierarchy into a programmable environment.
Expand All @@ -82,7 +97,7 @@ length(ag)
head(ag)
```

## Reference build versions are important
### Reference build versions are important

The reference build for an organism is created de novo
and then refined as algorithms and sequenced data improve.
Expand All @@ -108,7 +123,9 @@ this shortly. Software for sequence comparison can check
for compatible tags on the sequences
being compared, and thereby help to ensure meaningful results.

# A reference genomic sequence for H. sapiens
<a name="hsap"></a>

## A reference genomic sequence for H. sapiens

The reference sequence for *Homo sapiens* is acquired by installing
and attaching
Expand All @@ -130,10 +147,16 @@ We acquire a chromosome's sequence using the `$` operator.
Hsapiens$chr17
```

# The transcripts and genes for a reference sequence
<a name="txUCSCnENSEMBLE"></a>

## The transcripts and genes for a reference sequence

### UCSC annotation

The `TxDb` family of packages and data objects manages
information on transcripts and gene models.
information on transcripts and gene models. We consider
those derived from annotation tables prepared for the
UCSC genome browser.

```{r gettx}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Expand All @@ -149,7 +172,62 @@ ghs = genes(txdb)
ghs
```

# Your data will be someone else's annotation: import/export
Filtering is permitted, with suitable identifiers.
Here we select all exons identified for two
different genes, identified by their Entrez Gene ids:

```{r doexo}
exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
filter=list(gene_id=c(100, 101)))
```

### ENSEMBL annotation

From the [Ensembl home page](http://www.ensembl.org/index.html):
"Ensembl creates, integrates and distributes reference datasets and
analysis tools that enable genomics". This project is lodged
at the [European Molecular Biology Lab](https://www.ebi.ac.uk/),
which has been supportive of general interoperation of
annotation resources with
Bioconductor.

The `r Biocpkg("ensembldb")` package includes a vignette
with the following commentary:

The ensembldb package provides functions to create and use
transcript centric annotation databases/packages. The annotation for the
databases are
directly fetched from Ensembl 1 using their Perl
API. The functionality and data is similar to
that of the TxDb packages from the GenomicFeatures
package, but, in addition to retrieve all gene/transcript models
and annotations from the database, the
ensembldb package provides also a filter framework allowing
to retrieve annotations for specific entries like
genes encoded on a chromosome region or transcript
models of lincRNA genes. From version 1.7 on,
EnsDb databases created by the ensembldb package contain
also protein annotation data
(see [Section 11](http://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#org35014ed) for
the database layout and an
overview of available attributes/columns). For more information
on the use of the protein annotations refer to the proteins vignette.

```{r lkensss}
library(ensembldb)
library(EnsDb.Hsapiens.v75)
names(listTables(EnsDb.Hsapiens.v75))
```

As an illustration:
```{r lktxed}
edb = EnsDb.Hsapiens.v75 # abbreviate
txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"),
columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs
```

## Your data will be someone else's annotation: import/export

The ENCODE project is a good example of the idea that today's
experiment is tomorrow's annotation. You should think of your
Expand All @@ -160,25 +238,8 @@ answer it with an appropriate, properly executed protocol.
ENCODE is noteworthy for linking the protocols to the data
very explicitly.)

What we have to watch out for is the idea that annotation is somehow
permanently correct, isolated from the cacophony of research progress
at the boundaries of knowledge. We have seen that even the
reference sequences of human chromosomes are subject to revision.
We have treated, in our use of the ERBS package, results of experiments
that we don't know too much about, as defining ER binding sites for
potential biological interpretation. The uncertainty, variable
quality of peak identification, has not
been explicitly reckoned but it should be.

Bioconductor has taken pains to acknowledge many facets of this situation.
We maintain archives of prior versions of software
and annotation so that past work can be checked
or revised. We update central annotation resources twice a year so that
there is stability for ongoing work as well as access to new knowledge.
And we have made it simple to import and to create representations
of experimental and annotation data.

As an example, return to the ER binding data. These were published
As an example, we consider estrogen receptor (ER) binding data, published
by ENCODE as narrowPeak files. This is ascii text at its base, so
can be imported as a set of textual lines with no difficulty.
If there is sufficient regularity to the record fields,
Expand Down Expand Up @@ -207,8 +268,8 @@ of adding the genome metadata to protect against illegitimate
combination with data recorded in an incompatible coordinate system.

For communicating with other scientists or systems we have two
main options. First, we can save the GRanges as an "RData" object,
easily transmitted to another R user for immediate use. Second,
main options. We can save the GRanges as an "RData" object,
easily transmitted to another R user for immediate use. Or
we can export in another standard format. For example, if we
are interested only in interval addresses and the binding scores,
it is sufficient to save in "bed" format.
Expand All @@ -222,8 +283,28 @@ We have carried out a "round trip" of importing, modeling, and exporting
experimental data that can be integrated with other data to advance
biological understanding.

To conclude this group, I mention a newly developed package
called AnnotationHub that can be used to obtain GRanges or other
What we have to watch out for is the idea that annotation is somehow
permanently correct, isolated from the cacophony of research progress
at the boundaries of knowledge. We have seen that even the
reference sequences of human chromosomes are subject to revision.
We have treated, in our use of the ERBS package, results of experiments
that we don't know too much about, as defining ER binding sites for
potential biological interpretation. The uncertainty, variable
quality of peak identification, has not
been explicitly reckoned but it should be.

Bioconductor has taken pains to acknowledge many facets of this situation.
We maintain archives of prior versions of software
and annotation so that past work can be checked
or revised. We update central annotation resources twice a year so that
there is stability for ongoing work as well as access to new knowledge.
And we have made it simple to import and to create representations
of experimental and annotation data.

## AnnotationHub


The `r Biocpkg("AnnotationHub")` package can be used to obtain GRanges or other
suitably designed containers for institutionally curated annotation.
Here we will show that there are a number of experimental data
objects related to the HepG2 cell line available through AnnotationHub.
Expand All @@ -235,9 +316,16 @@ ah
query(ah, "HepG2")
```

Advanced users will profit from getting acquainted with this package.
The `query` method can take a vector of filtering strings.
To limit response to annotation resources
addressing the histone H4K5, simply add that tag:

```{r doonek}
query(ah, c("HepG2", "H4K5"))
```


# The NCBI Entrez Gene annotation maps
## The NCBI Entrez Gene annotation maps

Packages named org.*.eg.db collect information at the gene level
with links to location, protein product identifiers, KEGG pathway and
Expand All @@ -253,9 +341,9 @@ head(select(org.Hs.eg.db, keys="ORMDL3", keytype="SYMBOL",
columns="PMID"))
```

# Resources for gene sets and pathways
## Resources for gene sets and pathways

## Gene Ontology
### Gene Ontology

[Gene Ontology](http://www.geneontology.org) (GO) is
a widely used structured vocabulary that organizes terms relevant to
Expand Down Expand Up @@ -317,7 +405,7 @@ the two terms that are regarded as parents in this database scheme.

The entire database schema can be viewed with `GO_dbschema()`.

## KEGG: Kyoto Encyclopedia of Genes and Genomes
### KEGG: Kyoto Encyclopedia of Genes and Genomes

The KEGG annotation system has been available in Bioconductor
since the latter's inception, but licensing of the database
Expand Down Expand Up @@ -361,9 +449,52 @@ brpng = keggGet("hsa05212", "image")
grid.raster(brpng)
```

### Additional ontologies

The
`r Biocpkg("rols")` package interfaces to the EMBL-EBI
[Ontology Lookup Service](https://www.ebi.ac.uk/ols/index).
```{r dools, cache=TRUE}
library(rols)
oo = Ontologies()
oo
oo[[1]]
```

To control the amount of network traffic involved in
query retrieval, there are stages of search.
```{r doglioon,cache=TRUE}
glis = OlsSearch("glioblastoma")
glis
res = olsSearch(glis)
dim(res)
resdf = as(res, "data.frame") # get content
resdf[1:4,1:4]
resdf[1,5] # full description for one instance
```


The `r CRANpkg("ontologyIndex")` supports import of ontologies
in the Open Biological Ontologies (OBO) format, and includes
very efficient facilities for querying and visualizing
ontological systems.

### General gene set management

The `r Biocpkg("GSEABase")` package has excellent
infrastructure for managing gene sets and collections
thereof. We illustrate by importing a glioblastoma-related
gene set from [MSigDb](http://software.broadinstitute.org/gsea/msigdb/search.jsp).

```{r dogsea}
library(GSEABase)
glioG = getGmt(system.file("gmt/glioSets.gmt", package="ph525x"))
glioG
head(geneIds(glioG[[1]]))
```


# A unified, self-describing approach
## A unified, self-describing approach for model organisms

The OrganismDb packages simplify access to annotation.
Queries that succeed against TxDb, and org.[Nn].eg.db
Expand All @@ -378,7 +509,7 @@ keytypes(Homo.sapiens)
columns(Homo.sapiens)
```

# Summary
## Summary

We have covered a lot of material, from the nucleotide to the
pathway level. The Annotation "view" at bioconductor.org
Expand Down

0 comments on commit 41af849

Please sign in to comment.