Commit 330ab1c: edits
rafalab committed Sep 20, 2015
1 parent 042f3c9 commit 330ab1c
Showing 3 changed files with 95 additions and 100 deletions.
55 changes: 25 additions & 30 deletions advinference/inference_for_highthroughput.Rmd
@@ -8,18 +8,13 @@ library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```

```{r,message=FALSE}
library(rafalib)
```


## Inference in Practice

Suppose we were given high-throughput gene expression data that was measured for several individuals in two populations. We are asked to report which genes have different average expression levels in the two populations. If instead of thousands of genes we were handed data from just one gene, we could simply apply the inference techniques that we have learned before. We could, for example, use a t-test or some other test. Here we review what changes when we consider high-throughput data.

#### p-values are random variables

An important concept to remember in order to understand the ideas presented in this chapter is that p-values are random variables. To see this, consider the example in which we define a p-value from a t-test with a large enough sample size to use the CLT approximation. Our p-value is then defined as the probability that a normally distributed random variable is larger, in absolute value, than the observed t-statistic, call it $Z$. So for a two-sided test the p-value is:

$$
p = 2 \{ 1 - \Phi(Z)\}
$$

@@ -30,20 +25,18 @@
In R we write:

```{r}
2*(1-pnorm(Z))
```

Now because $Z$ is a random variable and $\Phi$ is a deterministic
function, $p$ is also a random variable. We will create a Monte Carlo
simulation showing how the values of $p$ change. We use `femaleControlsPopulation.csv` from earlier chapters.

```{r,echo=FALSE}
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- "femaleControlsPopulation.csv"
if (!file.exists(filename)) download(url,destfile=filename)
```

We read in the data, and use `replicate` to repeatedly create p-values.

```{r pvalue_hist, fig.cap="P-value histogram for 10,000 tests in which null hypothesis is true."}
set.seed(1)
```

@@ -85,10 +78,12 @@

```{r}
g <- sampleInfo$group
g
```
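The idea behind such a Monte Carlo simulation can be sketched in a self-contained way: repeatedly draw two samples for which the null hypothesis is true, compute a t-test p-value for each draw, and observe that the p-values themselves vary from experiment to experiment. This sketch uses simulated normal data rather than the course dataset:

```r
# Monte Carlo: p-values are random variables.
# We simulate B experiments in which the null hypothesis is true
# (both groups drawn from the same distribution).
set.seed(1)
B <- 1000
pvals <- replicate(B, {
  control   <- rnorm(12)
  treatment <- rnorm(12)
  t.test(treatment, control)$p.value
})
# Under the null, p-values are approximately uniform on [0,1],
# so about 5% of them fall below 0.05:
mean(pvals < 0.05)
```

A histogram of `pvals` would look flat, which is exactly the behavior exploited later in the chapter.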

If we were interested in a particular gene, let's arbitrarily pick the one on the 25th row, we would simply compute a t-test. To compute a p-value we will use the t-distribution approximation, thus we need the population data to be approximately normal. We check this assumption with a qq-plot:

```{r qqplots_for_one_gene, fig.cap="Normal qq-plots for one gene. Left plot shows first group and right plot shows second group.",fig.width=10.5,fig.height=5.25}
e <- geneExpression[25,]
library(rafalib)
mypar(1,2)
qqnorm(e[g==1])
qqline(e[g==1])
qqnorm(e[g==0])
qqline(e[g==0])
```

The qq-plots show that the data is well approximated by the normal distribution. The t-test does not find this gene to be statistically significant:

```{r}
t.test(e[g==1],e[g==0])$p.value
```

To answer the question for each gene, we simply repeat the above for every gene. Here we will define our own function and use `apply`:

```{r}
myttest <- function(x) t.test(x[g==1],x[g==0],var.equal=TRUE)$p.value
pvals <- apply(geneExpression,1,myttest)
```

We can now see which genes have p-values less than, say, 0.05. For example, right away we see that...

```{r}
sum(pvals<0.05)
```

... genes had p-values less than 0.05.

However, as we will describe in more detail below, we have to be careful in interpreting this result because we have performed over 8,000 tests. If we performed the same procedure on random data, for which the null hypothesis is true for all features, we obtain the following results:

@@ -130,25 +125,25 @@

```{r}
m <- nrow(geneExpression)
n <- ncol(geneExpression)
randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- apply(randomData,1,myttest)
sum(nullpvals<0.05)
```

As we will explain later in the chapter, this is to be expected: 419 is roughly 0.05*8192 and we will describe the theory that tells us why this prediction works.
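Under independence, the number of false positives among $m$ true null hypotheses follows a Binomial$(m, \alpha)$ distribution, so the expected count and its spread are easy to compute. A quick sanity check of the 419 figure:

```r
# Expected number of false positives when all m null hypotheses are true
m <- 8192
alpha <- 0.05
m * alpha                        # expected count: 409.6
sqrt(m * alpha * (1 - alpha))    # binomial standard deviation, about 20
```

So 419 is well within one standard deviation of the expected count.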

#### Faster t-test implementation

Before we continue, we should point out that the above implementation is very inefficient. There are several faster implementations that perform t-test for high-throughput data. We make use of a function that is not available from CRAN but rather from the Bioconductor project.

To download and install packages from Bioconductor, we can use the `install_bioc` function in `rafalib`:


```{r,eval=FALSE}
install_bioc("genefilter")
```

Now we can show that this function is much faster than our code above and produces practically the same answer:

```{r,message=FALSE}
library(genefilter)
results <- rowttests(geneExpression,factor(g))
max(abs(pvals-results$p))
```
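The speed-up in `rowttests` comes from replacing the per-row loop with matrix operations. Here is a minimal sketch of the same idea in base R, assuming an expression matrix `X` and a 0/1 group vector `g`; the helper name `rowttest_fast` is ours, not part of any package:

```r
# Vectorized equal-variance two-sample t-test across all rows at once
rowttest_fast <- function(X, g) {
  x1 <- X[, g == 1, drop = FALSE]
  x0 <- X[, g == 0, drop = FALSE]
  n1 <- ncol(x1); n0 <- ncol(x0)
  m1 <- rowMeans(x1); m0 <- rowMeans(x0)
  # row-wise sample variances (recycling subtracts each row's mean)
  v1 <- rowSums((x1 - m1)^2) / (n1 - 1)
  v0 <- rowSums((x0 - m0)^2) / (n0 - 1)
  sp <- ((n1 - 1) * v1 + (n0 - 1) * v0) / (n1 + n0 - 2)  # pooled variance
  tstat <- (m1 - m0) / sqrt(sp * (1/n1 + 1/n0))
  2 * pt(-abs(tstat), df = n1 + n0 - 2)  # two-sided p-values
}

# Check against t.test on simulated data
set.seed(1)
X <- matrix(rnorm(1000 * 10), 1000, 10)
g <- rep(c(0, 1), each = 5)
p1 <- rowttest_fast(X, g)
p2 <- apply(X, 1, function(x) t.test(x[g==1], x[g==0], var.equal=TRUE)$p.value)
max(abs(p1 - p2))  # essentially zero
```

A single call to `rowMeans` and `rowSums` replaces thousands of calls to `t.test`, which is where the performance gain comes from.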



27 changes: 4 additions & 23 deletions advinference/intro_to_highthroughput_data.Rmd
@@ -15,13 +15,9 @@ opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::

## Introduction

High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines to data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA-Sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results to having thousands (and now millions) of measurements per sample to analyze. In this chapter we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences in groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data. In later chapters we will study the statistics behind clustering, machine learning, factor analysis and multi-level modeling.

Since there is a vast number of available public datasets, we use several gene expression examples. Nonetheless, the statistical techniques you will learn have also proven useful in other fields that make use of high-throughput technologies. Technologies such as microarrays, next generation sequencing, fMRI, and mass spectrometry all produce data to answer questions for which what we learn here will be indispensable.

<a name="threetables"></a>

@@ -35,11 +31,12 @@

```{r,eval=FALSE}
library(devtools)
install_github("genomicsclass/GSE5859Subset")
```

We will also be installing Bioconductor packages.
#### The three tables

Most of the data we use as examples in this class are created with high-throughput technologies. These technologies measure thousands of _features_. Examples of features are genes, single base locations of the genome, genomic regions, or image pixel intensities. Each specific measurement product is defined by a specific set of features. For example, a specific gene expression microarray product is defined by the set of genes that it measures.

A specific study will typically use one product to make measurements on several experimental units, such as individuals. The most common experimental unit will be the individual, but they can also be defined by other entities, for example different parts of a tumor. We often call the experimental units _samples_ following experimental jargon. It is important that these are not confused with samples as referred to in previous chapters, for example "random sample".

So a high-throughput experiment is usually defined by three tables: one with the high-throughput measurements and two tables with information about the columns and rows of this first table respectively.

@@ -48,7 +45,6 @@
Because a dataset is typically defined by a set of experimental units and a product defines a fixed set of features, the high-throughput measurements can be stored in a matrix with one row per feature and one column per sample.
Here is an example from a gene expression dataset:

```{r}
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables
dim(geneExpression)
```

@@ -82,18 +78,3 @@
The table includes an ID that permits us to connect the rows of this table with the rows of the expression matrix:

```{r}
head(match(geneAnnotation$PROBEID,rownames(geneExpression)))
```
The table also includes biological information about the features, namely chromosome location and the gene "name" used by biologists.
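`match(a, b)` returns, for each element of `a`, the position of its first match in `b`. A toy example (with made-up probe IDs) of how this links an annotation table to the rows of a measurement matrix:

```r
# Toy annotation order vs. measurement-matrix order
probes <- c("p3", "p1", "p2")    # order of rows in the annotation table
expr <- matrix(1:6, nrow = 3,
               dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))
idx <- match(probes, rownames(expr))
idx           # row of expr corresponding to each annotation row
expr[idx, ]   # measurement rows reordered to match the annotation table
```

The same one-liner scales to the thousands of probe IDs in `geneAnnotation`.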

#### Examples

Here we list some of the examples of data analysis questions we might be asked to answer with the dataset shown here:

* Inference: For which genes are the population averages different across ethnic groups?

* Machine learning: Build an algorithm that, given gene expression patterns, predicts ethnic group.

* Clustering: Can we discover subpopulations of individuals from the gene expression patterns? Or can we discover gene pathways based on which genes cluster together across individuals?

* Exploratory data analysis: Did some experiments fail? Are the assumptions needed to use standard statistical techniques met?

We will cover all these topics and more in the following sections.

