-
Notifications
You must be signed in to change notification settings - Fork 19
/
Copy pathsaver-tutorial.Rmd
165 lines (114 loc) · 8.15 KB
/
saver-tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
title: "SAVER Tutorial"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: TRUE
vignette: >
%\VignetteIndexEntry{SAVER Tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r eval_saver, include = FALSE}
# Whether or not to evaluate saver. The generated vignette was run setting it
# to be TRUE but since running requires multiple cores, this was set to be
# FALSE for purposes of submission to CRAN.
eval.saver <- FALSE
```
```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = eval.saver)
```
SAVER (Single-cell Analyses Via Expression Recovery) is a method for denoising single-cell RNA sequencing data by borrowing information across genes and cells. Here, we demonstrate how to use the method. For more information, take a look at the [Github page](https://github.com/mohuangx/SAVER) and the [paper](https://doi.org/10.1038/s41592-018-0033-z).
## Installation
You can install the most recent updates of SAVER from github with:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mohuangx/SAVER")
```
To install the stable release of SAVER:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mohuangx/SAVER@*release")
```
## Getting Started
Once we have the package installed, we can load the package.
```{r, eval = TRUE}
library(SAVER)
packageVersion("SAVER")
```
In this tutorial, we will run SAVER on the [mouse cortex data](https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt) from [Zeisel (2015)](https://www.science.org/doi/abs/10.1126/science.aaa1934).
```{r}
data.path <- "../data/expression_mRNA_17-Aug-2014.txt"
# Need to remove first 10 rows of metadata
raw.data <- read.table(data.path, header = FALSE, skip = 11, row.names = 1,
check.names = FALSE)
cortex <- as.matrix(raw.data[, -1])
cellnames <- read.table(data.path, skip = 7, nrows = 1, row.names = 1,
stringsAsFactors = FALSE)
colnames(cortex) <- cellnames[-1]
dim(cortex)
```
### Preprocessing
The input to SAVER is a matrix of gene expression counts with genes along the rows and cells along the columns. SAVER was developed for use on UMI count data but it seems to work well with non-UMI TPM data. Standard quality control such as removing low quality cells and filtering out genes with low expression should be performed prior to running SAVER.
Here, the Zeisel dataset has already been preprocessed.
### Running SAVER
The main function to run SAVER is the `saver` function. Since SAVER is computationally intensive for large datasets, we recommend running it in parallel on a cluster either by creating your own parallel environment or by specifying `ncores`. We will run SAVER on the Zeisel dataset across 12 cores.
```{r}
cortex.saver <- saver(cortex, ncores = 12)
str(cortex.saver)
```
`cortex.saver` is a saver object with the following components:
* `estimate` gives the library size normalized SAVER estimates.
* `se` gives the standard error of the estimates
* `info` gives the more information about the run.
If you are only interested in the estimate, you can run `saver` setting `estimates.only = TRUE`. For example,
```{r, eval=FALSE}
cortex.saver <- saver(cortex, ncores = 12, estimates.only = TRUE)
```
### Using SAVER estimates
The SAVER estimates represent the library size normalized posterior means of the recovered gene expression. They can be used as input to many of the common downstream analysis methods such as [Seurat](https://satijalab.org/seurat/) and [Monocle](http://cole-trapnell-lab.github.io/monocle-release/).
## Additional Settings
### Normalization
By default, SAVER takes in an unnormalized count matrix and performs library size normalization during the denoising step. However, if your data is already normalized or normalization is not desired, you can set `size.factor = 1`. In addition, if you want to use custom cell size factors, you can set `size.factor` to be a vector of size factors.
### Predicting certain genes
SAVER has two main steps: the first is the prediction step and the second is a shrinkage step. The prediction step is the time-consuming part of SAVER. If you are only interested in a handful of genes, you can specify those genes and generate predictions for those genes only and choose to return the entire data matrix or for those genes only. For example,
```{r, eval=FALSE}
# Identify the indices of the genes of interest
genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19")
genes.ind <- which(rownames(cortex) %in% genes)
# Generate predictions for those genes and return entire dataset
cortex.saver.genes <- saver(cortex, pred.genes = genes.ind,
estimates.only = TRUE)
# Generate predictions for those genes and return only those genes
cortex.saver.genes.only <- saver(cortex, pred.genes = genes.ind,
pred.genes.only = TRUE, estimates.only = TRUE)
```
### Using other predictions
The main contribution of SAVER is the shrinkage step after generating the predictions. The shrinkage step allows the algorithm to adaptively evaluate the quality of the predictions and weigh the estimates either towards the predictions or the observed value. SAVER performs a Lasso Poisson regression for each gene using other genes as covariates to generate the predictions. However, if you want to use other predictions such as [MAGIC](https://github.com/KrishnaswamyLab/MAGIC) or [DCA](https://github.com/theislab/dca), you can input the results in matrix format into the `mu` parameter of the `saver` function.
### Creating your own parallel backend
By default, if you specify `ncores` without already having a parallel backend set up, `saver` will make a default cluster using the `makeCluster` function in the parallel package and register a parallel backend via the `registerDoParallel` function in the doParallel package. However, if you want to set up a more complicated backend, such as using MPI across nodes, you can do so and run `saver` without specifying the `ncores` option.
### Combining SAVER objects
Running SAVER on large datasets may run into memory issues, especially when parallelized across cores where each core has limited memory. One solution would be to register a parallel backend using SOCK or MPI to parallelize across nodes. Another solution would be to split the genes into multiple runs using `pred.genes` and `pred.genes.only`. For example, say we have a dataset `x` with 10,000 genes. We can split this up into 4 runs on different machines and combine using the `combine.saver` function. We recommend setting `do.fast = FALSE` when splitting the dataset.
```{r, eval=FALSE}
saver1 <- saver(x, pred.genes = 1:2500, pred.genes.only = TRUE,
do.fast = FALSE)
saver2 <- saver(x, pred.genes = 2501:5000, pred.genes.only = TRUE,
do.fast = FALSE)
saver3 <- saver(x, pred.genes = 5001:7500, pred.genes.only = TRUE,
do.fast = FALSE)
saver4 <- saver(x, pred.genes = 7501:10000, pred.genes.only = TRUE,
do.fast = FALSE)
saver.all <- combine.saver(list(saver1, saver2, saver3, saver4))
```
### Sampling example
Oftentimes for downstream analysis, you may want to sample from the posterior distributions to account for estimation uncertainty. This is similar to performing multiple imputation to obtain multiple imputed datasets. To sample from the posterior distributions, we use the function `sample.saver`, which takes in the `saver` result and outputs sampled datasets. For example,
```{r, eval=FALSE}
samp1 <- sample.saver(saver1, rep = 1, seed = 50)
samp5 <- sample.saver(saver1, rep = 5, seed = 50)
```
### Correlation example
Because the SAVER estimates contain uncertainty, correlations between genes and cells cannot be directly calculated using the SAVER estimates. To adjust for the uncertainty, we provide the functions `cor.genes` and `cor.cells` to construct gene-to-gene and cell-to-cell correlation matrices respectively for the SAVER output. These functions take in the `saver` result and outputs the gene-to-gene or cell-to-cell correlation matrix. For example,
```{r, eval=FALSE}
saver1.cor.gene <- cor.genes(saver1)
saver1.cor.cell <- cor.cells(saver1)
```