forked from satijalab/seurat
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmultimodal_reference_mapping.Rmd
389 lines (302 loc) · 18.2 KB
/
multimodal_reference_mapping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
---
title: "Multimodal reference mapping"
output:
html_document:
theme: united
df_print: kable
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
```{r setup, include=FALSE}
all_times <- list() # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
now <- NULL
function(before, options) {
if (before) {
now <<- Sys.time()
} else {
res <- difftime(Sys.time(), now)
all_times[[options$label]] <<- res
}
}
}))
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
time_it = TRUE
)
```
# Intro: Seurat v4 Reference Mapping
This vignette introduces the process of mapping query datasets to annotated references in Seurat. In this example, we map one of the first scRNA-seq datasets released by 10X Genomics of 2,700 PBMC to our [recently described CITE-seq reference of 162,000 PBMC measured with 228 antibodies](https://doi.org/10.1016/j.cell.2021.04.048). We chose this example to demonstrate how supervised analysis guided by a reference dataset can help to enumerate cell states that would be challenging to find with [unsupervised analysis](pbmc3k_tutorial.html). In a second example, we demonstrate how to serially map Human Cell Atlas datasets of human BMNC profiled from different individuals onto a consistent reference.
We have [previously demonstrated](integration_mapping.html) how to use reference-mapping approach to annotate cell labels in a query dataset . In Seurat v4, we have substantially improved the speed and memory requirements for integrative tasks including reference mapping, and also include new functionality to project query cells onto a previously computed UMAP visualization.
In this vignette, we demonstrate how to use a previously established reference to interpret an scRNA-seq query:
* Annotate each query cell based on a set of reference-defined cell states
* Project each query cell onto a previously computed UMAP visualization
* Impute the predicted levels of surface proteins that were measured in the CITE-seq reference
To run this vignette please install Seurat v4, available on CRAN. Additionally, you will need to install the `SeuratDisk` package.
```{r install, eval = FALSE}
install.packages("Seurat")
remotes::install_github("mojaveazure/seurat-disk")
```
```{r packages, cache=FALSE}
library(Seurat)
library(SeuratDisk)
library(ggplot2)
library(patchwork)
```
```{r, include = FALSE, cache=FALSE}
options(SeuratData.repo.use = "http://satijalab04.nygenome.org")
```
# Example 1: Mapping human peripheral blood cells
## A Multimodal PBMC Reference Dataset
We load the reference (download [here](https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat)) from our recent [paper](https://doi.org/10.1016/j.cell.2021.04.048), and visualize the pre-computed UMAP. This reference is stored as an h5Seurat file, a format that enables on-disk storage of multimodal Seurat objects (more details on h5Seurat and `SeuratDisk` can be found [here](https://mojaveazure.github.io/seurat-disk/index.html)).
```{r pbmc.ref}
reference <- LoadH5Seurat("../data/pbmc_multimodal.h5seurat")
```
```{r ref.dimplot}
DimPlot(object = reference, reduction = "wnn.umap", group.by = "celltype.l2", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()
```
## Mapping
To demonstrate mapping to this multimodal reference, we will use a dataset of 2,700 PBMCs generated by 10x Genomics and available via `SeuratData`.
```{r 3k.load}
library(SeuratData)
InstallData('pbmc3k')
```
The reference was normalized using `SCTransform()`, so we use the same approach to normalize the query here.
```{r 3k.preprocess, results="hide"}
pbmc3k <- SCTransform(pbmc3k, verbose = FALSE)
```
We then find anchors between reference and query. As described in the [manuscript](https://doi.org/10.1016/j.cell.2021.04.048), we used a precomputed supervised PCA (spca) transformation for this example. We recommend the use of supervised PCA for CITE-seq datasets, and demonstrate how to compute this transformation on the next tab of this vignette. However, you can also use a standard PCA transformation.
```{r transfer.anchors}
anchors <- FindTransferAnchors(
reference = reference,
query = pbmc3k,
normalization.method = "SCT",
reference.reduction = "spca",
dims = 1:50
)
```
We then transfer cell type labels and protein data from the reference to the query. Additionally, we project the query data onto the UMAP structure of the reference.
```{r transfer}
pbmc3k <- MapQuery(
anchorset = anchors,
query = pbmc3k,
reference = reference,
refdata = list(
celltype.l1 = "celltype.l1",
celltype.l2 = "celltype.l2",
predicted_ADT = "ADT"
),
reference.reduction = "spca",
reduction.model = "wnn.umap"
)
```
<details>
<summary>**What is `MapQuery` doing?**</summary>
`MapQuery()` is a wrapper around three functions: `TransferData()`, `IntegrateEmbeddings()`, and `ProjectUMAP()`. `TransferData()` is used to transfer cell type labels and impute the ADT values. `IntegrateEmbeddings()` and `ProjectUMAP()` are used to project the query data onto the UMAP structure of the reference. The equivalent code for doing this with the intermediate functions is below:
```{r, eval=FALSE}
pbmc3k <- TransferData(
anchorset = anchors,
reference = reference,
query = pbmc3k,
refdata = list(
celltype.l1 = "celltype.l1",
celltype.l2 = "celltype.l2",
predicted_ADT = "ADT")
)
pbmc3k <- IntegrateEmbeddings(
anchorset = anchors,
reference = reference,
query = pbmc3k,
new.reduction.name = "ref.spca"
)
pbmc3k <- ProjectUMAP(
query = pbmc3k,
query.reduction = "ref.spca",
reference = reference,
reference.reduction = "spca",
reduction.model = "wnn.umap"
)
```
</details>
## Explore the mapping results
We can now visualize the 2,700 query cells. They have been projected into a UMAP visualization defined by the reference, and each has received annotations at two levels of granularity (level 1, and level 2).
```{r 3k.refdimplots, fig.width=10}
p1 = DimPlot(pbmc3k, reduction = "ref.umap", group.by = "predicted.celltype.l1", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()
p2 = DimPlot(pbmc3k, reduction = "ref.umap", group.by = "predicted.celltype.l2", label = TRUE, label.size = 3 ,repel = TRUE) + NoLegend()
p1 + p2
```
The reference-mapped dataset helps us identify cell types that were previously blended in an [unsupervised analysis of the query dataset](pbmc3k_tutorial.html). Just a few examples include plasmacytoid dendritic cells (pDC), hematopoietic stem and progenitor cells (HSPC), regulatory T cells (Treg), CD8 Naive T cells, cells, CD56+ NK cells, memory, and naive B cells, and plasmablasts.
Each prediction is assigned a score between 0 and 1.
```{r 3k.featureplots1, fig.width = 10, fig.height =4}
FeaturePlot(pbmc3k, features = c("pDC", "CD16 Mono", "Treg"), reduction = "ref.umap", cols = c("lightgrey", "darkred"), ncol = 3) & theme(plot.title = element_text(size = 10))
```
```{r save.img, include=FALSE}
library(ggplot2)
plot <- FeaturePlot(pbmc3k, features = "CD16 Mono", reduction = "ref.umap", cols = c("lightgrey", "darkred")) + ggtitle("CD16 Mono") + theme(plot.title = element_text(hjust = 0.5, size = 30)) + labs(color = "Prediction Score") + xlab("UMAP 1") + ylab("UMAP 2") +
theme(axis.title = element_text(size = 18), legend.text = element_text(size = 18), legend.title = element_text(size = 25))
ggsave(filename = "../output/images/multimodal_reference_mapping.jpg", height = 7, width = 12, plot = plot, quality = 50)
```
We can verify our predictions by exploring the expression of canonical marker genes. For example, CLEC4C and LIRA4 have been [reported](https://pubmed.ncbi.nlm.nih.gov/30395816/) as markers of pDC identity, consistent with our predictions. Similarly, if we perform differential expression to identify markers of Tregs, we identify a set of [canonical markers](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4761514/) including RTKN2, CTLA4, FOXP3, and IL2RA.
```{r 3k.VlnPlots, fig.height=6, fig.width = 10, fig.height =5}
Idents(pbmc3k) <- 'predicted.celltype.l2'
VlnPlot(pbmc3k, features = c("CLEC4C", "LILRA4"), sort = TRUE) + NoLegend()
treg_markers <- FindMarkers(pbmc3k, ident.1 = "Treg", only.pos = TRUE, logfc.threshold = 0.1)
print(head(treg_markers))
```
Finally, we can visualize the imputed levels of surface protein, which were inferred based on the CITE-seq reference.
```{r 3k.featureplots2, fig.width=10, fig.height =4}
DefaultAssay(pbmc3k) <- 'predicted_ADT'
# see a list of proteins: rownames(pbmc3k)
FeaturePlot(pbmc3k, features = c("CD3-1", "CD45RA", "IgD"), reduction = "ref.umap", cols = c("lightgrey", "darkgreen"), ncol = 3)
```
## Computing a new UMAP visualiztion
In the previous examples, we visualize the query cells after mapping to the reference-derived UMAP. Keeping a consistent visualization can assist with the interpretation of new datasets. However, if there are cell states that are present in the query dataset that are not represented in the reference, they will project to the most similar cell in the reference. This is the expected behavior and functionality as established by the UMAP package, but can potentially mask the presence of new cell types in the query which may be of interest.
In our [manuscript](https://doi.org/10.1016/j.cell.2021.04.048), we map a query dataset containing developing and differentiated neutrophils, which are not included in our reference. We find that computing a new UMAP ('de novo visualization') after merging the reference and query can help to identify these populations, as demonstrated in Supplementary Figure 8. In the 'de novo' visualization, unique cell states in the query remain separated. In this example, the 2,700 PBMC does not contain unique cell states, but we demonstrate how to compute this visualization below.
We emphasize that if users are attempting to map datasets where the underlying samples are not PBMC, or contain cell types that are not present in the reference, computing a 'de novo' visualization is an important step in interpreting their dataset.
```{r hiddendiet, include=FALSE}
reference <- DietSeurat(reference, counts = FALSE, dimreducs = "spca")
pbmc3k <- DietSeurat(pbmc3k, counts = FALSE, dimreducs = "ref.spca")
```
```{r denovoumap}
#merge reference and query
reference$id <- 'reference'
pbmc3k$id <- 'query'
refquery <- merge(reference, pbmc3k)
refquery[["spca"]] <- merge(reference[["spca"]], pbmc3k[["ref.spca"]])
refquery <- RunUMAP(refquery, reduction = 'spca', dims = 1:50)
DimPlot(refquery, group.by = 'id', shuffle = TRUE)
```
# Example 2: Mapping human bone marrow cells
## A Multimodal BMNC Reference Dataset
As a second example, we map a dataset of human bone marrow mononuclear (BMNC) cells from eight individual donors, produced by the Human Cell Atlas. As a reference, we use the CITE-seq reference of human BMNC that we analyzed using [weighted-nearest neighbor analysis (WNN)](weighted_nearest_neighbor_analysis.html).
This vignette exhibits the same reference-mapping functionality as the PBMC example on the previous tab. In addition, we also demonstrate:
* How to construct a supervised PCA (sPCA) transformation
* How to serially map multiple datasets to the same reference
* Optimization steps to further enhance to speed of mapping
```{r bmref.seuratdata}
# Both datasets are available through SeuratData
library(SeuratData)
#load reference data
InstallData("bmcite")
bm <- LoadData(ds = "bmcite")
#load query data
InstallData('hcabm40k')
hcabm40k <- LoadData(ds = "hcabm40k")
```
The reference dataset contains a [WNN graph](weighted_nearest_neighbor_analysis.html), reflecting a weighted combination of the RNA and protein data in this CITE-seq experiment.
We can compute a UMAP visualization based on this graph. We set `return.model = TRUE`, which will enable us to project query datasets onto this visualization.
```{r bm.refdimplot, fig.width=8}
bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap",
reduction.key = "wnnUMAP_", return.model = TRUE)
DimPlot(bm, group.by = "celltype.l2", reduction = "wnn.umap")
```
## Computing an sPCA transformation
As described in our [manuscript](https://doi.org/10.1016/j.cell.2021.04.048), we first compute a 'supervised' PCA. This identifies the transformation of the transcriptome data that best encapsulates the structure of the WNN graph. This allows a weighted combination of the protein and RNA measurements to 'supervise' the PCA, and highlight the most relevant sources of variation. After computing this transformation, we can project it onto a query dataset. We can also compute and project a PCA projection, but recommend the use of sPCA when working with multimodal references that have been constructed with WNN analysis.
The sPCA calculation is performed once, and then can be rapidly projected onto each query dataset.
```{r bm.spca}
bm <- ScaleData(bm, assay = 'RNA')
bm <- RunSPCA(bm, assay = 'RNA', graph = 'wsnn')
```
## Computing a cached neighbor index
Since we will be mapping multiple query samples to the same reference, we can cache particular steps that only involve the reference. This step is optional but will improve speed when mapping multiple samples.
We compute the first 50 neighbors in the sPCA space of the reference. We store this information in the `spca.annoy.neighbors` object within the reference Seurat object and also cache the annoy index data structure (via `cache.index = TRUE`).
```{r bm.nn, cache = FALSE}
bm <- FindNeighbors(
object = bm,
reduction = "spca",
dims = 1:50,
graph.name = "spca.annoy.neighbors",
k.param = 50,
cache.index = TRUE,
return.neighbor = TRUE,
l2.norm = TRUE
)
```
<details>
<summary>**How can I save and load a cached annoy index?**</summary>
If you want to save and load a cached index for a `Neighbor` object generated with `method = "annoy"` and `cache.index = TRUE`, use the `SaveAnnoyIndex()`/`LoadAnnoyIndex()` functions. Importantly, this index cannot be saved normally to an RDS or RDA file, so it will not persist correctly across R session restarts or `saveRDS`/`readRDS` for the Seurat object containing it. Instead, use `LoadAnnoyIndex()` to add the Annoy index to the `Neighbor` object every time R restarts or you load the reference Seurat object from RDS. The file created by `SaveAnnoyIndex()` can be distributed along with a reference Seurat object, and added to the `Neighbor` object in the reference.
```{r neighbor.demo}
bm[["spca.annoy.neighbors"]]
SaveAnnoyIndex(object = bm[["spca.annoy.neighbors"]], file = "../data/reftmp.idx")
bm[["spca.annoy.neighbors"]] <- LoadAnnoyIndex(object = bm[["spca.annoy.neighbors"]], file = "../data/reftmp.idx")
```
</details>
## Query dataset preprocessing
Here we will demonstrate mapping multiple donor bone marrow samples to the multimodal bone marrow reference. These query datasets are derived from the Human Cell Atlas (HCA) Immune Cell Atlas Bone marrow dataset and are available through SeuratData. This dataset is provided as a single merged object with 8 donors. We first split the data back into 8 separate Seurat objects, one for each original donor to map individually.
```{r bm40k.load}
library(dplyr)
library(SeuratData)
InstallData('hcabm40k')
hcabm40k.batches <- SplitObject(hcabm40k, split.by = "orig.ident")
```
We then normalize the query in the same manner as the reference. Here, the reference was normalized using log-normalization via `NormalizeData()`. If the reference had been normalized using `SCTransform()`, the query must be normalized with `SCTransform()` as well.
```{r 40k.norm}
hcabm40k.batches <- lapply(X = hcabm40k.batches, FUN = NormalizeData, verbose = FALSE)
```
## Mapping
We then find anchors between each donor query dataset and the multimodal reference. This command is optimized to minimize mapping time, by passing in a pre-computed set of reference neighbors, and turning off anchor filtration.
```{r bm.anchors}
anchors <- list()
for (i in 1:length(hcabm40k.batches)) {
anchors[[i]] <- FindTransferAnchors(
reference = bm,
query = hcabm40k.batches[[i]],
k.filter = NA,
reference.reduction = "spca",
reference.neighbors = "spca.annoy.neighbors",
dims = 1:50
)
}
```
We then individually map each of the datasets.
```{r bm.map}
for (i in 1:length(hcabm40k.batches)) {
hcabm40k.batches[[i]] <- MapQuery(
anchorset = anchors[[i]],
query = hcabm40k.batches[[i]],
reference = bm,
refdata = list(
celltype = "celltype.l2",
predicted_ADT = "ADT"),
reference.reduction = "spca",
reduction.model = "wnn.umap"
)
}
```
## Explore the mapping results
Now that mapping is complete, we can visualize the results for individual objects
```{r bm.umap.separate, fig.width=10}
p1 <- DimPlot(hcabm40k.batches[[1]], reduction = 'ref.umap', group.by = 'predicted.celltype', label.size = 3)
p2 <- DimPlot(hcabm40k.batches[[2]], reduction = 'ref.umap', group.by = 'predicted.celltype', label.size = 3)
p1 + p2 + plot_layout(guides = "collect")
```
We can also merge all the objects into one dataset. Note that they have all been integrated into a common space, defined by the reference. We can then visualize the results together.
```{r bm.umap.combine}
# Merge the batches
hcabm40k <- merge(hcabm40k.batches[[1]], hcabm40k.batches[2:length(hcabm40k.batches)], merge.dr = "ref.umap")
DimPlot(hcabm40k, reduction = "ref.umap", group.by = "predicted.celltype", label = TRUE, repel = TRUE, label.size = 3) + NoLegend()
```
We can visualize gene expression, cluster prediction scores, and (imputed) surface protein levels in the query cells:
```{r ftplot, fig.height = 10, fig.width=10}
p3 <- FeaturePlot(hcabm40k, features = c("rna_TRDC", "rna_MPO", "rna_AVP"), reduction = 'ref.umap',
max.cutoff = 3, ncol = 3)
# cell type prediction scores
DefaultAssay(hcabm40k) <- 'prediction.score.celltype'
p4 <- FeaturePlot(hcabm40k, features = c("CD16 Mono", "HSC", "Prog-RBC"), ncol = 3,
cols = c("lightgrey", "darkred"))
# imputed protein levels
DefaultAssay(hcabm40k) <- 'predicted_ADT'
p5 <- FeaturePlot(hcabm40k, features = c("CD45RA", "CD16", "CD161"), reduction = 'ref.umap',
min.cutoff = 'q10', max.cutoff = 'q99', cols = c("lightgrey", "darkgreen") ,
ncol = 3)
p3 / p4 / p5
```
```{r save.times, include = FALSE}
write.csv(x = t(as.data.frame(all_times)), file = "../output/timings/reference_mapping_times.csv")
```
<details>
<summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>