vignettes/multimodal_reference_mapping.Rmd

---
title: "Multimodal reference mapping"
output:
  html_document:
    theme: united
    df_print: kable
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---

```{r setup, include=FALSE}
all_times <- list()  # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      now <<- Sys.time()
    } else {
      res <- difftime(Sys.time(), now)
      all_times[[options$label]] <<- res
    }
  }
}))
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  time_it = TRUE
)
```

# Intro: Seurat v4 Reference Mapping

This vignette introduces the process of mapping query datasets to annotated references in Seurat. In this example, we map one of the first scRNA-seq datasets released by 10X Genomics of 2,700 PBMC to our [recently described CITE-seq reference of 162,000 PBMC measured with 228 antibodies](https://doi.org/10.1016/j.cell.2021.04.048). We chose this example to demonstrate how supervised analysis guided by a reference dataset can help to enumerate cell states that would be challenging to find with [unsupervised analysis](pbmc3k_tutorial.html). In a second example, we demonstrate how to serially map Human Cell Atlas datasets of human BMNC profiled from different individuals onto a consistent reference.

We have [previously demonstrated](integration_mapping.html) how to use reference-mapping approach to annotate cell labels in a query dataset . In Seurat v4, we have substantially improved the speed and memory requirements for integrative tasks including reference mapping, and also include new functionality to project query cells onto a previously computed UMAP visualization.

In this vignette, we demonstrate how to use a previously established reference to interpret an scRNA-seq query:

* Annotate each query cell based on a set of reference-defined cell states
* Project each query cell onto a previously computed UMAP visualization
* Impute the predicted levels of surface proteins that were measured in the CITE-seq reference

To run this vignette please install Seurat v4, available on CRAN. Additionally, you will need to install the `SeuratDisk` package.

```{r install, eval = FALSE}
install.packages("Seurat")
remotes::install_github("mojaveazure/seurat-disk")
```

```{r packages, cache=FALSE}
library(Seurat)
library(SeuratDisk)
library(ggplot2)
library(patchwork)
```

```{r, include = FALSE, cache=FALSE}
options(SeuratData.repo.use = "http://satijalab04.nygenome.org")
```

# Example 1: Mapping human peripheral blood cells 

## A Multimodal PBMC Reference Dataset

We load the reference (download [here](https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat)) from our recent [paper](https://doi.org/10.1016/j.cell.2021.04.048), and visualize the pre-computed UMAP. This reference is stored as an h5Seurat file, a format that enables on-disk storage of multimodal Seurat objects (more details on h5Seurat and `SeuratDisk` can be found [here](https://mojaveazure.github.io/seurat-disk/index.html)).

```{r pbmc.ref}
reference <- LoadH5Seurat("../data/pbmc_multimodal.h5seurat")
```

```{r ref.dimplot}
DimPlot(object = reference, reduction = "wnn.umap", group.by = "celltype.l2", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()
```

## Mapping

To demonstrate mapping to this multimodal reference, we will use a dataset of 2,700 PBMCs generated by 10x Genomics and available via `SeuratData`. 

```{r 3k.load}
library(SeuratData)
InstallData('pbmc3k')
```

The reference was normalized using `SCTransform()`, so we use the same approach to normalize the query here.

```{r 3k.preprocess, results="hide"}
pbmc3k <- SCTransform(pbmc3k, verbose = FALSE)
```

We then find anchors between reference and query. As described in the [manuscript](https://doi.org/10.1016/j.cell.2021.04.048), we used a precomputed supervised PCA (spca) transformation for this example. We recommend the use of supervised PCA for CITE-seq datasets, and demonstrate how to compute this transformation on the next tab of this vignette. However, you can also use a standard PCA transformation.

```{r transfer.anchors}
anchors <- FindTransferAnchors(
  reference = reference,
  query = pbmc3k,
  normalization.method = "SCT",
  reference.reduction = "spca",
  dims = 1:50
)
```

We then transfer cell type labels and protein data from the reference to the query. Additionally, we project the query data onto the UMAP structure of the reference. 

```{r transfer}
pbmc3k <- MapQuery(
  anchorset = anchors,
  query = pbmc3k,
  reference = reference,
  refdata = list(
    celltype.l1 = "celltype.l1",
    celltype.l2 = "celltype.l2",
    predicted_ADT = "ADT"
  ),
  reference.reduction = "spca", 
  reduction.model = "wnn.umap"
)
```

<details>
  <summary>**What is `MapQuery` doing?**</summary>

  `MapQuery()` is a wrapper around three functions: `TransferData()`, `IntegrateEmbeddings()`, and `ProjectUMAP()`. `TransferData()` is used to transfer cell type labels and impute the ADT values. `IntegrateEmbeddings()` and `ProjectUMAP()` are used to project the query data onto the UMAP structure of the reference. The equivalent code for doing this with the intermediate functions is below:


```{r, eval=FALSE}
pbmc3k <- TransferData(
  anchorset = anchors, 
  reference = reference,
  query = pbmc3k,
  refdata = list(
    celltype.l1 = "celltype.l1",
    celltype.l2 = "celltype.l2",
    predicted_ADT = "ADT")
)
pbmc3k <- IntegrateEmbeddings(
  anchorset = anchors,
  reference = reference,
  query = pbmc3k, 
  new.reduction.name = "ref.spca"
)
pbmc3k <- ProjectUMAP(
  query = pbmc3k, 
  query.reduction = "ref.spca", 
  reference = reference, 
  reference.reduction = "spca", 
  reduction.model = "wnn.umap"
)
```
</details>

## Explore the mapping results

We can now visualize the 2,700 query cells. They have been projected into a UMAP visualization defined by the reference, and each has received annotations at two levels of granularity (level 1, and level 2).

```{r 3k.refdimplots, fig.width=10}
p1 = DimPlot(pbmc3k, reduction = "ref.umap", group.by = "predicted.celltype.l1", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()
p2 = DimPlot(pbmc3k, reduction = "ref.umap", group.by = "predicted.celltype.l2", label = TRUE, label.size = 3 ,repel = TRUE) + NoLegend()
p1 + p2
```

The reference-mapped dataset helps us identify cell types that were previously blended in an [unsupervised analysis of the query dataset](pbmc3k_tutorial.html). Just a few examples include plasmacytoid dendritic cells (pDC), hematopoietic stem and progenitor cells (HSPC), regulatory T cells (Treg), CD8 Naive T cells, cells, CD56+ NK cells, memory, and naive B cells, and plasmablasts.

Each prediction is assigned a score between 0 and 1.

```{r 3k.featureplots1, fig.width = 10, fig.height =4}
FeaturePlot(pbmc3k, features = c("pDC", "CD16 Mono", "Treg"),  reduction = "ref.umap", cols = c("lightgrey", "darkred"), ncol = 3) & theme(plot.title = element_text(size = 10))
```

```{r save.img, include=FALSE}
library(ggplot2)
plot <- FeaturePlot(pbmc3k, features = "CD16 Mono",  reduction = "ref.umap", cols = c("lightgrey", "darkred"))  + ggtitle("CD16 Mono") + theme(plot.title = element_text(hjust = 0.5, size = 30)) + labs(color = "Prediction Score") +  xlab("UMAP 1") + ylab("UMAP 2") + 
  theme(axis.title = element_text(size = 18), legend.text = element_text(size = 18), legend.title = element_text(size = 25)) 
ggsave(filename = "../output/images/multimodal_reference_mapping.jpg", height = 7, width = 12, plot = plot, quality = 50)
```

We can verify our predictions by exploring the expression of canonical marker genes. For example, CLEC4C and LIRA4 have been [reported](https://pubmed.ncbi.nlm.nih.gov/30395816/) as markers of pDC identity, consistent with our predictions. Similarly, if we perform differential expression to identify markers of Tregs, we identify a set of [canonical markers](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4761514/) including RTKN2, CTLA4, FOXP3, and IL2RA.

```{r 3k.VlnPlots, fig.height=6, fig.width = 10, fig.height =5}
Idents(pbmc3k) <- 'predicted.celltype.l2'
VlnPlot(pbmc3k, features = c("CLEC4C", "LILRA4"), sort = TRUE) + NoLegend()

treg_markers <- FindMarkers(pbmc3k, ident.1 = "Treg", only.pos = TRUE, logfc.threshold = 0.1)
print(head(treg_markers))
```

Finally, we can visualize the imputed levels of surface protein, which were inferred based on the CITE-seq reference. 

```{r 3k.featureplots2, fig.width=10, fig.height =4}
DefaultAssay(pbmc3k) <- 'predicted_ADT'
# see a list of proteins: rownames(pbmc3k)
FeaturePlot(pbmc3k, features = c("CD3-1", "CD45RA", "IgD"), reduction = "ref.umap", cols = c("lightgrey", "darkgreen"), ncol = 3)
```

## Computing a new UMAP visualiztion 

In the previous examples, we visualize the query cells after mapping to the reference-derived UMAP. Keeping a consistent visualization can assist with the interpretation of new datasets. However, if there are cell states that are present in the query dataset that are not represented in the reference, they will project to the most similar cell in the reference. This is the expected behavior and functionality as established by the UMAP package, but can potentially mask the presence of new cell types in the query which may be of interest.  

In our [manuscript](https://doi.org/10.1016/j.cell.2021.04.048), we map a query dataset containing developing and differentiated neutrophils, which are not included in our reference. We find that computing a new UMAP ('de novo visualization') after merging the reference and query can help to identify these populations, as demonstrated in Supplementary Figure 8. In the 'de novo' visualization, unique cell states in the query remain separated. In this example, the 2,700 PBMC does not contain unique cell states, but we demonstrate how to compute this visualization below.

We emphasize that if users are attempting to map datasets where the underlying samples are not PBMC, or contain cell types that are not present in the reference, computing a 'de novo' visualization is an important step in interpreting their dataset.

```{r hiddendiet, include=FALSE}
reference <- DietSeurat(reference, counts = FALSE, dimreducs = "spca")
pbmc3k <- DietSeurat(pbmc3k, counts = FALSE, dimreducs = "ref.spca")
```

```{r denovoumap}
#merge reference and query
reference$id <- 'reference'
pbmc3k$id <- 'query'
refquery <- merge(reference, pbmc3k)
refquery[["spca"]] <- merge(reference[["spca"]], pbmc3k[["ref.spca"]])
refquery <- RunUMAP(refquery, reduction = 'spca', dims = 1:50)
DimPlot(refquery, group.by = 'id', shuffle = TRUE)
```

# Example 2: Mapping human bone marrow cells

## A Multimodal BMNC Reference Dataset

As a second example, we map a dataset of human bone marrow mononuclear (BMNC) cells from eight individual donors, produced by the Human Cell Atlas. As a reference, we use the CITE-seq reference of human BMNC that we analyzed using [weighted-nearest neighbor analysis (WNN)](weighted_nearest_neighbor_analysis.html). 

This vignette exhibits the same reference-mapping functionality as the PBMC example on the previous tab. In addition, we also demonstrate:

* How to construct a supervised PCA (sPCA) transformation
* How to serially map multiple datasets to the same reference
* Optimization steps to further enhance to speed of mapping

```{r bmref.seuratdata}
# Both datasets are available through SeuratData
library(SeuratData)
#load reference data
InstallData("bmcite")
bm <- LoadData(ds = "bmcite")
#load query data
InstallData('hcabm40k')
hcabm40k <- LoadData(ds = "hcabm40k")
```

The reference dataset contains a [WNN graph](weighted_nearest_neighbor_analysis.html), reflecting a weighted combination of the RNA and protein data in this CITE-seq experiment. 

We can compute a UMAP visualization based on this graph. We set `return.model = TRUE`, which will enable us to project query datasets onto this visualization.

```{r bm.refdimplot, fig.width=8}
bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap", 
              reduction.key = "wnnUMAP_", return.model = TRUE)
DimPlot(bm, group.by = "celltype.l2", reduction = "wnn.umap") 
```

## Computing an sPCA transformation

As described in our [manuscript](https://doi.org/10.1016/j.cell.2021.04.048), we first compute a 'supervised' PCA. This identifies the  transformation of the transcriptome data that best encapsulates the structure of the WNN graph. This allows a weighted combination of the protein and RNA measurements to 'supervise' the PCA, and highlight the most relevant sources of variation. After computing this transformation, we can project it onto a query dataset. We can also compute and project a PCA projection, but recommend the use of sPCA when working with multimodal references that have been constructed with WNN analysis.

The sPCA calculation is performed once, and then can be rapidly projected onto each query dataset.

```{r bm.spca}
bm <- ScaleData(bm, assay = 'RNA')
bm <- RunSPCA(bm, assay = 'RNA', graph = 'wsnn')
```

## Computing a cached neighbor index

Since we will be mapping multiple query samples to the same reference, we can cache particular steps that only involve the reference. This step is optional but will improve speed when mapping multiple samples.

We compute the first 50 neighbors in the sPCA space of the reference. We store this information in the `spca.annoy.neighbors` object within the reference Seurat object and also cache the annoy index data structure (via `cache.index = TRUE`).

```{r bm.nn, cache = FALSE}
bm <- FindNeighbors(
  object = bm,
  reduction = "spca",
  dims = 1:50,
  graph.name = "spca.annoy.neighbors", 
  k.param = 50,
  cache.index = TRUE,
  return.neighbor = TRUE,
  l2.norm = TRUE
)
```

<details>
  <summary>**How can I save and load a cached annoy index?**</summary>

If you want to save and load a cached index for a `Neighbor` object generated with `method = "annoy"` and `cache.index = TRUE`, use the `SaveAnnoyIndex()`/`LoadAnnoyIndex()` functions. Importantly, this index cannot be saved normally to an RDS or RDA file, so it will not persist correctly across R session restarts or `saveRDS`/`readRDS` for the Seurat object containing it. Instead, use `LoadAnnoyIndex()` to add the Annoy index to the `Neighbor` object every time R restarts or you load the reference Seurat object from RDS. The file created by `SaveAnnoyIndex()` can be distributed along with a reference Seurat object, and added to the `Neighbor` object in the reference.

```{r neighbor.demo}
bm[["spca.annoy.neighbors"]]
SaveAnnoyIndex(object = bm[["spca.annoy.neighbors"]], file = "../data/reftmp.idx")
bm[["spca.annoy.neighbors"]] <- LoadAnnoyIndex(object = bm[["spca.annoy.neighbors"]], file = "../data/reftmp.idx")
```
</details>

## Query dataset preprocessing

Here we will demonstrate mapping multiple donor bone marrow samples to the multimodal bone marrow reference. These query datasets are derived from the Human Cell Atlas (HCA) Immune Cell Atlas Bone marrow dataset and are available through SeuratData. This dataset is provided as a single merged object with 8 donors. We first split the data back into 8 separate Seurat objects, one for each original donor to map individually.

```{r bm40k.load}
library(dplyr)
library(SeuratData)
InstallData('hcabm40k')
hcabm40k.batches <- SplitObject(hcabm40k, split.by = "orig.ident")
```

We then normalize the query in the same manner as the reference. Here, the reference was normalized using log-normalization via `NormalizeData()`. If the reference had been normalized using `SCTransform()`, the query must be normalized with `SCTransform()` as well.

```{r 40k.norm}
hcabm40k.batches <- lapply(X = hcabm40k.batches, FUN = NormalizeData, verbose = FALSE)
```

## Mapping

We then find anchors between each donor query dataset and the multimodal reference. This command is optimized to minimize mapping time, by passing in a pre-computed set of reference neighbors, and turning off anchor filtration. 

```{r bm.anchors}
anchors <- list()
for (i in 1:length(hcabm40k.batches)) {
  anchors[[i]] <- FindTransferAnchors(
    reference = bm,
    query = hcabm40k.batches[[i]],
    k.filter = NA,
    reference.reduction = "spca", 
    reference.neighbors = "spca.annoy.neighbors", 
    dims = 1:50
  )
}
```

We then individually map each of the datasets.

```{r bm.map}
for (i in 1:length(hcabm40k.batches)) {
  hcabm40k.batches[[i]] <- MapQuery(
    anchorset = anchors[[i]], 
    query = hcabm40k.batches[[i]],
    reference = bm, 
    refdata = list(
      celltype = "celltype.l2", 
      predicted_ADT = "ADT"),
    reference.reduction = "spca",
    reduction.model = "wnn.umap"
  )
}
```

## Explore the mapping results

Now that mapping is complete, we can visualize the results for individual objects

```{r bm.umap.separate, fig.width=10}
p1 <- DimPlot(hcabm40k.batches[[1]], reduction = 'ref.umap', group.by = 'predicted.celltype', label.size = 3)
p2 <- DimPlot(hcabm40k.batches[[2]], reduction = 'ref.umap', group.by = 'predicted.celltype', label.size = 3)
p1 + p2 + plot_layout(guides = "collect")
```

We can also merge all the objects into one dataset. Note that they have all been integrated into a common space, defined by the reference. We can then visualize the results together.

```{r bm.umap.combine}
# Merge the batches 
hcabm40k <- merge(hcabm40k.batches[[1]], hcabm40k.batches[2:length(hcabm40k.batches)], merge.dr = "ref.umap")
DimPlot(hcabm40k, reduction = "ref.umap", group.by =  "predicted.celltype", label = TRUE, repel = TRUE, label.size = 3) + NoLegend()
```

We can visualize gene expression, cluster prediction scores, and (imputed) surface protein levels in the query cells:

```{r ftplot, fig.height = 10, fig.width=10}
p3 <- FeaturePlot(hcabm40k, features = c("rna_TRDC", "rna_MPO", "rna_AVP"), reduction = 'ref.umap', 
                  max.cutoff = 3, ncol = 3)

# cell type prediction scores
DefaultAssay(hcabm40k) <- 'prediction.score.celltype'
p4 <- FeaturePlot(hcabm40k, features = c("CD16 Mono", "HSC", "Prog-RBC"), ncol = 3, 
                  cols = c("lightgrey", "darkred"))

# imputed protein levels
DefaultAssay(hcabm40k) <- 'predicted_ADT'
p5 <- FeaturePlot(hcabm40k, features = c("CD45RA", "CD16", "CD161"), reduction = 'ref.umap',
                  min.cutoff = 'q10', max.cutoff = 'q99', cols = c("lightgrey", "darkgreen") ,
                  ncol = 3)
p3 / p4 / p5
```

```{r save.times, include = FALSE}
write.csv(x = t(as.data.frame(all_times)), file = "../output/timings/reference_mapping_times.csv")
```

<details>
  <summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>