Skip to content

Commit

Permalink
Update pseudobulk_DESeq2_scrnaseq.md
Browse files Browse the repository at this point in the history
  • Loading branch information
marypiper authored May 20, 2020
1 parent 327c1ca commit 108888b
Showing 1 changed file with 63 additions and 7 deletions.
70 changes: 63 additions & 7 deletions lessons/pseudobulk_DESeq2_scrnaseq.md
Original file line number Diff line number Diff line change
Expand Up @@ -186,7 +186,7 @@ counts(sce)[1:6, 1:6]
```

<p align="center">
<img src="../img/sc_DE_countdata.png" width="800">
<img src="../img/sc_DE_countdata.png" width="300">
</p>

We see the raw counts data is a cell by gene sparse matrix with over 35,000 rows (genes) and nearly 30,000 columns (cells).
Expand All @@ -203,7 +203,7 @@ head(colData(sce))
```

<p align="center">
<img src="../img/sc_DE_coldata.png" width="800">
<img src="../img/sc_DE_coldata.png" width="300">
</p>

For every cell, we have information about the associated condition (ctrl or stim), sample ID, and cell type. We will use this information to perform the differential expression analysis between conditions for any particular cell type of interest.
Expand Down Expand Up @@ -251,7 +251,7 @@ ei
```

<p align="center">
<img src="../img/sc_DE_ei_metadata.png" width="800">
<img src="../img/sc_DE_ei_metadata.png" width="300">
</p>

Prior to performing the aggregation of cells to the sample level, we want to make sure that the poor quality cells are removed if this step hasn't already been performed. Generally, we would recommend a more stringent and hands-on exploration of the quality control metrics and more nuanced picking of filtering thresholds, as detailed [here](https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html); however, to proceed more quickly to the differential expression analysis, we are only going to remove count outliers and low count genes using functions from the `scater` package as performed in the [Bioconductor tutorial](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/).
Expand Down Expand Up @@ -299,6 +299,10 @@ dim(pb)
pb[1:6, 1:6]
```

<p align="center">
<img src="../img/sc_DE_pb_matrix.png" width="300">
</p>

The output of this aggregation is a sparse matrix, and when we take a quick look, we can see that it is a gene by cell type-sample matrix.

For example, within B cells, sample `ctrl101` has 12 counts associated with gene NOC2L.
Expand Down Expand Up @@ -333,12 +337,24 @@ class(pb)

# Explore the different components of list
str(pb)
```

<p align="center">
<img src="../img/sc_DE_pb_list.png" width="300">
</p>

The counts per sample for each cluster can be checked:

```r
# Print out the table of cells in each cluster-sample group
options(width = 100)
table(sce$cluster_id, sce$sample_id)
```

<p align="center">
<img src="../img/sc_DE_sample_level_counts.png" width="300">
</p>

## Differential gene expression with DESeq2

**We will be using [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) for the DE analysis, and the analysis steps with DESeq2 are shown in the flowchart below in green**. DESeq2 first normalizes the count data to account for differences in library sizes and RNA composition between samples. Then, we will use the normalized counts to make some plots for QC at the gene and sample level. The final step is to use the appropriate functions from the DESeq2 package to perform the differential expression analysis. We will go in-depth into each of these steps in the following lessons, but additional details and helpful suggestions regarding DESeq2 can be found in our materials detailing the workflow on [bulk RNA-seq data]() and the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
Expand All @@ -352,6 +368,8 @@ To perform the DE analysis, we need metadata for all samples, including cluster
First, we will create a vector of sample names combined for each of the cell type clusters.

```r
# Get sample names for each of the cell type clusters

# prep. data.frame for plotting
get_sample_ids <- function(x){
pb[[x]] %>%
Expand All @@ -365,6 +383,7 @@ de_samples <- map(1:length(kids), get_sample_ids) %>%
Then we can get the cluster IDs corresponding to each of the samples in the vector.

```r
# Get cluster IDs for each of the samples

samples_list <- map(1:length(kids), get_sample_ids)

Expand All @@ -380,6 +399,8 @@ de_cluster_ids <- map(1:length(kids), get_cluster_ids) %>%
Finally, let's create a data frame with the cluster IDs and the corresponding sample IDs. We will merge together the condition information.

```r
# Create a data frame with the sample IDs, cluster IDs and condition

gg_df <- data.frame(cluster_id = de_cluster_ids,
sample_id = de_samples)

Expand All @@ -388,8 +409,14 @@ gg_df <- left_join(gg_df, ei[, c("sample_id", "group_id")])

metadata <- gg_df %>%
dplyr::select(cluster_id, sample_id, group_id)

metadata
```

<p align="center">
<img src="../img/sc_DE_sample_level_metadata.png" width="300">
</p>

### Subsetting dataset to cluster(s) of interest

Now that we have the sample-level metadata, we can run the differential expression analysis with DESeq2. Oftentimes, we would like to perform the analysis on multiple different clusters, so we can set up the workflow to run easily on any of our clusters.
Expand Down Expand Up @@ -484,9 +511,12 @@ rld <- rlog(dds, blind=TRUE)
# Plot PCA

DESeq2::plotPCA(rld, intgroup = "group_id")
ggsave(paste0("results/", clusters[1], "_specific_PCAplot.png"))
```

<p align="center">
<img src="../img/sc_DE_pca.png" width="300">
</p>

We see a nice separation between our samples on PC1 by our condition of interest, which is great; this suggests that our condition of interest is the largest source of variation in our dataset. We also see some separation of the samples by PC2; however, it is uncertain what this might be due to since we lack additional metadata to explore.

#### Hierarchical clustering
Expand All @@ -505,6 +535,10 @@ rld_cor <- cor(rld_mat)
pheatmap(rld_cor, annotation = cluster_metadata[, c("group_id"), drop=F])
```

<p align="center">
<img src="../img/sc_DE_heatmap.png" width="300">
</p>

Now we determine whether we have any outliers that need removing or additional sources of variation that we might want to regress out in our design formula. Since we detected no outliers by PCA or hierarchical clustering, nor do we have any additional sources of variation to regress, we can proceed with running the differential expression analysis.

### Running DESeq2
Expand All @@ -528,6 +562,10 @@ We can check the fit of the model to our data by looking at the plot of dispersi
plotDispEsts(dds)
```

<p align="center">
<img src="../img/sc_DE_dispersion.png" width="300">
</p>

The plot is encouraging, since we expect our dispersions to decrease with increasing mean and follow the line of best fit.

## Results
Expand Down Expand Up @@ -581,7 +619,11 @@ write.csv(res_tbl,
quote = FALSE,
row.names = FALSE)
```


<p align="center">
<img src="../img/sc_DE_res_tbl.png" width="300">
</p>

### Table of results for significant genes

Next, we can filter our table for only the significant genes using a p-adjusted threshold of 0.05
Expand All @@ -604,6 +646,10 @@ write.csv(sig_res,
row.names = FALSE)
```

<p align="center">
<img src="../img/sc_DE_sig_res.png" width="300">
</p>

### Scatterplot of normalized expression of top 20 most significant genes

Now that we have identified the significant genes, we can plot a scatterplot of the top 20 significant genes. This plot is a good check to make sure that we are interpreting our fold change values correctly, as well.
Expand Down Expand Up @@ -642,10 +688,12 @@ ggplot(gathered_top20_sig) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5))


```

<p align="center">
<img src="../img/sc_DE_top20.png" width="300">
</p>

### Heatmap of all significant genes

We can also explore the clustering of the significant genes using the heatmap.
Expand All @@ -672,6 +720,10 @@ pheatmap(sig_norm[ , 2:length(colnames(sig_norm))],
height = 20)
```

<p align="center">
<img src="../img/sc_DE_sig_genes_heatmap.png" width="300">
</p>

### Volcano plot of results

```r
Expand All @@ -691,6 +743,10 @@ ggplot(res_table_thres) +
axis.title = element_text(size = rel(1.25)))
```

<p align="center">
<img src="../img/sc_DE_volcano.png" width="300">
</p>

***

# Useful scripts for running analyses on many different cell type clusters using Wald test for pairwise comparisons or Likelihood Ratio Test for multi-group comparisons
Expand Down

0 comments on commit 108888b

Please sign in to comment.