Update pseudobulk_DESeq2_scrnaseq.md

gaochenxuzi · May 20, 2020 · 108888b · 108888b
1 parent 327c1ca
commit 108888b
Showing 1 changed file with 63 additions and 7 deletions.
diff --git a/lessons/pseudobulk_DESeq2_scrnaseq.md b/lessons/pseudobulk_DESeq2_scrnaseq.md
@@ -186,7 +186,7 @@ counts(sce)[1:6, 1:6]
 ```
 
 <p align="center">
-<img src="../img/sc_DE_countdata.png" width="800">
+<img src="../img/sc_DE_countdata.png" width="300">
 </p>
 
 We see the raw counts data is a cell by gene sparse matrix with over 35,000 rows (genes) and nearly 30,000 columns (cells). 
@@ -203,7 +203,7 @@ head(colData(sce))
 ```
 
 <p align="center">
-<img src="../img/sc_DE_coldata.png" width="800">
+<img src="../img/sc_DE_coldata.png" width="300">
 </p>
 
 For every cell, we have information about the associated condition (ctrl or stim), sample ID, and cell type. We will use this information to perform the differential expression analysis between conditions for any particular cell type of interest.
@@ -251,7 +251,7 @@ ei
 ```
 
 <p align="center">
-<img src="../img/sc_DE_ei_metadata.png" width="800">
+<img src="../img/sc_DE_ei_metadata.png" width="300">
 </p>
 
 Prior to performing the aggregation of cells to the sample level, we want to make sure that the poor quality cells are removed if this step hasn't already been performed. Generally, we would recommend a more stringent and hands-on exploration of the quality control metrics and more nuanced picking of filtering thresholds, as detailed [here](https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html); however, to proceed more quickly to the differential expression analysis, we are only going to remove count outliers and low count genes using functions from the `scater` package as performed in the [Bioconductor tutorial](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/).
@@ -299,6 +299,10 @@ dim(pb)
 pb[1:6, 1:6]
 ```
 
+<p align="center">
+<img src="../img/sc_DE_pb_matrix.png" width="300">
+</p>
+
 The output of this aggregation is a sparse matrix, and when we take a quick look, we can see that it is a gene by cell type-sample matrix.
 
 For example, within B cells, sample `ctrl101` has 12 counts associated with gene NOC2L. 
@@ -333,12 +337,24 @@ class(pb)
 
 # Explore the different components of list
 str(pb)
+```
+
+<p align="center">
+<img src="../img/sc_DE_pb_list.png" width="300">
+</p>
+
+The counts per sample for each cluster can be checked:
 
+```r
 # Print out the table of cells in each cluster-sample group
 options(width = 100)
 table(sce$cluster_id, sce$sample_id)
 ```
 
+<p align="center">
+<img src="../img/sc_DE_sample_level_counts.png" width="300">
+</p>
+
 ## Differential gene expression with DESeq2
 
 **We will be using [DESeq2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) for the DE analysis, and the analysis steps with DESeq2 are shown in the flowchart below in green**. DESeq2 first normalizes the count data to account for differences in library sizes and RNA composition between samples. Then, we will use the normalized counts to make some plots for QC at the gene and sample level. The final step is to use the appropriate functions from the DESeq2 package to perform the differential expression analysis. We will go in-depth into each of these steps in the following lessons, but additional details and helpful suggestions regarding DESeq2 can be found in our materials detailing the workflow on [bulk RNA-seq data]() and the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
@@ -352,6 +368,8 @@ To perform the DE analysis, we need metadata for all samples, including cluster
 First, we will create a vector of sample names combined for each of the cell type clusters.
 
 ```r
+# Get sample names for each of the cell type clusters
+
 # prep. data.frame for plotting
 get_sample_ids <- function(x){
         pb[[x]] %>%
@@ -365,6 +383,7 @@ de_samples <- map(1:length(kids), get_sample_ids) %>%
 Then we can get the cluster IDs corresponding to each of the samples in the vector.
 
 ```r
+# Get cluster IDs for each of the samples
 
 samples_list <- map(1:length(kids), get_sample_ids)
 
@@ -380,6 +399,8 @@ de_cluster_ids <- map(1:length(kids), get_cluster_ids) %>%
 Finally, let's create a data frame with the cluster IDs and the corresponding sample IDs. We will merge together the condition information.
 
 ```r
+# Create a data frame with the sample IDs, cluster IDs and condition
+
 gg_df <- data.frame(cluster_id = de_cluster_ids,
                     sample_id = de_samples)
 
@@ -388,8 +409,14 @@ gg_df <- left_join(gg_df, ei[, c("sample_id", "group_id")])
 
 metadata <- gg_df %>%
         dplyr::select(cluster_id, sample_id, group_id) 
+
+metadata        
 ```
 
+<p align="center">
+<img src="../img/sc_DE_sample_level_metadata.png" width="300">
+</p>
+
 ### Subsetting dataset to cluster(s) of interest
 
 Now that we have the sample-level metadata, we can run the differential expression analysis with DESeq2. Oftentimes, we would like to perform the analysis on multiple different clusters, so we  can set up the workflow to run easily on any of our clusters.
@@ -484,9 +511,12 @@ rld <- rlog(dds, blind=TRUE)
 # Plot PCA
 
 DESeq2::plotPCA(rld, intgroup = "group_id")
-ggsave(paste0("results/", clusters[1], "_specific_PCAplot.png"))
 ```
 
+<p align="center">
+<img src="../img/sc_DE_pca.png" width="300">
+</p>
+
 We see a nice separation between our samples on PC1 by our condition of interest, which is great; this suggests that our condition of interest is the largest source of variation in our dataset. We also see some separation of the samples by PC2; however, it is uncertain what this might be due to since we lack additional metadata to explore.
 
 #### Hierarchical clustering
@@ -505,6 +535,10 @@ rld_cor <- cor(rld_mat)
 pheatmap(rld_cor, annotation = cluster_metadata[, c("group_id"), drop=F])
 ```
 
+<p align="center">
+<img src="../img/sc_DE_heatmap.png" width="300">
+</p>
+
 Now we determine whether we have any outliers that need removing or additional sources of variation that we might want to regress out in our design formula. Since we detected no outliers by PCA or hierarchical clustering, nor do we have any additional sources of variation to regress, we can proceed with running the differential expression analysis.
 
 ### Running DESeq2
@@ -528,6 +562,10 @@ We can check the fit of the model to our data by looking at the plot of dispersi
 plotDispEsts(dds)
 ```
 
+<p align="center">
+<img src="../img/sc_DE_dispersion.png" width="300">
+</p>
+
 The plot is encouraging, since we expect our dispersions to decrease with increasing mean and follow the line of best fit.
 
 ## Results
@@ -581,7 +619,11 @@ write.csv(res_tbl,
           quote = FALSE, 
           row.names = FALSE)
 ```
-
+
+<p align="center">
+<img src="../img/sc_DE_res_tbl.png" width="300">
+</p>
+
 ### Table of results for significant genes
 
 Next, we can filter our table for only the significant genes using a p-adjusted threshold of 0.05
@@ -604,6 +646,10 @@ write.csv(sig_res,
           row.names = FALSE)
 ```
 
+<p align="center">
+<img src="../img/sc_DE_sig_res.png" width="300">
+</p>
+
 ### Scatterplot of normalized expression of top 20 most significant genes
 
 Now that we have identified the significant genes, we can plot a scatterplot of the top 20 significant genes. This plot is a good check to make sure that we are interpreting our fold change values correctly, as well.
@@ -642,10 +688,12 @@ ggplot(gathered_top20_sig) +
         theme_bw() +
         theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
         theme(plot.title = element_text(hjust = 0.5))
-
-
 ```
 
+<p align="center">
+<img src="../img/sc_DE_top20.png" width="300">
+</p>
+
 ### Heatmap of all significant genes
 
 We can also explore the clustering of the significant genes using the heatmap.
@@ -672,6 +720,10 @@ pheatmap(sig_norm[ , 2:length(colnames(sig_norm))],
     height = 20)        
 ```
 
+<p align="center">
+<img src="../img/sc_DE_sig_genes_heatmap.png" width="300">
+</p>
+
 ### Volcano plot of results
 
 ```r
@@ -691,6 +743,10 @@ ggplot(res_table_thres) +
           axis.title = element_text(size = rel(1.25)))                    
 ```
 
+<p align="center">
+<img src="../img/sc_DE_volcano.png" width="300">
+</p>
+
 ***
 
 # Useful scripts for running analyses on many different cell type clusters using Wald test for pairwise comparisons or Likelihood Ratio Test for multi-group comparisons