forked from microbiome/OMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path24_biclustering.Rmd
382 lines (299 loc) · 12 KB
/
24_biclustering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
# Biclustering {#biclustering}
```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```
Biclustering methods cluster rows and columns simultaneously in order
to find subsets of correlated features/samples.
There are multiple biclustering packages available, In this example, we use following:
- biclust [@Kaiser2021]
- cobiclust [@Aubert2021]
_cobiclust_ is especially developed for microbiome data whereas _biclust_ is more
general method. In this section, we show three different cases and example
solutions to apply biclustering to them.
1. Taxa vs samples
2. Taxa vs biomolecule/biomarker
3. Taxa vs taxa
Biclusters can be visualized using heatmap or boxplot, for instance. For checking purposes,
also scatter plot might be valid choice.
Check more ideas for heatmaps from chapters \@ref(viz-chapter) and \@ref(microbiome-community.
## Taxa vs samples
When you have microbial abundance matrices, we suggest to use _cobiclust_ which is
designed for microbial data.
Load example data
```{r load-pkg-data}
library(mia)
data(HintikkaXOData, package="mia")
mae <- HintikkaXOData
```
Only the most prevalent taxa are included in analysis.
```{r cobiclust_1}
# Subset data in the first experiment
mae[[1]] <- subsetByPrevalentTaxa(mae[[1]], rank = "Genus", prevalence = 0.2, detection = 0.001)
# clr-transform in the first experiment
mae[[1]] <- transformCounts(mae[[1]], method = "relabundance")
mae[[1]] <- transformCounts(mae[[1]], "relabundance", method = "rclr")
```
_cobiclust_ takes counts table as an input and gives _cobiclust_ object as an output.
It includes clusters for taxa and samples.
```{r cobiclust_2}
if(!require(cobiclust)){
install.packages("cobiclust")
library(cobiclust)
}
# Do clustering; use counts table´
clusters <- cobiclust(assay(mae[[1]], "counts"))
# Get clusters
row_clusters <- clusters$classification$rowclass
col_clusters <- clusters$classification$colclass
# Add clusters to rowdata and coldata
rowData(mae[[1]])$clusters <- factor(row_clusters)
colData(mae[[1]])$clusters <- factor(col_clusters)
# Order data based on clusters
mae[[1]] <- mae[[1]][order(rowData(mae[[1]])$clusters), order(colData(mae[[1]])$clusters)]
# Print clusters
clusters$classification
```
Next we can plot clusters. Commonly used plot is heatmap with annotations.
```{r cobiclust_3, fig.width=14, fig.height=12}
if(!require(pheatmap)){
install.packages("pheatmap")
library(pheatmap)
}
# z-transform for heatmap
mae[[1]] <- transformCounts(mae[[1]], assay_name = "rclr",
MARGIN = "features",
method = "z", name = "clr_z")
# Create annotations. When column names are equal, they should share levels.
# Here samples include 3 clusters, and taxa 2. That is why we have to make
# column names unique.
annotation_col <- data.frame(colData(mae[[1]])[, "clusters", drop = F])
colnames(annotation_col) <- "col_clusters"
annotation_row <- data.frame(rowData(mae[[1]])[, "clusters", drop = F])
colnames(annotation_row) <- "row_clusters"
# Create a heatmap
pheatmap(assay(mae[[1]], "clr_z"), cluster_rows = F, cluster_cols = F,
annotation_col = annotation_col,
annotation_row = annotation_row)
```
Boxplot is commonly used to summarize the results:
```{r cobiclust_4}
if(!require(ggplot2)){
install.packages("ggplot2")
library(ggplot2)
}
if(!require(patchwork)){
install.packages("patchwork")
library(patchwork)
}
# ggplot requires data in melted format
melt_assay <- meltAssay(mae[[1]], assay_name = "rclr", add_col_data = T, add_row_data = T)
# patchwork two plots side-by-side
p1 <- ggplot(melt_assay) +
geom_boxplot(aes(x = clusters.x, y = rclr)) +
labs(x = "Taxa clusters")
p2 <- ggplot(melt_assay) +
geom_boxplot(aes(x = clusters.y, y = rclr)) +
labs(x = "Sample clusters")
p1 + p2
```
## Taxa vs biomolecules
Here, we analyze cross-correlation between taxa and metabolites. This is a case, where
we use _biclust_ method which is suitable for numeric matrices in general.
```{r biclust_1}
# Samples must be in equal order
# (Only 1st experiment was ordered in cobiclust step leading to unequal order)
mae[[1]] <- mae[[1]][ , colnames(mae[[2]]) ]
# Make rownames unique since it is require by other steps
rownames(mae[[1]]) <- make.unique(rownames(mae[[1]]))
# Calculate correlations
corr <- getExperimentCrossCorrelation(mae, 1, 2,
assay_name1 = "rclr",
assay_name2 = "nmr",
mode = "matrix",
cor_threshold = 0.2)
```
_biclust_ takes matrix as an input and returns _biclust_ object.
```{r biclust_2}
# Load package
if(!require(biclust)){
install.packages("biclust")
library(biclust)
}
# Set seed for reproducibility
set.seed(3973)
# Find biclusters
bc <- biclust(corr, method=BCPlaid(), fit.model = y ~ m,
background = TRUE, shuffle = 100, back.fit = 0, max.layers = 10,
iter.startup = 10, iter.layer = 100, verbose = FALSE)
bc
```
The object includes cluster information. However compared to _cobiclust_,
_biclust_ object includes only information about clusters that were found, not general cluster.
Meaning that if one cluster size of 5 features was found out of 20 features,
those 15 features do not belong to any cluster. That is why we have to create an
additional cluster for features/samples that are not assigned into any cluster.
```{r biclust_3}
# Functions for obtaining biclust information
# Get clusters for rows and columns
.get_biclusters_from_biclust <- function(bc, assay){
# Get cluster information for columns and rows
bc_columns <- t(bc@NumberxCol)
bc_columns <- data.frame(bc_columns)
bc_rows <- bc@RowxNumber
bc_rows <- data.frame(bc_rows)
# Get data into right format
bc_columns <- .manipulate_bc_data(bc_columns, assay, "col")
bc_rows <- .manipulate_bc_data(bc_rows, assay, "row")
return(list(bc_columns = bc_columns, bc_rows = bc_rows))
}
# Input clusters, and how many observations there should be, i.e., the number of samples or features
.manipulate_bc_data <- function(bc_clusters, assay, row_col){
# Get right dimension
dim <- ifelse(row_col == "col", ncol(assay), nrow(assay))
# Get column/row names
if( row_col == "col" ){
names <- colnames(assay)
} else{
names <- rownames(assay)
}
# If no clusters were found, create one. Otherwise create additional cluster which
# contain those samples that are not included in clusters that were found.
if( nrow(bc_clusters) != dim ){
bc_clusters <- data.frame(cluster = rep(TRUE, dim))
} else {
# Create additional cluster that includes those samples/features that
# are not included in other clusters.
vec <- ifelse(rowSums(bc_clusters) > 0, FALSE, TRUE)
# If additional cluster contains samples, then add it
if ( any(vec) ){
bc_clusters <- cbind(bc_clusters, vec)
}
}
# Adjust row and column names
rownames(bc_clusters) <- names
colnames(bc_clusters) <- paste0("cluster_", 1:ncol(bc_clusters))
return(bc_clusters)
}
```
```{r biclust_4}
# Get biclusters
bcs <- .get_biclusters_from_biclust(bc, corr)
bicluster_rows <- bcs$bc_rows
bicluster_columns <- bcs$bc_columns
# Print biclusters for rows
head(bicluster_rows)
```
Let's collect information for the scatter plot.
```{r biclust_5}
# Function for obtaining sample-wise sum, mean, median, and mean variance for each cluster
.sum_mean_median_var <- function(tse1, tse2, assay_name1, assay_name2, clusters1, clusters2){
list <- list()
# Create a data frame that includes all the information
for(i in 1:ncol(clusters1) ){
# Subset data based on cluster
tse_subset1 <- tse1[clusters1[,i], ]
tse_subset2 <- tse2[clusters2[,i], ]
# Get assay
assay1 <- assay(tse_subset1, assay_name1)
assay2 <- assay(tse_subset2, assay_name2)
# Calculate sum, mean, median, and mean variance
sum1 <- colSums2(assay1, na.rm = T)
mean1 <- colMeans2(assay1, na.rm = T)
median1 <- colMedians(assay1, na.rm = T)
var1 <- colVars(assay1, na.rm = T)
sum2 <- colSums2(assay2, na.rm = T)
mean2 <- colMeans2(assay2, na.rm = T)
median2 <- colMedians(assay2, na.rm = T)
var2 <- colVars(assay2, na.rm = T)
list[[i]] <- data.frame(sample = colnames(tse1), sum1, sum2, mean1, mean2,
median1, median2, var1, var2)
}
return(list)
}
# Calculate info
df <- .sum_mean_median_var(mae[[1]], mae[[2]], "rclr", "nmr", bicluster_rows, bicluster_columns)
```
Now we can create a scatter plot. X-axis includes median clr abundance of microbiome
and y-axis median absolute concentration of each metabolite. Each data point represents
a single sample.
From the plots, we can see that there is low negative correlation in both cluster 1 and 3.
This means that when abundance of bacteria belonging to cluster 1 or 3 is higher,
the concentration of metabolites of cluster 1 or 3 is lower, and vice versa.
```{r biclust_6, fig.width=14, fig.height=6, fig.show="keep", out.width="33%"}
pics <- list()
for(i in seq_along(df)){
pics[[i]] <- ggplot(df[[i]]) +
geom_point(aes(x = median1, y = median2)) +
labs(title = paste0("Cluster ", i),
x = "Taxa (rclr median)",
y = "Metabolites (abs. median)")
print(pics[[i]])
}
# pics[[1]] + pics[[2]] + pics[[3]]
```
_pheatmap_ does not allow boolean values, so they must be converted into factors.
```{r biclust_7}
bicluster_columns <- data.frame(apply(bicluster_columns, 2, as.factor))
bicluster_rows <- data.frame(apply(bicluster_rows, 2, as.factor))
```
Again, we can plot clusters with heatmap.
```{r biclust_8, fig.width=10, fig.height=10}
# Adjust colors for all clusters
if( ncol(bicluster_rows) > ncol(bicluster_columns) ){
cluster_names <- colnames(bicluster_rows)
} else {
cluster_names <- colnames(bicluster_columns)
}
annotation_colors <- list()
for(name in cluster_names){
annotation_colors[[name]] <- c("TRUE" = "red", "FALSE" = "white")
}
# Create a heatmap
pheatmap(corr, cluster_cols = F, cluster_rows = F,
annotation_col = bicluster_columns,
annotation_row = bicluster_rows,
annotation_colors = annotation_colors)
```
## Taxa vs taxa
Third and final example deals with situation where we want to analyze correlation
between taxa. _biclust_ is suitable for this.
```{r biclust_9}
# Calculate cross-correlation
corr <- getExperimentCrossCorrelation(mae, 1, 1,
assay_name1 = "rclr", assay_name2 = "rclr",
mode = "matrix",
cor_threshold = 0.2, verbose = F, show_warning = F)
# Find biclusters
bc <- biclust(corr, method=BCPlaid(), fit.model = y ~ m,
background = TRUE, shuffle = 100, back.fit = 0, max.layers = 10,
iter.startup = 10, iter.layer = 100, verbose = FALSE)
```
```{r biclust_10}
# Get biclusters
bcs <- .get_biclusters_from_biclust(bc, corr)
bicluster_rows <- bcs$bc_rows
bicluster_columns <- bcs$bc_columns
```
```{r biclust_11}
# Create a column that combines information
# If row/column includes in multiple clusters, cluster numbers are separated with "_&_"
bicluster_columns$clusters <- apply(bicluster_columns, 1,
function(x){paste(paste(which(x)), collapse = "_&_") })
bicluster_columns <- bicluster_columns[, "clusters", drop = FALSE]
bicluster_rows$clusters <- apply(bicluster_rows, 1,
function(x){paste(paste(which(x)), collapse = "_&_") })
bicluster_rows <- bicluster_rows[, "clusters", drop = FALSE]
```
```{r biclust_12, fig.width=14, fig.height=12}
# Convert boolean values into factor
bicluster_columns <- data.frame(apply(bicluster_columns, 2, as.factor))
bicluster_rows <- data.frame(apply(bicluster_rows, 2, as.factor))
pheatmap(corr, cluster_cols = F, cluster_rows = F,
annotation_col = bicluster_columns,
annotation_row = bicluster_rows)
```
## Session Info {-}
```{r sessionInfo, echo=FALSE, results='asis'}
prettySessionInfo()
```