---
title: "MASH Null -- Real Data (GTEx random set)"
author: "Yuxin Zou"
date: 2018-08-29
output:
workflowr::wflow_html:
code_folding: hide
---
```{r}
library(mashr)
library(knitr)
library(kableExtra)
```
There are two random sets in the [GTEx summary data set](https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE). We do not know the true nulls in the real data. With individual-level data we could generate a null set by permutation; with only summary statistics, we instead select a null set by thresholding the z scores.
Using a q-value threshold of 0.05, the corresponding non-significant $|z|$ values are less than 3.5, so we select the samples with $\max_{r} |Z_{jr}| < 3.5$ from each random set as the null set. We estimate data-driven covariance matrices from data 1, estimate the noise correlation from data 2, fit the mash model on data 2, and compute posteriors on data 1.
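A minimal sketch of this pipeline, assuming the two null sets are available as matrices of effect estimates and standard errors named `B1`/`S1` (data 1) and `B2`/`S2` (data 2); these names, the choice of 5 PCs, and the use of mashr's simple null-correlation estimator are illustrative assumptions, not the exact code that produced the saved `.rds` objects:
```{r, eval=FALSE}
# Hypothetical inputs: B1/S1 and B2/S2 are J x R matrices of effect
# estimates and standard errors for the two null sets.
data1 = mash_set_data(B1, S1)
data2 = mash_set_data(B2, S2)

# Noise correlation estimated from data 2, applied to both data sets.
Vhat  = estimate_null_correlation_simple(data2)
data1 = mash_update_data(data1, V = Vhat)
data2 = mash_update_data(data2, V = Vhat)

# Data-driven covariances from data 1: PCA initialization refined by
# Extreme Deconvolution (components such as `ED_tPCA` typically come
# from ED initialized with (t)PCA-based matrices).
U.pca = cov_pca(data1, 5)
U.ed  = cov_ed(data1, U.pca)

# Fit mash on data 2 with canonical + data-driven covariances, then
# compute posteriors on data 1 with the fitted g held fixed.
U.c = cov_canonical(data2)
m2  = mash(data2, c(U.c, U.ed))
m1  = mash(data1, g = get_fitted_g(m2), fixg = TRUE)
```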
```{r}
data = readRDS('../output/GTEx_3.5_nullData.rds')
model = readRDS('../output/GTEx_3.5_nullModel.rds')
```
Sample size:
```{r}
samplesize = matrix(c(nrow(data$m.data1.null$Bhat), nrow(data$m.data2.null$Bhat)))
row.names(samplesize) = c('data 1', 'data 2')
colnames(samplesize) = 'number of samples'
samplesize %>% kable() %>% kable_styling()
```
The estimated weights from the data 2 fit are:
```{r}
barplot(get_estimated_pi(model), las=2, cex.names = 0.7)
```
The estimated weight $\hat{\pi}$ on the null component is not large. However, the weights on the other covariance structures may concentrate on the small grid values (small $\omega_{l}$), so those components are effectively null even though the plot cannot show this. I modified the `get_estimated_pi` function to accept a threshold on the grid: weights on grid values at or below the threshold are merged into the null component.
```{r}
get_estimated_pi = function(m, dimension = c("cov", "grid", "all"), thres = NULL){
  dimension = match.arg(dimension)
  if (dimension == "all") {
    # get_estimated_pi_no_collapse is not exported by mashr, so this
    # overriding version needs `:::` to reach it.
    mashr:::get_estimated_pi_no_collapse(m)
  } else {
    g = get_fitted_g(m)
    pihat = g$pi
    pihat_names = NULL
    pi_null = NULL
    if (g$usepointmass) {
      pihat_names = c("null", pihat_names)
      pi_null = pihat[1]
      pihat = pihat[-1]
    }
    # Reshape: one row per covariance matrix, one column per grid value.
    pihat = matrix(pihat, nrow = length(g$Ulist))
    keep = rep(TRUE, length(g$grid))
    if (!is.null(thres)) {
      # Merge the weights on grid values <= thres into the null part.
      keep = g$grid > thres
      if (is.null(pi_null)) {
        pi_null = 0
        pihat_names = c("null", pihat_names)
      }
      pi_null = pi_null + sum(pihat[, !keep])
      pihat = pihat[, keep, drop = FALSE]
    }
    if (dimension == "cov") {
      pihat = rowSums(pihat)
      pihat_names = c(pihat_names, names(g$Ulist))
    } else if (dimension == "grid") {
      pihat = colSums(pihat)
      # Label by the indices of the grid values that survive the threshold.
      pihat_names = c(pihat_names, which(keep))
    }
    pihat = c(pi_null, pihat)
    names(pihat) = pihat_names
    return(pihat)
  }
}
```
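With `thres = 0.01`, the weights on grid values $\omega_{l} \leq 0.01$ are folded into the null bar: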
```{r}
barplot(get_estimated_pi(model, thres = 0.01), las=2, cex.names = 0.7, main='Estimated pi with threshold (0.01) in grid')
```
The correlation matrix corresponding to the `ED_tPCA` covariance component is:
```{r}
corrplot::corrplot(cov2cor(model$fitted_g$Ulist[['ED_tPCA']]))
```
There are `r length(get_significant_results(model))` significant samples in data 1.
## Permute samples in each condition to break the sharing
`mash` gains power because it models the sharing of effects among conditions. To break this sharing, we permute the samples within each condition independently, as sketched below.
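A minimal sketch of the permutation, assuming the summary statistics are held in a $J \times R$ z-score matrix; the function name `perm_within_conditions` and the matrix `Z` are illustrative assumptions:
```{r, eval=FALSE}
# Z is a hypothetical J x R matrix of z scores
# (rows = gene-SNP pairs, columns = conditions/tissues).
# Permuting the rows of each column independently destroys any sharing
# of effects across conditions while preserving each condition's
# marginal distribution.
perm_within_conditions = function(Z, seed = 1) {
  set.seed(seed)
  apply(Z, 2, sample)
}
Z.perm = perm_within_conditions(Z)
```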
```{r}
data = readRDS('../output/GTEx_3.5_nullPermData.rds')
model = readRDS('../output/GTEx_3.5_nullPermModel.rds')
```
Sample size:
```{r}
samplesize = matrix(c(nrow(data$m.data1.p.null$Bhat), nrow(data$m.data2.p.null$Bhat)))
row.names(samplesize) = c('data 1', 'data 2')
colnames(samplesize) = 'number of samples'
samplesize %>% kable() %>% kable_styling()
```
The estimated weights from the data 2 fit are:
```{r}
barplot(get_estimated_pi(model), las=2, cex.names = 0.7, main = 'Estimated pi')
```
```{r}
barplot(get_estimated_pi(model, thres = 0.01), las=2, cex.names = 0.7, main='Estimated pi with threshold (0.01) in grid')
```
There are `r length(get_significant_results(model))` significant samples in data 1, so there is no sign of overfitting.