forked from jean997/GFA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
sample_cor_ld_sims.Rmd
224 lines (183 loc) · 9.62 KB
/
sample_cor_ld_sims.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
---
title: "Simulations studying effects of sample overlap, LD, and variant selection"
author: "Jean Morrison"
date: "2020-04-03"
output: workflowr::wflow_html
editor_options:
chunk_output_type: console
---
## Introduction
This document summarizes some simulatione I conducted looking at three issues:
1. Methods for correcting for sample overlap
2. Methods for accounting for LD
3. The effect of variant selection
The model:
$$
\hat{B}_{J\times M} = L_{K \times J}^{T}F_{K\times M} + \Theta_{J\times M} + E_{J\times M}
$$
throughout we will work with $z$-scores instead of effect estimates and assume that the low rank structure of $\hat{Z}$ is the same as the low-rank structure of $\hat{B}$, so the model is
$$
\hat{Z}_{J\times M} = L_{K \times J}^{T}F_{K\times M} + \Theta_{J\times M} + E_{J\times M}
$$
## Sample Overlap
If two traits are measured in the same samples or overlapping samples, then effect estimates will be correlated if the traits are correlated due to non-genetic factors, leading to correlation in the rows of $E$. I wrote about this [previously](simulations2.html) and tested a method of correcting in which we post-multiply the matrix of effect estimates by eigenvectors of $R$, the row correlation of $E$. This works in our simulations so far but has a few problems:
a) we will estimate $FU$ instesad of $F$ which means we are assuming $FU$ is sparse. This may not be true and we lose the advantage of "knowing" that $F$ is sparse
b) The elements of $\Theta U$ are not independent even if the elements of $\Theta$ are. I have not run simulations with $\Theta$ yet so this has not caused a problem but it could.
There is an alternative approach in which we fit the problem with fixed factors representing the eigenvectors of $R$. This strategy is based on ideas from Matthew and Jason [here](https://willwerscheid.github.io/FLASHvestigations/arbitraryV.html). Briefly,
let $\lambda_{min}$ be the smallest eigenvalue of $R$ and let $W = R - \lambda_{min}I$. Let $v_1, \dots, v_M$ be eigenvectors of $W$ and $a_1, \dots, a_M$ eigenvalues of $W$.
Then we can decompose $E$ as
$$
E = \sum_{k=1}^n v_k t_k^T + E^{\prime}
$$
where the elements of $E^{\prime}$ are iid $E^{\prime}_{ij} \sim N(0, \lambda_{min})$ ant $t_k$ are $J\times 1$ vectors with iid elements $t_ki \sim N(0, a_k)$.
This means that we can fit the model by including $M$ fixed factors $v_1, \dots, v_M$ and specifying the prior for the associated loadings.
This first set of simulations explores 1) how well does the fixed factor method work? and 2) Can we estimate $R$ from null variants. These simulations do not include $\Theta$ and $\hat{B}$ are generated directly from the model.
Simulations:
+ F 20 × 5 fixed over all simulations
+ 10k variants (∼72% null for all traits)
+ R either blocks or a fixed randomly generated correlation matrix
Estimation:
+ R known or estimated from variants with large p-values
+ Naive or eigenvector corrected fit
```{r}
library(tidyverse)
res <- readRDS("analysis_data/2020-04-28_three_traitcor.RDS")
res$ebmf.thresh[res$ebmf=="flashier_corrected2"] <- 1
plt1 <- res %>%
filter(eval=="rrmse" & ebmf.thresh==1 & simulate.which_R !="identity" ) %>%
mutate(fit = str_replace(ebmf, "fit_", "") %>%
str_replace(., "flashier_", "") %>%
paste0(., "_", ebmf.R_thresh)) %>%
#filter(fit != "corrected_0.1") %>%
mutate(fit = recode(fit, "corrected_0.05" = "EV 0.05",
"corrected_0.1" = "EV 0.1",
"corrected2_0.05" = "FF 0.05",
"corrected2_0.1" = "FF 0.1",
"corrected_oracle" = "EV oracle",
"corrected2_oracle" = "FF oracle",
"naive_NA" = "naive")) %>%
ggplot(.) +
geom_boxplot(aes(y=eval.value, x=fit, group=fit,
col=fit)) +
scale_color_discrete(guide=FALSE) +
ylab("RRMSE") +
facet_wrap(~simulate.which_R) +
theme_bw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(size=14, angle=90),
axis.title.y = element_text(size=16),
axis.text = element_text(size=14),
strip.text = element_text(size=16))
plt1
plt2 <- res %>%
filter(eval=="factor_recovery" & ebmf.thresh==1 & simulate.which_R !="identity" ) %>%
mutate(fit = str_replace(ebmf, "fit_", "") %>%
str_replace(., "flashier_", "") %>%
paste0(., "_", ebmf.R_thresh)) %>%
#filter(fit != "corrected_0.1") %>%
mutate(fit = recode(fit, "corrected_0.05" = "EV 0.05",
"corrected_0.1" = "EV 0.1",
"corrected2_0.05" = "FF 0.05",
"corrected2_0.1" = "FF 0.1",
"corrected_oracle" = "EV oracle",
"corrected2_oracle" = "FF oracle",
"naive_NA" = "naive")) %>%
ggplot(.) +
geom_boxplot(aes(y=eval.value, x=fit, group=fit,
col=fit)) +
scale_color_discrete(guide=FALSE) +
ylab("Best Factor Correlation") +
theme(axis.title.x = element_blank(),
axis.title.y = element_text(size=16),
axis.text.x = element_text(size=14, angle=90),
axis.text.y = element_text(size=14),
strip.text = element_text(size=14)) +
facet_wrap(~simulate.which_R*eval.f, ncol=5)
plt2
```
The eigenvector correction (EV) is working as expected but the fixed factor correction (FF) is not working at all.
## LD
Another issue is linkage disequilibrium (non-indpendence between variants). This induces correlation in the columns of $E$ and changes the expected value of $\hat{B}_{m,j}$. Note that sample overlap only induces correlation. If variants have correlation $A$ then
$$
\hat{Z}_{\bullet, m} \sim N(AZ_{\bullet, m}, A)
$$
It may be that if we are mainly interested in recovering $F$, LD doesn't have much of an effect on our estimates and we can simply LD prune. Here I tested that hypothesis in simple simulations. In these simulations:
+ $F$ is fixed over all simulations and is $20\times 5$. (20 traits, 5 hidden factors)
+ There are 1,000 LD blocks, each containing exactly one effect variant.
+ The expected number of variants per LD block is 5. In one setting "5 per block", there are exactly 5 variants in every block. In the setting "random block size", the number of variants is drawn from an exponential distribution with mean 5 (and then rounded to the nearest integer).
Because there is exactly one effect variant per block there is a natural oracle or best case soclution in which you simply estimate $F$ using only true effect variants. I compared this with a strategy of using all variants and a strategy of choosing one variant per block, selecting the variant with the lowest minimum $p$-value across all traits.
```{r}
res <- readRDS("analysis_data/2020-04-01_one.RDS")
res <- res %>%
mutate(ebmf = recode(ebmf, "flashier_all" = "all",
"flashier_effect" = "oracle",
"flashier_top" = "ld prune"),
ebmf = factor(ebmf, levels=c("oracle", "all", "ld prune")),
simulate.block_size = recode(simulate.block_size, "fixed" = "5 per block",
"rexp" = "random block size"))
### RRMSE
plt1 <- res %>%
filter(eval=="rrmse_imp" ) %>%
ggplot(.) +
geom_boxplot(aes(y=eval.value, x=ebmf, group=ebmf,
col=ebmf)) +
scale_color_discrete(guide=FALSE) +
facet_wrap(~simulate.block_size) +
ylab("RRMSE") +
theme_bw() +
theme(axis.title.x = element_blank(),
axis.title.y = element_text(size=16),
axis.text = element_text(size=14),
strip.text = element_text(size=16))
plt1
```
```{r, plot_fct}
plt2 <- res %>%
filter(eval=="factor_recovery" ) %>%
ggplot(.) +
geom_boxplot(aes(y=eval.value, x=ebmf, group=ebmf,
col=ebmf)) +
scale_color_discrete(guide=FALSE) +
ylab("Best Factor Correlation") +
facet_wrap(~simulate.block_size*eval.f, ncol=5)+
theme(axis.title.x = element_blank(),
axis.text.x = element_text(size=14, angle=90),
axis.title.y = element_text(size=16),
axis.text = element_text(size=14),
strip.text = element_text(size=10))
plt2
```
## Variant Selection
Finally, the issue of variant selection. In these simulations I fit $L$ and $F$ using only variants that have cross-trait minimum $p$-value below some threshold. These simulations are the same as the sample overlap simulations except that there is no row correlation in $E$.
```{r}
res <- readRDS("analysis_data/2020-04-03_one_varselect.RDS")
plt1 <- res %>%
filter(eval=="rrmse" & simulate.which_R =="identity" & ebmf=="fit_naive" ) %>%
mutate(ebmf.thresh = factor(ebmf.thresh, levels =c(1, 0.01,1e-3, 1e-4))) %>%
ggplot(.) +
geom_boxplot(aes(y=eval.value, x=ebmf.thresh, group=ebmf.thresh,
col=ebmf.thresh)) +
scale_color_discrete(guide=FALSE) +
ylab("RRMSE") + xlab("Threshold") +
theme_bw() +
theme(axis.title = element_text(size=16),
axis.text = element_text(size=14),
strip.text = element_text(size=16))
plt1
plt2 <- res %>%
filter(eval=="factor_recovery" & simulate.which_R =="identity" & ebmf=="fit_naive" ) %>%
mutate(ebmf.thresh = factor(ebmf.thresh, levels =c(1, 0.01,1e-3, 1e-4))) %>%
ggplot(.) +
geom_boxplot(aes(y=eval.value, x=ebmf.thresh, group=ebmf.thresh,
col=ebmf.thresh)) +
scale_color_discrete(guide=FALSE) +
ylab("Best Factor Correlation") + xlab("Threshold") +
theme_bw()+
theme(axis.title.x = element_blank(),
axis.title.y = element_text(size=16),
axis.text.x = element_text(size=14, angle=90),
axis.text.y = element_text(size=14),
strip.text = element_text(size=14)) +
facet_wrap(~eval.f, ncol=5)
plt2
```