---
title: "MASH Null -- Real Data (GTEx random set)"
author: "Yuxin Zou"
date: 2018-08-29
output:
workflowr::wflow_html:
code_folding: hide
---
```{r}
library(mashr)
library(knitr)
library(kableExtra)
```
There are two random sets in the [GTEx summary data set](https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds?raw=TRUE). We do not know the true nulls in the real data. With individual-level data we could generate a null set by permutation; with only summary statistics, we instead select a null set by thresholding the z scores.
Using a q-value threshold of 0.05, the corresponding non-significant $|z|$ values are less than 3.5, so we select the samples with $\max_{r} |Z_{jr}| < 3.5$ from each random set as the null set. We estimate data-driven covariance matrices from data 1, estimate the noise correlation from data 2, fit the mash model on data 2, and compute posteriors on data 1.
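A minimal sketch of this pipeline, assuming the two null sets are available as matrices of effect estimates and standard errors named `B1`/`S1` (data 1) and `B2`/`S2` (data 2); these names, the choice of 5 PCs, and the use of mashr's simple null-correlation estimator are illustrative assumptions, not the exact code that produced the saved `.rds` objects:
```{r, eval=FALSE}
# Hypothetical inputs: B1/S1 and B2/S2 are J x R matrices of effect
# estimates and standard errors for the two null sets.
data1 = mash_set_data(B1, S1)
data2 = mash_set_data(B2, S2)

# Noise correlation estimated from data 2, applied to both data sets.
Vhat  = estimate_null_correlation_simple(data2)
data1 = mash_update_data(data1, V = Vhat)
data2 = mash_update_data(data2, V = Vhat)

# Data-driven covariances from data 1: PCA initialization refined by
# Extreme Deconvolution (components such as `ED_tPCA` typically come
# from ED initialized with (t)PCA-based matrices).
U.pca = cov_pca(data1, 5)
U.ed  = cov_ed(data1, U.pca)

# Fit mash on data 2 with canonical + data-driven covariances, then
# compute posteriors on data 1 with the fitted g held fixed.
U.c = cov_canonical(data2)
m2  = mash(data2, c(U.c, U.ed))
m1  = mash(data1, g = get_fitted_g(m2), fixg = TRUE)
```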
```{r}
data = readRDS('../output/GTEx_3.5_nullData.rds')
model = readRDS('../output/GTEx_3.5_nullModel.rds')
```
Sample size:
```{r}
samplesize = matrix(c(nrow(data$m.data1.null$Bhat), nrow(data$m.data2.null$Bhat)))
row.names(samplesize) = c('data 1', 'data 2')
colnames(samplesize) = 'number of samples'
samplesize %>% kable() %>% kable_styling()
```
The estimated weights from the data 2 fit are:
```{r}
barplot(get_estimated_pi(model), las=2, cex.names = 0.7)
```
The estimated weight $\hat{\pi}$ on the null component is not large. However, the weights on the other covariance structures may concentrate on the small grid values (small $\omega_{l}$), so those components are effectively null even though the plot cannot show this. I modified the `get_estimated_pi` function to accept a threshold on the grid: weights on grid values at or below the threshold are merged into the null component.
```{r}
get_estimated_pi = function(m, dimension = c("cov", "grid", "all"), thres = NULL){
  dimension = match.arg(dimension)
  if (dimension == "all") {
    # get_estimated_pi_no_collapse is not exported by mashr, so this
    # overriding version needs `:::` to reach it.
    mashr:::get_estimated_pi_no_collapse(m)
  } else {
    g = get_fitted_g(m)
    pihat = g$pi
    pihat_names = NULL
    pi_null = NULL
    if (g$usepointmass) {
      pihat_names = c("null", pihat_names)
      pi_null = pihat[1]
      pihat = pihat[-1]
    }
    # Reshape: one row per covariance matrix, one column per grid value.
    pihat = matrix(pihat, nrow = length(g$Ulist))
    keep = rep(TRUE, length(g$grid))
    if (!is.null(thres)) {
      # Merge the weights on grid values <= thres into the null part.
      keep = g$grid > thres
      if (is.null(pi_null)) {
        pi_null = 0
        pihat_names = c("null", pihat_names)
      }
      pi_null = pi_null + sum(pihat[, !keep])
      pihat = pihat[, keep, drop = FALSE]
    }
    if (dimension == "cov") {
      pihat = rowSums(pihat)
      pihat_names = c(pihat_names, names(g$Ulist))
    } else if (dimension == "grid") {
      pihat = colSums(pihat)
      # Label by the indices of the grid values that survive the threshold.
      pihat_names = c(pihat_names, which(keep))
    }
    pihat = c(pi_null, pihat)
    names(pihat) = pihat_names
    return(pihat)
  }
}
```
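With `thres = 0.01`, the weights on grid values $\omega_{l} \leq 0.01$ are folded into the null bar: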
```{r}
barplot(get_estimated_pi(model, thres = 0.01), las=2, cex.names = 0.7, main='Estimated pi with threshold (0.01) in grid')
```
The correlation matrix corresponding to the `ED_tPCA` covariance component is:
```{r}
corrplot::corrplot(cov2cor(model$fitted_g$Ulist[['ED_tPCA']]))
```
There are `r length(get_significant_results(model))` significant samples in data 1.
## Permute samples in each condition to break the sharing
`mash` gains power because it models the sharing of effects among conditions. To break this sharing, we permute the samples within each condition independently, as sketched below.
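A minimal sketch of the permutation, assuming the summary statistics are held in a $J \times R$ z-score matrix; the function name `perm_within_conditions` and the matrix `Z` are illustrative assumptions:
```{r, eval=FALSE}
# Z is a hypothetical J x R matrix of z scores
# (rows = gene-SNP pairs, columns = conditions/tissues).
# Permuting the rows of each column independently destroys any sharing
# of effects across conditions while preserving each condition's
# marginal distribution.
perm_within_conditions = function(Z, seed = 1) {
  set.seed(seed)
  apply(Z, 2, sample)
}
Z.perm = perm_within_conditions(Z)
```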
```{r}
data = readRDS('../output/GTEx_3.5_nullPermData.rds')
model = readRDS('../output/GTEx_3.5_nullPermModel.rds')
```
Sample size:
```{r}
samplesize = matrix(c(nrow(data$m.data1.p.null$Bhat), nrow(data$m.data2.p.null$Bhat)))
row.names(samplesize) = c('data 1', 'data 2')
colnames(samplesize) = 'number of samples'
samplesize %>% kable() %>% kable_styling()
```
The estimated weights from the data 2 fit are:
```{r}
barplot(get_estimated_pi(model), las=2, cex.names = 0.7, main = 'Estimated pi')
```
```{r}
barplot(get_estimated_pi(model, thres = 0.01), las=2, cex.names = 0.7, main='Estimated pi with threshold (0.01) in grid')
```
There are `r length(get_significant_results(model))` significant samples in data 1, so there is no sign of overfitting.