Cardio is the Linux cluster once used by the Cardiovascular Epidemiology Unit at the University of Cambridge.
/scratch/jhz22/data/INTERVAL
location: /scratch/jhz22/INF/sumstats/INTERVAL
location: /scratch/jhz22/INF/plots
location: /scratch/jhz22/INF/METAL
See https://bitbucket.org/nygcresearch/ldetect-data
location: /scratch/jhz22/INF/aild
- cojo/, GCTA --cojo-slct results for all (INF1*) and individual proteins
- ps/, PhenoScanner v2 results, coded as in /scratch/jhz22/INF/cardio/ps.sh
- clump/, PLINK --clump results for all (INF1*) and individual proteins
location: /scratch/jhz22/INF/snps
- cojo/, GCTA --cojo-slct results for all (INF1*) and individual proteins
- ps/, PhenoScanner v2 results.
- lz/, LocusZoom plots
- clump/, PLINK --clump results for all (INF1*) and individual proteins
- clump0/, PLINK --clump-r2 0 results for independent signals
location: /scratch/jhz22/INF/clumping
Files: study.reference.LD-cutoff.clumped_results, e.g., INTERVAL.UK10K+1KG.r2-0.1.clumped
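The naming convention can be read back programmatically. The sketch below is illustrative only: the function name `parse_clumped_name` and the regular expression are assumptions inferred from the single example name above, not part of the original scripts. A regular expression is used rather than a plain split on `.` because the LD cutoff itself contains a dot.

```python
import re

# Pattern assumed from the example name: study.reference.r2-<cutoff>.clumped
CLUMPED_NAME = re.compile(
    r"(?P<study>[^.]+)\.(?P<reference>[^.]+)\.r2-(?P<r2>\d+(?:\.\d+)?)\.clumped"
)

def parse_clumped_name(filename):
    """Split a clumping result name into study, reference panel and r2 cutoff."""
    match = CLUMPED_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match["study"], match["reference"], float(match["r2"])

print(parse_clumped_name("INTERVAL.UK10K+1KG.r2-0.1.clumped"))
# ('INTERVAL', 'UK10K+1KG', 0.1)
```

Names with further suffixes would need the pattern to be extended.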
location: /scratch/jhz22/INF/cojo
location: /scratch/jhz22/INF/aild-ld
- aild-indel/, all variants
- ps/, PhenoScanner v2 and v1.1
- aild-snp/, SNPs only
Results from PhenoScanner, comprising both protein-specific and overall (INF1_* and INF1r_*, see below) results.
# those INF1_*
# All results, no LD, p=1e-5
# phenoscanner -c All -l No -p 0.00001 -i INF1.ps -o INF1
# revised
# All results, no LD, p=1e-7
phenoscanner -c All -l No -p 0.0000001 -i INF1.ps -o INF1r
The GWAS results with LD r2=0.6 in INF1r_* can be obtained from R as follows,
R --no-save -q <<END
library(MendelianRandomization)
# Queries by SNPID are not possible
INF1 <- read.table("INF1.ps",as.is=TRUE)
# Use rsIDs instead; queries are limited to batches of 100
INF1 <- with(read.csv("INF1_PhenoScanner_SNP_Info.csv"),rsID)
r1 <- phenoscanner(snpquery= INF1[1:100], proxies = "EUR", pvalue = 1e-07, r2= 0.6, build=37)
r2 <- phenoscanner(snpquery=INF1[101:200], proxies = "EUR", pvalue = 1e-07, r2= 0.6, build=37)
r3 <- phenoscanner(snpquery=INF1[201:300], proxies = "EUR", pvalue = 1e-07, r2= 0.6, build=37)
r4 <- phenoscanner(snpquery=INF1[301:376], proxies = "EUR", pvalue = 1e-07, r2= 0.6, build=37)
save(r1,r2,r3,r4,file="INF1r.rda",version=2)
END
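The four hard-coded slices above follow from the batch limit of 100. As a language-neutral illustration of the same batching logic, here is a minimal Python sketch; the helper `batches` and the placeholder identifiers are hypothetical and do not query PhenoScanner.

```python
def batches(ids, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# Placeholder identifiers standing in for the 376 rsIDs queried above
rsids = [f"rs{i}" for i in range(1, 377)]
print([len(batch) for batch in batches(rsids)])
# [100, 100, 100, 76]
```

Deriving the slices from the list length avoids editing the indices by hand when the number of variants changes.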
1. Set up 1703 autosomal regions as defined in EURLD.bed.
2. Extract variants outside the regions in high LD, leaving 1672 regions, by EURLD.sh.
3. Overlap regions and GWAS sumstats:
   - Tag GWAS sumstats with regions through aild-rma.sb.
   - Pair each protein with the regions containing genomewide significant signals by aild-list.sb.
   - Independently, list variants by region in the reference panel by aild-snplist.sb.
4. Clump via aild-clump.sb.
5. Run cojo via aild-cojo.sb.
6. Downstream analyses with PhenoScanner (preferably v2) as in ps.sh and forest plots on TRYGGVE (with study-specific sumstats).
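Tagging GWAS sumstats with regions amounts to an interval lookup per variant. The sketch below shows one way to do this with a binary search. It is not the content of aild-rma.sb: the block coordinates, the labels, and the function `tag_variant` are hypothetical, and the half-open BED-style intervals are an assumption about EURLD.bed.

```python
from bisect import bisect_right

# Hypothetical LD blocks on one chromosome as (start, end, label), sorted by
# start and treated as half-open [start, end); real blocks come from EURLD.bed.
BLOCKS = [
    (10_000, 1_500_000, "region1"),
    (1_500_000, 3_200_000, "region2"),
    (3_200_000, 4_800_000, "region3"),
]
STARTS = [start for start, _, _ in BLOCKS]

def tag_variant(position):
    """Return the label of the block containing a position, or None."""
    index = bisect_right(STARTS, position) - 1
    if index < 0:
        return None
    start, end, label = BLOCKS[index]
    return label if start <= position < end else None

print([tag_variant(p) for p in (5_000, 1_499_999, 1_500_000, 5_000_000)])
# [None, 'region1', 'region2', None]
```

Variants falling outside every block return None, which mirrors dropping variants that lie in excluded regions.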
The regions are predefined. As shown in EURLD.tsv by EURLD.R, the LD patterns across the genome are more variable than is assumed in a typical genomewide association analysis, so sliding windows such as 250kb (36), 500kb (300), or even 10Mb (1071) may not be appropriate, given that the sentinel variant does not necessarily lie right in the middle of a window. The number of signals in our case was close to GCTA but overestimated (by 53 with PLINK), as in the following table. For instance, it is often seen from the PLINK --clump-range outputs that sliding windows can give results in two neighbouring LD blocks.
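To illustrate how a sliding window can straddle two neighbouring LD blocks, here is a minimal sketch. The block boundaries and the function `blocks_hit_by_window` are hypothetical, and the window is assumed to extend a fixed flank on either side of the sentinel.

```python
# Hypothetical LD block boundaries (bp) on one chromosome, half-open [start, end)
BLOCKS = [(0, 900_000), (900_000, 1_400_000), (1_400_000, 3_000_000)]

def blocks_hit_by_window(sentinel, flank):
    """Return indices of blocks overlapping [sentinel - flank, sentinel + flank]."""
    lower, upper = sentinel - flank, sentinel + flank
    return [
        index
        for index, (start, end) in enumerate(BLOCKS)
        if start <= upper and lower < end
    ]

# A sentinel near the edge of block 0: a 250kb flank also reaches block 1.
print(blocks_hit_by_window(850_000, 250_000))    # [0, 1]
# A sentinel well inside block 2 stays within that block.
print(blocks_hit_by_window(2_200_000, 250_000))  # [2]
```

With block-based analysis the first sentinel would be handled within block 0 only, whereas the window also picks up variants from block 1.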
Note that pairing proteins with regions of interest reduces the burden of a genomewide analysis, and that a region-specific reference does not affect the results from steps 4 and 5 with respect to the variants used from the GWAS sumstats.
Steps 4 and 5 both use INF1.aild, which contains all the protein-region pairs. The results are classified as in analysis.sh. In particular, for step 5 this is done with aild.sh.
Run | Option | cis | trans | total | Comments/location+ |
---|---|---|---|---|---|
GCTA | |||||
1 | LD blocks | 210 | 147 | 357 | only SNPs, cojo/aild-snp/INF1.jma.*, also doc/INF1.paper.xlsx |
+ indels | LD blocks | 220 | 155 | 375 | SNPs+indels, cojo/aild-indel/INF1.jma.* |
2 | default | 234 | 173 | 407 | --cojo-collinear 0.9 --cojo-wind 10000, doc/SCALLOP_INF1-260419.xlsx |
3 | small R2 & window | 189 | 186 | 375 | --cojo-collinear 0.1 --cojo-wind 500, doc/SCALLOP_INF1-260419.xlsx |
PLINK | |||||
4 | LD blocks | 594 | 252 | 846 | only SNPs, clumping/aild-snp/INF1.jma.*, also doc/INF1.paper.xlsx |
+ indels | LD blocks | 621 | 258 | 879 | SNPs+indels, clumping/aild-indel/INF1.jma.* |
5 | INTERVAL LD panel | 657 | 275 | 932 | --clump-r2 0.1 --clump-kb 500, doc/SCALLOP_INF1-120419.xlsx |
6 | 1000G LD panel | 405 | 229 | 634 | --clump-r2 0.1 --clump-kb 500, clumping/INF1.1KG.r2-0.1.clumped.* |
7 | INTERVAL data | 424 | 188 | 612 | --clump-r2 0.1 --clump-kb 500, doc/SCALLOP_INF1-120419.xlsx |
8 | 1000G LD panel | 402 | 226 | 628 | --clump-r2 0.1 --clump-kb 1000, on tryggve |
+The directories are relative to /scratch/jhz22/INF, i.e., doc/, cojo/ and clumping/. Results in 2 and 3 include regions in high LD excluded in other analyses.
A few observations can be made:
- Including indels leads to more signals in both the cojo (1) and clumping (4) analyses.
- The default GCTA --cojo-collinear and --cojo-wind parameters did quite well in numbers, especially once the ~30 regions in high LD are taken away (1, 2).
- Although the numbers look close, GCTA --cojo-collinear 0.1 produces considerably fewer signals than --cojo-collinear 0.9 (3).
- The number of signals increases with the values of the GCTA parameters (2, 3), yet moderate changes in the LD window have less impact than the reference panel (5, 8).
- PLINK --clump gives more signals than GCTA --cojo (1, 4 and 2, 5).
- Specifying sliding LD windows in clumping, disregarding AILD patterns, gives 53 additional signals (932 in 5 versus 879 in 4 with indels).
- Thanks to its larger sample size and perhaps greater number of variants, INTERVAL as the LD reference leads to more signals than 1000 Genomes (5, 6).
- Summary statistics from a larger sample size give more signals (5, 7).
- Unpruned results are likely to give more cis signals, but this is subject to scrutiny, perhaps on individual cases.
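The counts behind these comparisons can be checked with a few lines of arithmetic. The sketch below copies the cis, trans and total columns from the table; the run labels and helper names are chosen here for illustration only.

```python
# (run, cis, trans, total) copied from the table above
ROWS = [
    ("1", 210, 147, 357), ("1 + indels", 220, 155, 375),
    ("2", 234, 173, 407), ("3", 189, 186, 375),
    ("4", 594, 252, 846), ("4 + indels", 621, 258, 879),
    ("5", 657, 275, 932), ("6", 405, 229, 634),
    ("7", 424, 188, 612), ("8", 402, 226, 628),
]
TOTAL = {run: total for run, _, _, total in ROWS}

def inconsistent_rows():
    """Return runs whose cis and trans counts do not add up to the total."""
    return [run for run, cis, trans, total in ROWS if cis + trans != total]

def difference(run_a, run_b):
    """Total signals in run_a minus total signals in run_b."""
    return TOTAL[run_a] - TOTAL[run_b]

print(inconsistent_rows())               # []
print(difference("5", "4 + indels"))     # 53
print(difference("2", "3"))              # 32
print(difference("5", "6"))              # 298
```

All rows are internally consistent, and the 53 additional signals correspond to run 5 against run 4 with indels.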
It can be concluded that it is desirable to employ approximately independent LD blocks for both GCTA (1) and PLINK (4), and that a reference panel such as UK10K+1KG would be desirable with respect to both sample size and number of variants.
Date last changed: 23/8/2021