Table of Contents
rareGWAMA is a flexible software package for imputation based GWAS meta-analysis.
It is developed and maintained by Dajiang Liu's Group.
Liu DJ*†, Peloso GM*, Zhan X*, Holmen O*, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem, K, Melander O, Kathiresan S, Abecasis GR†
Meta-analysis of gene-level tests of rare variant association, Nature Genetics, 46, 200–204 (2014)
doi: 10.1038/ng.2852.
The package is hosted on github, which allows installation and update to be very easy. First, make sure you have the mvtnorm
and data.table
packages installed:
install.packages("devtools")
And also, you need the latest version of seqminer:
library(devtools)
devtools::install_github("zhanxw/seqminer")
Then you could use:
install_github("dajiangliu/rareGWAMA")
With library(rareGWAMA)
, your are ready to go!
1.The very basic test is using:
res <- rareGWAMA.single(score.stat.file=study, imp.qual.file =imp.qual, tabix.range="1:11000-58000", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ",gc=FALSE, rmMultiAllelicSite=TRUE);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- tabix.range: The tabix range, which must be in quote and provided as a string like this: "1:11000-58000";
- alternative: The alternative hypothesis. Default is two.sided;
- col.impqual: The column number for the imputation quality score;
- impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;
- impQualWeight: Using imputation quality as weight;
- rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;
2.The out put should be as follows:
head(res$res.formatted))
POS REF ALT AF STAT PVALUE BETA SD N DIRECTION EFFECTIVE_N
[1,] "9:100000040" "G" "A" "2.28e-05" "1.01e+00" "3.16e-01" " 7.29e-01" " 0.72624" "41634" "XXXXXXXXXXXXXXXXXXXXXXXXXXXX++XX" "7925"
[2,] "9:100000056" "T" "TA" "7.07e-01" "1.62e+00" "2.04e-01" "-9.09e-03" " 0.00715" "47219" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--XX" "46951"
[3,] "9:100000102" "G" "C" "6.91e-04" "8.00e-01" "3.71e-01" "-7.78e-02" " 0.08691" "95869" "+++---+-------++XX-++--X++----XX" "36378"
[4,] "9:100000172" "C" "T" "2.26e-05" "2.25e-01" "6.35e-01" "-1.99e-01" " 0.41982" "125656" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--X+" "20434"
[5,] "9:100000177" "C" "G" "2.16e-04" "8.38e-01" "3.60e-01" "-1.04e-01" " 0.11335" "179891" "-+---X+XX-+--+-+-X-+-+-X++-+--X+" "89179"
[6,] "9:100000187" "A" "T" "1.13e-04" "6.78e-01" "4.10e-01" "-2.35e-01" " 0.28514" "54235" "+-++-XX+--++--+-XX+++++X----XXXX" "21285"
- For demo data, please see
?rareGWAMA.single
.
1.The command should be like:
res <- rareGWAMA.cond.single(score.stat.file=study, imp.qual.file=imp.qual, vcf.ref.file="{$your_path}/ALL.chr9.phase3.genotypes.vcf.gz", candidateVar="9:97018619", knownVar="9:100000172", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ", gc=FALSE, rmMultiAllelicSite=TRUE);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- vcf.ref.file: the file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.
- candidateVar: The tabix range;
- knownVar: The known variant;
- alternative: The alternative hypothesis. Default is two.sided;
- col.impqual: The column number for the imputation quality score;
- impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;
- impQualWeight: Using imputation quality as weight;
- rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;
2.The out put should be as follows:
head(res$res.formatted))
POS REF ALT AF STAT PVALUE BETA SD N numStability
[1,] "9:97018619" "T" "C" "8.82e-05" "0.093" "0.76" "0.131" "0.428" "54235" "0"
- For demo data, please see
?rareGWAMA.cond.single
.
For more details, please see the SHOWCASE
1.The command should be like:
res.gene <- rareGWAMA.gene(score.stat.file, imp.qual.file=imp.qual.file, vcf.ref.file, refFileFormat="vcf.vbi", anno=anno, annoType=c('Nonsynonymous','Stop_Gain',"Essential_Splice_site"), rvtest='VT', ref.ancestry=ref.ancestry, trans.ethnic=TRUE, study.ancestry=study.ancestry, maf.cutoff=0.01);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- vcf.ref.file: The file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.
- anno: Annotation file or list;
- annoType: The annotation types you want to use;
- rvtest: The method you want to use (i.e. 'VT', 'BURDEN', 'SKAT');
- ref.ancestry: The ancestry information for each sample;
- study.ancestry: The ancestry information for each study;
- maf.cutoff: The cutoff for the MAF, could be 0.01, 0.05, 0.001, whatever you want;
2.The out put should be as follows:
GENE RANGE STAT P-VALUE MAF_CUTOFF NUM_VAR TOTAL_MAF POS_VAR N POS_SINGLE_MINP BETA_SINGLE_MINP SD_SINGLE_MINP
MATN2 8:97931370-98033638 2.19 0.353 0.009472856879857 3 0.0123 8:97961468_G/A,8:98018087_G/A,8:98021213_G/A 204783 8:98021213_G/A 0.01813449913
STK3 8:98767360-98767360 0.0894 0.765 0.00582326555223991 1 0.00582 8:98767360_C/T 235921 8:98767360_C/T 0.00505283659608632 0.0168969355386225
VPS13B 8:99121478-99853608 0.305 0.752 0.00630135839199519 2 0.0117 8:99778930_G/A,8:99820031_A/G 283684 8:99778930_G/A 0.0114882148633614 0.016
COX6C 8:99892015-99892015 0.349 0.555 0.00244114819976761 1 0.00244 8:99892015_G/A 231499 8:99892015_G/A -0.0159448806371386 0.0269803627402012
👉 All the score statistics files and imputation quality files should be tabix indexed.
Say if you have such files: study1.gz(score statistics files), study2.gz, study1.R2.gz(imputation quality files), study2.R2.gz
in your folder, and already added tabix to your $PATH
environment variable, you could use:
for file in study*;do g=`zcat $file | head -200 | grep -n CHROM | cut -f1 -d":"`; tabix -f -S $g -s 1 -b 2 -e 2 $file; done
to tabix each file.
If you use 1 RVTESTS, the output is ready to go:
CHROM POS REF ALT N_INFORMATIVE AF INFORMATIVE_ALT_AC CALL_RATE HWE_PVALUE N_REF N_HET N_ALT U_STAT SQRT_V_STAT ALT_EFFSIZE PVALUE
1 10177 A AC 2352 0.5 2352 1 0 0 2352 0 1.67496 2.51553 0.264695 0.505508
1 10235 T TA 2352 0 0 1 1 2352 0 0 -0.108472 0.207841 -2.51104 0.601742
1 10352 T TA 2352 0.5 2352 1 0 0 2352 0 0.665562 2.61389 0.0974122 0.799013
1 10539 C A 2352 0 0 1 1 2352 0 0 -0.00020902 0.0626305 -0.0532862 0.997337
An example file has the following format:
CHROM POS REF ALT Rsq
1 10177 A AC 0.00581
1 10235 T TA 0.00396
1 10352 T TA 0.00608
1 10539 C A 0.00154
1 10616 CCGCCGTTGCAAAGGCGCGCCG C 0.02085
1 10642 G A 0.00013reference allele
There are five columns, which are Chromosome #
, Position
, Reference allele
, Alternative allele
and R square quality
. The higher of R square quality
(the 5th column) value, the better of the genotype imputation quality.
The file names of the reference panel file.
You could download the files from 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502.
Also, you could subset the ranges you want by using tabix, with command looks like:
tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768
For more details, please see: How do I get a sub-section of a VCF file?;
chrom pos ref alt af anno gene
112207 1 32247432 G A 0.03425920 Nonsynonymous FAM167B
112208 1 32247598 A G 0.07993410 Exon FAM167B
112209 1 32247709 G T 0.00645166 Intron FAM167B
112210 1 32248405 G C 0.00290383 Nonsynonymous FAM167B
140157 1 41479019 C T 0.09314330 Utr3 EDN2
140158 1 41479369 C T 0.06422470 Utr3 EDN2
1.ref.ancestry.
You should have a original file (i.e. ref.ancestry.ori
) as:
head(ref.ancestry.ori)
fid ancestry study
1 samp1 HIM HCHS
2 samp2 HIM HCHS
3 samp3 HIM HCHS
4 samp4 HIM HCHS
5 samp5 HIM HCHS
6 samp6 HIM HCHS
Then, you use:
ref.ancestry <- cbind(ref.ancestry.ori[,1], paste(ref.ancestry.ori[,2], ref.ancestry.ori[,3], sep=","))
So, the final format should be a matrix:
[,1] [,2]
[1,] "samp1" "HIM,HCHS"
[2,] "samp2" "HIM,HCHS"
[3,] "samp3" "HIM,HCHS"
[4,] "samp4" "HIM,HCHS"
[5,] "samp5" "HIM,HCHS"
[6,] "samp6" "HIM,HCHS"
2.study.ancestry.
Just a vector, as
[1] "AACAC" "AMISH" "ARIC" "BAGS" "CFS" "CHS"
Questions and requests can be sent to Github issue page (link) or Dajiang Liu ([email protected]) and Fang Chen([email protected])
1: Xiaowei Zhan, Youna Hu, Bingshan Li, Goncalo R. Abecasis, and Dajiang J. Liu
RVTESTS: An Efficient and Comprehensive Tool for Rare Variant Association Analysis Using Sequence Data
Bioinformatics 2016 32: 1423-1426. doi:10.1093/bioinformatics/btw079 (PDF)