flashpcaR

FlashPCA performs fast principal component analysis (PCA) of single nucleotide polymorphism (SNP) data, similar to smartpca from EIGENSOFT (http://www.hsph.harvard.edu/alkes-price/software/) and shellfish (https://github.com/dandavison/shellfish). FlashPCA is based on the Spectra library (https://github.com/yixuan/spectra/).

Main features:

Fast: partial PCA (k = 20 dimensions) of 500,000 individuals with 100,000 SNPs in <6h using 2GB RAM
Scalable: memory requirements are bounded, scales to at least 1M individuals
Highly accurate results
Natively reads PLINK bed/bim/fam files
Easy to use; can be called entirely within R (package flashpcaR)

Installation

Install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("umr1283/flashpcaR")

Example

PCA

On a numeric matrix

data(hm3.chr1)
X <- scale2(hm3.chr1$bed)
dim(X)
f <- flashpca(X, ndim = 10, scale = "none")
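
For intuition about what a partial PCA of standardised genotypes produces, here is a minimal base-R sketch on simulated data. It is not flashpcaR's algorithm (the package relies on Spectra rather than a full decomposition). The simulated genotypes, the helper name standardise_genotypes, and the scaling by sqrt(2p(1-p)) are assumptions made for illustration only.

```r
set.seed(1)

# Simulated genotypes: n individuals x m SNPs, coded as 0/1/2 allele counts
n <- 50
m <- 200
maf <- runif(m, 0.1, 0.5)
G <- sapply(maf, function(p) rbinom(n, size = 2, prob = p))

# Hypothetical helper (not part of flashpcaR): centre each SNP by 2p and
# scale by sqrt(2p(1-p)), where p is the allele frequency in the sample
standardise_genotypes <- function(G) {
  p <- colMeans(G) / 2
  keep <- p > 0 & p < 1                 # drop monomorphic SNPs
  G <- G[, keep, drop = FALSE]
  p <- p[keep]
  sweep(sweep(G, 2, 2 * p), 2, sqrt(2 * p * (1 - p)), "/")
}

X <- standardise_genotypes(G)

# Truncated PCA via a singular value decomposition, keeping ndim components
ndim <- 5
s <- svd(X, nu = ndim, nv = ndim)
d <- s$d[seq_len(ndim)]

eigenvalues <- d^2 / (n - 1)            # variance along each component
loadings <- s$v                         # SNP weights, one column per component
projection <- s$u %*% diag(d)           # individual scores, n x ndim

dim(projection)
```

The scores are simply the standardised genotypes projected onto the loadings, so projection and X %*% loadings agree up to numerical error.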

On PLINK data

You can supply the path prefix of a PLINK dataset, without the file extension (the files themselves must have the extensions .bed/.bim/.fam, all lowercase):

(fn <- sub("\\.bed$", "", system.file("extdata", "data_chr1.bed", package = "flashpcaR")))
f <- flashpca(fn, ndim = 10)
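
Because the dataset is identified by a prefix rather than a single file, it can help to confirm that all three files are present before running the analysis. The sketch below uses only base R; the helper plink_files_present is hypothetical (not part of flashpcaR), and the demonstration runs on empty placeholder files in a temporary location.

```r
# Hypothetical helper: report which of <prefix>.bed/.bim/.fam exist
plink_files_present <- function(prefix) {
  paths <- paste0(prefix, c(".bed", ".bim", ".fam"))
  present <- file.exists(paths)
  names(present) <- c("bed", "bim", "fam")
  present
}

# Demonstration: create placeholder .bed and .bim files, but no .fam file
prefix <- tempfile("demo_chr1_")
invisible(file.create(paste0(prefix, c(".bed", ".bim"))))

status <- plink_files_present(prefix)
print(status)

# Stripping only a trailing ".bed" recovers the prefix from the .bed path
stripped <- sub("\\.bed$", "", paste0(prefix, ".bed"))
```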

UCCA (univariate canonical correlation analysis, i.e. ANOVA of each SNP on multiple phenotypes)

On a numeric matrix

Use HapMap3 genotypes, standardise them, simulate some phenotypes, and test each SNP for association with all phenotypes:

data(hm3.chr1)
X <- scale2(hm3.chr1$bed)
k <- 10
B <- matrix(rnorm(ncol(X) * k), ncol = k)
Y <- X %*% B + rnorm(nrow(X) * k)
f1 <- ucca(X, Y, standx = "none", standy = "sd")
head(f1$result)
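
To make the "ANOVA of each SNP on multiple phenotypes" idea concrete, the following base-R sketch regresses each SNP on all phenotypes and derives an F-test from the resulting R², whose square root is the canonical correlation between that single SNP and the phenotype matrix. This illustrates the statistical idea only, on simulated data; the helper per_snp_test and its output names are hypothetical and are not claimed to match the result returned by ucca.

```r
set.seed(2)

# Simulated data: n individuals, m SNP-like predictors, k phenotypes
n <- 200
m <- 20
k <- 3
X <- scale(matrix(rnorm(n * m), n, m))
B <- matrix(0, m, k)
B[1, ] <- 1                              # only SNP 1 affects the phenotypes
Y <- scale(X %*% B + matrix(rnorm(n * k), n, k))

# Hypothetical helper: test one SNP against all phenotypes jointly
per_snp_test <- function(x, Y) {
  n <- nrow(Y)
  k <- ncol(Y)
  r2 <- summary(lm(x ~ Y))$r.squared     # squared canonical correlation
  fstat <- (r2 / k) / ((1 - r2) / (n - k - 1))
  c(r = sqrt(r2),
    fstat = fstat,
    p = pf(fstat, k, n - k - 1, lower.tail = FALSE))
}

# One row per SNP: correlation, F statistic, p-value
res <- t(apply(X, 2, per_snp_test, Y = Y))
head(res)
```

In this simulation the first SNP is the only one with a true effect, so it should have by far the smallest p-value.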

On PLINK data

(fn <- sub("\\.bed$", "", system.file("extdata", "data_chr1.bed", package = "flashpcaR")))
f2 <- ucca(fn, Y, standx = "binom2", standy = "sd")
head(f2$result)

Sparse Canonical Correlation Analysis (SCCA)

On a numeric matrix

Use HapMap3 genotypes, standardise them, simulate some phenotypes, and run sparse canonical correlation analysis over all SNPs and all phenotypes:

data(hm3.chr1)
X <- scale2(hm3.chr1$bed)
k <- 10
B <- matrix(rnorm(ncol(X) * k), ncol = k)
Y <- X %*% B + rnorm(nrow(X) * k)
f1 <- scca(X, Y, standx = "none", standy = "sd", lambda1 = 1e-2, lambda2 = 1e-3)
diag(cor(f1$Px, f1$Py))

# 3-fold cross-validation
cv1 <- cv.scca(
   X, Y,
   standx = "sd",
   standy = "sd",
   lambda1 = seq(1e-3, 1e-1, length = 10),
   lambda2 = seq(1e-6, 1e-3, length = 5),
   ndim = 3,
   nfolds = 3
)

# Plot the canonical correlations over the penalties, for the 1st dimension
plot(cv1, dim = 1)
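
To show how the two penalties induce sparsity, here is a minimal base-R sketch of one common formulation of sparse CCA: alternating soft-thresholded power iterations on the cross-covariance matrix, for a single pair of canonical vectors. Whether scca uses exactly these updates is not confirmed here; the function names, the simulated data, and the penalty values are assumptions chosen for illustration.

```r
set.seed(3)

# Simulated data: only the first 5 of m predictors affect the k phenotypes
n <- 500
m <- 50
k <- 4
X <- scale(matrix(rnorm(n * m), n, m))
B <- matrix(0, m, k)
B[1:5, ] <- 1
Y <- scale(X %*% B + matrix(rnorm(n * k), n, k))

# Shrink towards zero; entries smaller than lambda become exactly zero
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

# Rescale to unit Euclidean norm (all-zero vectors are left unchanged)
unit_norm <- function(v) {
  s <- sqrt(sum(v^2))
  if (s > 0) v / s else v
}

# Hypothetical rank-1 sparse CCA: lambda1 penalises the X weights (u),
# lambda2 penalises the Y weights (v)
sparse_cca_rank1 <- function(X, Y, lambda1, lambda2, iters = 100) {
  K <- crossprod(X, Y) / (nrow(X) - 1)   # m x k cross-covariance matrix
  v <- svd(K, nu = 0, nv = 1)$v[, 1]     # start from the leading singular vector
  for (i in seq_len(iters)) {
    u <- unit_norm(soft_threshold(K %*% v, lambda1))
    v <- unit_norm(soft_threshold(crossprod(K, u), lambda2))
  }
  list(u = drop(u), v = drop(v), Px = X %*% u, Py = Y %*% v)
}

res <- sparse_cca_rank1(X, Y, lambda1 = 0.4, lambda2 = 0.01)

which(res$u != 0)                        # predictors retained by the penalty
drop(cor(res$Px, res$Py))                # canonical correlation of the projections
```

Larger penalties zero out more weights, which is why cross-validating over a grid of lambda1 and lambda2 values is the usual way to choose them.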

On PLINK data

fn <- sub("\\.bed$", "", system.file("extdata", "data_chr1.bed", package = "flashpcaR"))
fn
f2 <- scca(fn, Y, standx = "binom2", standy = "sd", lambda1 = 1e-2, lambda2 = 1e-3)
diag(cor(f2$Px, f2$Py))
# Cross-validation isn't yet supported for PLINK data

Help

Google Groups: https://groups.google.com/forum/#!forum/flashpca-users

Contact

Gad Abraham, [email protected]

Citation

version ≥2: G. Abraham, Y. Qiu, and M. Inouye, "FlashPCA2: principal component analysis of biobank-scale genotype datasets", (2017) Bioinformatics 33(17): 2776-2778. doi:10.1093/bioinformatics/btx299 (bioRxiv preprint https://doi.org/10.1101/094714)

version ≤1.2.6: G. Abraham and M. Inouye, "Fast Principal Component Analysis of Large-Scale Genome-Wide Data", (2014) PLOS ONE 9(4): e93766. doi:10.1371/journal.pone.0093766

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
