new questions added

mmosmond · Oct 30, 2015 · 991d8e5 · 991d8e5
1 parent 28410b7
commit 991d8e5

Showing 11 changed files with 495 additions and 221 deletions.
diff --git a/chapter-01.tex b/chapter-01.tex
diff --git a/chapter-02.tex b/chapter-02.tex
@@ -380,142 +380,6 @@ \subsection{Neutral diversity and population structure}
 island. Therefore, considering our island as our sub-population, we have
 derived another simple model of $F_{ST}$.
 
-\subsection{Other approaches to population structure}
-There is a broad spectrum of methods to describe patterns of
-population structure in population genetic datasets. We'll briefly
-discuss two broad classes of methods, assignment methods and principal
-components analysis, that appear often in the literature.
-
-\subsubsection{Assignment Methods}
-
-Here we'll describe a simple probabilistic assignment method to find the
-probability that an individual of unknown population comes from one of
-$K$ predefined populations. We'll then briefly explain how to extend this
-to cluster individuals into $K$ initially unknown populations. This
-method is a simplified version of what model-based population genetic
-clustering algorithms such as STRUCTURE and ADMIXTURE do (Pritchard et al. Genetics 2000).
-
-\paragraph{A simple assignment method}
-
-We have genotype data from $L$ unlinked bi-allelic loci for $K$ populations. The allele frequency of allele $A_1$ at locus $l$ in population $k$ is denoted by $p_{k,l}$, so that the allele frequencies in population 1 are $p_{1,1},\cdots, p_{1,L}$, those in population 2 are $p_{2,1},\cdots, p_{2,L}$, and so on.
-
-You type a new individual from an unknown population at these $L$ loci. This individual's genotype at locus $l$ is $g_l$, where $g_l$ denotes the number of copies of allele $A_1$ this individual carries at this locus ($g_l=0,1,2$). 
-
-The probability of this individual's genotype at locus $l$ conditional on coming from population $k$ (i.e. their alleles being a random Hardy--Weinberg (HW) draw from population $k$) is
-\begin{equation}
-P(g_l | \textrm{pop k}) = I(g_l=0) (1-p_{k,l})^2 +  I(g_l=1) 2 p_{k,l} (1-p_{k,l}) + I(g_l=2) p_{k,l}^2
-\end{equation}
-where $I(g_l=0)$ is an indicator function which is $1$ if $g_l=0$ and
-zero otherwise, and likewise for the other indicator functions. This
-follows simply from HWE.
-
-Assuming that the loci are independent, the probability of the individual's genotypes conditional on them coming from population $k$ is
-\begin{equation}
-P(\textrm{ind.} | \textrm{pop k})  = \prod_{l=1}^L P(g_l | \textrm{pop k}) \label{eqn_assignment}
-\end{equation}
-
-
-We wish to know the probability that this new individual comes from population $k$, i.e. $P(\textrm{pop k} | \textrm{ind.})$. We can obtain this through Bayes' rule
-\begin{equation}
- P(\textrm{pop k} | \textrm{ind.})  = \frac{P(\textrm{ind.} | \textrm{pop k}) P(\textrm{pop k})}{P(\textrm{ind.})}
-\end{equation}
-where 
-\begin{equation}
-P(\textrm{ind.}) = \sum_{k=1}^K  P(\textrm{ind.} | \textrm{pop k}) P(\textrm{pop k})
-\end{equation}
-is the normalizing constant. We interpret $P(\textrm{pop k})$ as the
-prior probability of the individual coming from population $k$. Unless
-we have some other prior knowledge, we will assume that the new individual has an equal probability of coming from each population, $P(\textrm{pop k})=1/K$.
-
-We interpret
-\begin{equation}
- P(\textrm{pop k} | \textrm{ind.})
-\end{equation}
-as the posterior probability that our new individual comes from population $k$, for each of our $k=1,\cdots, K$ populations.
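-
-As an illustrative sketch with hypothetical numbers (not taken from any dataset), suppose $K=2$, a single locus with $p_{1,1}=0.2$ and $p_{2,1}=0.7$, equal priors, and an individual with $g_1=2$. Then $P(\textrm{ind.} | \textrm{pop 1}) = 0.2^2 = 0.04$ and $P(\textrm{ind.} | \textrm{pop 2}) = 0.7^2 = 0.49$, so that
-\begin{equation}
-P(\textrm{pop 1} | \textrm{ind.}) = \frac{0.04 \times 1/2}{0.04 \times 1/2 + 0.49 \times 1/2} = \frac{0.04}{0.53} \approx 0.075
-\end{equation}
-and $P(\textrm{pop 2} | \textrm{ind.}) \approx 0.925$. With equal priors the $1/K$ terms cancel, so the posterior is just the likelihood under each population divided by the sum of the likelihoods.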
-
-More sophisticated versions of this are now used to allow for hybrids,
-e.g., we can let a proportion $q_k$ of our individual's genome come
-from population $k$ and estimate the set of $q_k$'s.
-
-{\bf Question.} We have two populations where the frequency of allele
-$A_1$ at two SNPs ($A_1/A_2$) is given by
-\begin{center}
-\begin{tabular}{|ccc|}
-\hline
-Population & locus 1 & locus 2 \\
-\hline
-1 & $0.1$ & $0.85$ \\
-2 & $0.95$ & $0.2$ \\
-\hline
-\end{tabular}
-\end{center}
-We sample an individual whose genotype is $A_1A_1$ at the first locus
-and $A_2A_2$ at the second. What
-is the probability that our individual comes from population 1 vs
-population 2?
-Let's assume that with probability $q_1$ our individual draws an allele
-from population $1$ and that with probability $q_2=1-q_1$ they draw an allele from
-population $2$. What is the probability of our individual's genotype
-given $q_1$? Plot this probability as a function of $q_1$. How does
-your plot change if our individual is heterozygous at both loci?
-
-
-\paragraph{Clustering based on assignment methods}
-While it is great to be able to assign our individuals to a particular
-population, these ideas can be pushed further to learn how best to
-describe our genotype data in terms of discrete populations without
-assigning any of our individuals to populations {\it a priori}.
-We wish to cluster our individuals into $K$ unknown populations. We begin by assigning our individuals at random to these $K$ populations.
-\begin{enumerate}
-\item Given these assignments we estimate the allele frequencies at all of our loci in each population.
-\item Given these allele frequencies we reassign each individual to a population $k$ with probability proportional to eqn.~(\ref{eqn_assignment}).
-\end{enumerate}
-We iterate steps 1 and 2 for many iterations. If the data are sufficiently informative the assignments and allele frequencies will quickly converge.
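-
-One way to write the two steps above is sketched here. The symbols $z_i$ (the current population assignment of individual $i$), $n_k$ (the number of individuals currently assigned to population $k$, assumed to be greater than zero) and $g_{i,l}$ (the genotype of individual $i$ at locus $l$) are notation introduced for illustration, and the simple frequency estimator is an assumption rather than the only possible choice. In step 1 we could take
-\begin{equation}
-\hat{p}_{k,l} = \frac{1}{2 n_k} \sum_{i:\, z_i = k} g_{i,l}
-\end{equation}
-and in step 2, with equal prior probabilities for each population, draw a new assignment for individual $i$ with
-\begin{equation}
-P(z_i = k) = \frac{P(\textrm{ind. } i | \textrm{pop k})}{\sum_{k'=1}^K P(\textrm{ind. } i | \textrm{pop k'})}
-\end{equation}
-where each $P(\textrm{ind. } i | \textrm{pop k})$ is computed from eqn.~(\ref{eqn_assignment}) using the current estimates $\hat{p}_{k,l}$.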
-
-To do this in a full Bayesian scheme we need to place priors on the
-allele frequencies (e.g. a beta distribution). Technically, iterating these
-steps then amounts to sampling from the joint posterior of our allele frequencies and assignments.
-
-\subsubsection{Principal components analysis}
-The use of principal component analysis (PCA) in population genetics was
-pioneered by Cavalli-Sforza. With large genotyping datasets PCA has made
-a comeback. See Patterson et al 2006, PLoS Genetics and McVean,
-G. 2010 PLoS Genetics for recent discussion.
-
-Consider a dataset consisting of $N$ individuals at $S$ bi-allelic
-SNPs. The $i^{th}$ individual's genotype data at locus $\ell$ takes
-value $g_{i,\ell}=0$, $1$, or $2$ (corresponding to the number of copies of
-allele $A_1$ an individual carries at this SNP). We can think of this
-as an $N \times S$ matrix (where usually $N \ll S$).
-
-Denoting the sample mean allele frequency at SNP $\ell$ by $p_{\ell}$ we usually standardize the genotype in the following way
-\begin{equation}
-\frac{g_{i,\ell} - 2 p_{\ell}}{\sqrt{p_{\ell}(1-p_{\ell})}}
-\end{equation}
-i.e. at each SNP we center the genotypes by subtracting the mean
-genotype ($2p_{\ell}$) and divide through by the square root of the expected
-variance assuming that alleles are sampled binomially from the mean
-frequency ($\sqrt{p_{\ell} (1-p_{\ell})}$). Doing this to
-all of our genotypes we form a data matrix (of dimension $N \times S$). We
-can then perform principal components analysis of this data matrix to
-uncover the major axes of genotype variance in our sample.
-
-It is worth taking a moment to delve further into what we are doing
-here. There are a number of equivalent ways of thinking about what PCA
-is doing. One of these is to think that when we do PCA we are building the individual by individual
-covariance matrix and performing an eigenvalue decomposition of this
-matrix (with the eigenvectors giving the PCs). This individual by individual covariance matrix has
-its $(i,~j)^{th}$ entry given by
-\begin{equation}
-\sum_{\ell=1}^S \frac{(g_{i,\ell} - 2p_{\ell})(g_{j,\ell} - 2p_{\ell})}{p_{\ell}(1-p_{\ell})}
-\end{equation}
-Note that this covariance is very similar to those we
-encountered in discussing $F$-statistics as correlations (equation
-\eqref{eqn:Fascorr}), except now we are asking about the allelic covariance
-between two individuals above that expected if they were both drawn
-from the total sample at random (rather than the covariance of alleles
-within a single individual). So by performing PCA on the data we are
-learning about the major (orthogonal) axes of the kinship matrix.
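-
-In matrix notation this can be sketched as follows; the symbols $X$, $V$ and $\Lambda$ are introduced here for illustration only. Let $X$ be the $N \times S$ matrix of standardized genotypes, with entries $X_{i,\ell} = (g_{i,\ell} - 2p_{\ell})/\sqrt{p_{\ell}(1-p_{\ell})}$. The individual by individual matrix above is then $X X^{T}$, since
-\begin{equation}
-(X X^{T})_{i,j} = \sum_{\ell=1}^S X_{i,\ell} X_{j,\ell}
-\end{equation}
-and its eigenvalue decomposition can be written as
-\begin{equation}
-X X^{T} = V \Lambda V^{T}
-\end{equation}
-where the columns of $V$ are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. Up to scaling, the $i^{th}$ entry of the first column of $V$ gives the position of individual $i$ on the first principal component, and likewise for the later columns.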
 
 \newpage