Doc update: fix menu header on two pages, fix computing_pcs page

FossilOrigin-Name: 41819c4243e08d6f2885087caf1243b471657e6c1d42cf606b59f66aa4e62973
gavinband · Nov 9, 2017 · 9852ae9 · 9852ae9
1 parent 5e04895
commit 9852ae9
Show file tree

Hide file tree

Showing 3 changed files with 140 additions and 122 deletions.
diff --git a/doc/html/qctool/documentation/examples/combining.html b/doc/html/qctool/documentation/examples/combining.html
@@ -32,86 +32,88 @@
 
 		<section class="main_text">
 			<h2>Combining datasets</h2>
-				<div class="task">
-					<div class="task_name">
-						Combining several datasets into a larger one
-					</div>
-					<div class="task_notes">
-						QCTOOL's <code>-g</code> and <code>-s</code> options can be specified several times, with the effect of combining
-						data into one larger dataset.  E.g.:
-					</div>
-					<div class="task_command_line">
-						$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample
-					</div>
-					<div class="task_notes">
-						<p>
-							QCTOOL will produce both a combined genotype file and a combined sample file.
-							The output files
-							will have <em>N<sub>1</sub>+N<sub>2</sub></em> samples, where <em>N<sub>1</sub></em> and <em>N<sub>2</sub></em>
-							are the numbers of samples in the two input datasets, and it will contain all variants that can be matched between
-							the input datasets.
-						</p>
-						<p>
-							QCTOOL attempts to merge the columns of the sample files based on column name and type.
-							Columns are dropped if they have the same name but different types.	 Otherwise, each column
-							in the input sample files appears in the output sample file, possibly with missing values for those cohorts
-							for which that column is not present.
-						</p>
-						<p>
-							QCTOOL attempts to combine data at each variant that is present in all input datasets.  To do this,
-							it makes the assumption that <em>data is sorted uniquely and in the same way in all input datasets</em>.
-							If this is not the case then QCTOOL may be unable to match some variants.  Any variant that does not match
-							between datasets will be omitted from the output.
-						</p>
-					</div>
-					<div class="task_name">
-						Controlling how variants are matched
-					</div>
-					<div class="task_notes">
-						<p>
-							The <code>-compare-variants-by</code> option can be used to
-							control what fields QCTOOL compares when matching variants.  (The default behaviour is to match variants by the
-							genomic position, alleles, and ID fields).  For example, in the command:
-						</p>
-					</div>
-					<div class="task_command_line">
-						$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample -compare-variants-by position,alleles
-					</div>
-					<div class="task_notes">
+			<div class="task">
+				<div class="task_name">
+					Combining several datasets into a larger one
+				</div>
+				<div class="task_notes">
+					QCTOOL's <code>-g</code> and <code>-s</code> options can be specified several times, with the effect of combining
+					data into one larger dataset.  E.g.:
+				</div>
+				<div class="task_command_line">
+					$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample
+				</div>
+				<div class="task_notes">
+					<p>
+						QCTOOL will produce both a combined genotype file and a combined sample file.
+						The output files
+						will have <em>N<sub>1</sub>+N<sub>2</sub></em> samples, where <em>N<sub>1</sub></em> and <em>N<sub>2</sub></em>
+						are the numbers of samples in the two input datasets, and it will contain all variants that can be matched between
+						the input datasets.
+					</p>
+					<p>
+						QCTOOL attempts to merge the columns of the sample files based on column name and type.
+						Columns are dropped if they have the same name but different types.	 Otherwise, each column
+						in the input sample files appears in the output sample file, possibly with missing values for those cohorts
+						for which that column is not present.
+					</p>
 					<p>
-						QCTOOL will match variants by position and alleles (variants will match even if ID fields differ).  
-						As in the example, <code>position</code> (the genomic position) must always be the first field matched on.
+						QCTOOL attempts to combine data at each variant that is present in all input datasets.  To do this,
+						it makes the assumption that <em>data is sorted uniquely and in the same way in all input datasets</em>.
+						If this is not the case then QCTOOL may be unable to match some variants.  Any variant that does not match
+						between datasets will be omitted from the output.
 					</p>
-					<p><b>Note:</b> careful use of these options is especially important when datasets contain multiple variants 
-						sharing the same genomic position.  The recommendation is to ensure all datasets are encoded
-						uniformly before combining them.  (See the pages on <a href="sorting.html">sorting</a>, <a href="aligning.html">aligning alleles</a>, and
-						<a href="manipulating.html">manipulating metadata<a> for options that can help with this.)
+				</div>
+				<div class="task_name">
+					Controlling how variants are matched
+				</div>
+				<div class="task_notes">
+					<p>
+						The <code>-compare-variants-by</code> option can be used to
+						control what fields QCTOOL compares when matching variants.  (The default behaviour is to match variants by the
+						genomic position, alleles, and ID fields).  For example, in the command:
 					</p>
 				</div>
-					<div class="task_name">
-						Allowing for allele mismatches
-					</div>
-					<div class="task_notes">
+				<div class="task_command_line">
+					$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample -compare-variants-by position,alleles
+				</div>
+				<div class="task_notes">
+				<p>
+					QCTOOL will match variants by position and alleles (variants will match even if ID fields differ).  
+					As in the example, <code>position</code> (the genomic position) must always be the first field matched on.
+				</p>
+				<p><b>Note:</b> careful use of these options is especially important when datasets contain multiple variants 
+					sharing the same genomic position.  The recommendation is to ensure all datasets are encoded
+					uniformly before combining them.  (See the pages on <a href="sorting.html">sorting</a>,
+					<a href="aligning.html">aligning alleles</a>, and
+					<a href="manipulating.html">manipulating metadata</a> for options that can help with this.)
+				</p>
+			</div>
+				<div class="task_name">
+					Allowing for allele mismatches
+				</div>
+				<div class="task_notes">
+				<p>
+					It is sometimes the case that data is sorted in each dataset but that alleles are mixed up.
+					The <code>-match-alleles-to-cohort1</code> option tells QCTOOL to attempt to match data allowing for this
+					type of mismatch:
+				</p>
+			</div>
+				<div class="task_command_line">
+					$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample
+					-og joined.gen -os joined.sample -compare-variants-by position,alleles -match-alleles-to-cohort1
+				</div>
+				<div class="task_notes">
 					<p>
-						It is sometimes the case that data is sorted in each dataset but that alleles are mixed up.
-						The <code>-match-alleles-to-cohort1</code> option tells QCTOOL to attempt to match data allowing for this
-						type of mismatch:
+					  Here, if the variant is biallelic and the alleles in the second dataset are the same as those
+						in the first cohort, but coded the other way round (e.g. cohort1 = A/G, cohort2 = G/A), then the
+						alleles and genotypes are flipped accordingly.  No other transformation (e.g. strand flips, or matching multiallelics)
+						is performed by this operation.  Thus, although this option can be convenient, the general recommendation
+						is to arrange each dataset to be encoded and sorted unformly before combining.
 					</p>
 				</div>
-					<div class="task_command_line">
-						$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample
-						-og joined.gen -os joined.sample -compare-variants-by position,alleles -match-alleles-to-cohort1
-					</div>
-					<div class="task_notes">
-						<p>
-						  Here, if the variant is biallelic and the alleles in the second dataset are the same as those
-							in the first cohort, but coded the other way round (e.g. cohort1 = A/G, cohort2 = G/A), then the
-							alleles and genotypes are flipped accordingly.  No other transformation (e.g. strand flips, or matching multiallelics)
-							is performed by this operation.  Thus, although this option can be convenient, the general recommendation
-							is to arrange each dataset to be encoded and sorted unformly before combining.
-						</p>
-					</div>
-				</section>
+			</div>
+		</section>
 
 		<nav class="button_bar">
 			<div class="nav_button" name="overview">

diff --git a/doc/html/qctool/documentation/examples/computing_pcs.html b/doc/html/qctool/documentation/examples/computing_pcs.html
@@ -32,71 +32,78 @@
 
 		<section class="main_text">
 			<h2>Computing principal components</h2>
-			<p>
-				Suppose <em>X</em> is a matrix of genotypes at a set of <em>L</em> biallelic
-				variants (rows) for a collection of <em>N</em> samples (columns).
-				We are interested in finding the eigenvectors of <em>1/N X<sup>t</sup>X</em>
-				(the principal components)
-				and of <em>1/L X X<sup>t</sup></em> (the 'loadings').
-				The two are related by the singular value decomposition, which says that
-				<em>X = U D V<sup>t</sup></em>
-				where the columns of V<sup>t</sup> and the rows of U are the required eigenvectors, and <em>D</em>
-				is a diagonal matrix.
-			</p>
-			<p>
-				In practice principal components analysis (PCA) is usually carried out on the
-				matrix of genotypes scaled by frequency, i.e.
-				<em>x<sub>ij</sub> = g<sub>ij</sub> / &radic; f(1-f)</em>.  One reason for this is that
-				(in simple theoretical situations) the rate of genetic drift is proportional
-				to <em>&radic;f(1-f)</em>, where <em>f</em> is the ancestral allele frequency.
-				If PCA is carried out on a sample from a set of diverged populations, the overall
-				allele frequency can be though of as approximating the ancestral frequency, and the
-				principal components will measure genetic drift between samples.
-			</p>
-			<p>
-				The matrix <em>Z = 1/N X<sup>t</sup>X</em> can be thought of as a 'relatedness' matrix and is
-				useful in its own right.  In simple theoretical scenarious, with the scaling <em>&radic;f(1-f)</em>,
-				the matrix <em>Z</em> is an estimate of the 
-				matrix of <a href="https://en.wikipedia.org/wiki/Coefficient_of_relationship">kinship
-				coefficients</a>.  In practical situations <em>Z</em> is best thought of as providing a measure
-				of relatedness relative to the sample; it can nevertheless be useful in identifying
-				close relationships.
-			</p>
-			<p>
-				QCTOOL computes principal components by solving for eigenvectors of <em>1/N X<sup>t</sup>X</em>.
-				It computes loadings by projecting genotypes at each variant onto
-				those eigenvectors, namely as <em>V<sup>t</sup></em> = U<sup>-1</sup> D<sup>-1</sup> X</em>.
-			</p>
-
+				<div class="task">
+					<div class="task_name">
+						Computing PCs
+					</div>
+					<div class="task_notes">
+						QCTOOL computes PCs by first estimating a relatedness or 'kinship' matrix, and then forming the eigendecomposition. 
+						In short you use the <code>-kinship</code> option to compute a
+						relatedness matrix, the <code>-UDUT</code> option to eigendecompose it, and the <code>-PCs</code> option
+						to output PCs.  A complete example would look like this:
+					</div>
+					<div class="task_command_line">
+						$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv
+					</div>
+					<div class="task_notes">
+						<p>
+							This outputs the first 20 PCs to the file <code>PCs.csv</code>, in addition to the estimated kinship
+							matrix and its eigendecomposition.  The following sections show the use of these options in more detail.
+					</p>
+					</div>
+				</div>
 				<div class="task">
 					<div class="task_name">
 						Computing a kinship matrix
 					</div>
 					<div class="task_notes">
-						The <code>-kinship</code> option can be used to compute a kinship matrix, e.g.:
+						The <code>-kinship</code> option can be used to estimate a kinship matrix, as in:
 					</div>
 					<div class="task_command_line">
 						$ qctool -g example.bgen -s example.sample -kinship kinship.csv
 					</div>
 					<div class="task_notes">
-						This outputs pairwise kinship values to the file <code>kinship.csv</code>, which is stored in a 'long' format
-						with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes,
-						and the estimated kinship value.
+						<p>
+							This outputs pairwise kinship values to the file <code>kinship.csv</code>, which is stored in a 'long' format
+							with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes,
+							and the estimated kinship value.  (Only the upper triangle of this matrix is output).
+						</p>
+						<p>
+							More precisely,  Suppose <em>X</em> is the <em>L&times;N</em> matrix of
+							genotypes, with variants indexed by row.
+							Let <em>f<sub>i</sub></em> be an estimate of the frequency of the <em>i</em>th variant.
+							We write <em>Z</em> for the matrix <em>X</em> after centring and rescaling each row based on the allele frequency,
+						</p>
+						<p>
+							<em>Z<sub>i&middot;</sub> = (X<sub>i&middot;</sub> - mean(X<sub>i&middot;</sub>)) / &Sqrt; (2 f<sub>i</sub> (1-f<sub>i</sub>))</em>
+						</p>
+						<p>
+							QCTOOL estimates the kinship matrix as <em>1/L Z^t Z</em>.
+							In forming <em>Z</em>, QCTOOL uses a posterior estimate of allele
+							frequency <em>f<sub>i</sub></em> under a <em>Beta(2,2)</em>
+							distribution, i.e. <em>f<sub>i</sub> = (1+N<sub>b</sub>)/(2+2N))</em> where <em>N<sub>b</sub></em> is
+							the count number of 'b' alleles in the data.  This can be understood as implicitly adding
+							a single haplotype of each allelic type to the data before
+							computing the frequency, which in turn ensures that the frequency estimate is not zero or 1.  
+						</p>
 					</div>
 				</div>
 				<div class="task">
 					<div class="task_name">
 						Eigendecomposing a kinship matrix and computing PCs
 					</div>
 					<div class="task_notes">
-						The <code>-UDUT</code> option can be used to compute a UDUT decomposition of the computed kinship matrix.
+						The <code>-UDUT</code> option can be used to compute a UDUT decomposition (i.e. an eigendecomposition)
+						of the computed kinship matrix.
 							E.g.
 					</div>
 					<div class="task_command_line">
 						$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv
 					</div>
 					<div class="task_notes">
-						This computes a UDUT decomposition of the kinship matrix.  To additionally output principal components (PCs),
+						The output is an <em>N&times;(N+1)</em> matrix in which the first column represents the diagonal elements
+						of <em>D</em>, i.e. the eigenvalues, and the following <em>N</em> columns are the right eigenvectors
+						(i.e. the columns of <em>U</em>).  To additionally output principal components (PCs),
 						additionally add the <code>-PCs</code> option:
 					</div>
 					<div class="task_command_line">
@@ -106,9 +113,15 @@ <h2>Computing principal components</h2>
 					<p>
 						The argument is the number of PCs to output.
 					</p>
+					<p>
+						<b>Note</b>: the PCs computed are simply rescaled entries of the right eigenvectors;
+						they are computed as <em>PC<sub>i</sub></em> = &Sqrt;(1/L) &times; U<sub>&middot;i</sub> D<sup>-1/2</sup></em>.
+						This scaling ensures the PCs do not grow with the number of variants.
+					</p>
 					<p>
 						<b>Note:</b> PCs are output to the file specified by <code>-osample</code>.
-						Specifying both <code>-sample-stats</code> and <code>-PCs</code> leads to a file that contains both
+						Depending on the command line, other values might also be output to this file.  For example, 
+						if you specify both <code>-sample-stats</code> and <code>-PCs</code>, the output file will contain both
 						per-sample summary statistics and PCs.
 						See the <a href="../../documentation/summary_file_formats.html">page
 							on summary statistic file formats</a> for more information on the format of the output.
@@ -136,9 +149,9 @@ <h2>Computing principal components</h2>
 						$ qctool -g example.bgen -s example.sample -load-udut udut.csv -loadings loadings.csv
 					</div>
 					<div class="task_notes">
-						The <code>-nPCs</code> option can again be used to adjust how many loadings are computed.
-						(You should ensure the same set of variants is used to compute
-						loadings as were used in constructing the kinship matrix).
+						The <code>-PCs</code> option can again be used to adjust how many loadings are computed.
+						<b>Note</b>: you should ensure the same set of variants is used to compute
+						loadings as were used in constructing the kinship matrix.
 					</div>
 				</div>
 
@@ -155,6 +168,7 @@ <h2>Computing principal components</h2>
 					</div>
 					<div class="task_notes">
 						The first argument is a set of loadings to load, while the second argument is the output file to compute.
+						Again, we advise using the same set of variants that were used to compute the PCs.
 					</div>
 				</div>
 		</section>

diff --git a/doc/html/qctool/documentation/examples/pipelining.html b/doc/html/qctool/documentation/examples/pipelining.html
@@ -77,15 +77,17 @@ <h2>Using qctool in a pipeline</h2>
 					</div>
 					<div class="task_notes">
 						As an example, the following command uses <a href="http://bitbucket.org/gavinband/bgen">bgenix</a> with QCTOOL to
-						compute snp summary statistics from a subset of a BGEN file, and view the result on the fly using <a href="https://en.wikipedia.org/wiki/Less_(Unix)"><code>less</code><a>.
+						compute snp summary statistics from a subset of a BGEN file, and view the result on the fly using
+						<a href="https://en.wikipedia.org/wiki/Less_(Unix)"><code>less</code></a>.
 					</div>
 					<div class="task_command_line">
 						$ bgenix -g file.bgen -range 11:3500000-6500000 | qctool -g - -filetype bgen -snp-stats -osnp - | less -S
 					</div>
 				</div>
 		</section>
 
-		<nav class="button_bar">			<div class="nav_button" name="overview">
+		<nav class="button_bar">
+			<div class="nav_button" name="overview">
 				<a href="../../index.html">overview</a>
 			</div>
 			<div class="nav_button" name="documentation">
@@ -125,7 +127,7 @@ <h2>Using qctool in a pipeline</h2>
 			<div class="nav_button" name="download">
 				<a href="../../documentation/download.html">download</a>
 			</div>
-</nav>
+		</nav>
 
 	</body>
 </html>