Skip to content

Commit

Permalink
Doc update: fix menu header on two pages, fix computing_pcs page
Browse files Browse the repository at this point in the history
FossilOrigin-Name: 41819c4243e08d6f2885087caf1243b471657e6c1d42cf606b59f66aa4e62973
  • Loading branch information
[email protected] committed Nov 9, 2017
1 parent 5e04895 commit 9852ae9
Show file tree
Hide file tree
Showing 3 changed files with 140 additions and 122 deletions.
148 changes: 75 additions & 73 deletions doc/html/qctool/documentation/examples/combining.html
Original file line number Diff line number Diff line change
Expand Up @@ -32,86 +32,88 @@

<section class="main_text">
<h2>Combining datasets</h2>
<div class="task">
<div class="task_name">
Combining several datasets into a larger one
</div>
<div class="task_notes">
QCTOOL's <code>-g</code> and <code>-s</code> options can be specified several times, with the effect of combining
data into one larger dataset. E.g.:
</div>
<div class="task_command_line">
$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample
</div>
<div class="task_notes">
<p>
QCTOOL will produce both a combined genotype file and a combined sample file.
The output files
will have <em>N<sub>1</sub>+N<sub>2</sub></em> samples, where <em>N<sub>1</sub></em> and <em>N<sub>2</sub></em>
are the numbers of samples in the two input datasets, and it will contain all variants that can be matched between
the input datasets.
</p>
<p>
QCTOOL attempts to merge the columns of the sample files based on column name and type.
Columns are dropped if they have the same name but different types. Otherwise, each column
in the input sample files appears in the output sample file, possibly with missing values for those cohorts
for which that column is not present.
</p>
<p>
QCTOOL attempts to combine data at each variant that is present in all input datasets. To do this,
it makes the assumption that <em>data is sorted uniquely and in the same way in all input datasets</em>.
If this is not the case then QCTOOL may be unable to match some variants. Any variant that does not match
between datasets will be omitted from the output.
</p>
</div>
<div class="task_name">
Controlling how variants are matched
</div>
<div class="task_notes">
<p>
The <code>-compare-variants-by</code> option can be used to
control what fields QCTOOL compares when matching variants. (The default behaviour is to match variants by the
genomic position, alleles, and ID fields). For example, in the command:
</p>
</div>
<div class="task_command_line">
$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample -compare-variants-by position,alleles
</div>
<div class="task_notes">
<div class="task">
<div class="task_name">
Combining several datasets into a larger one
</div>
<div class="task_notes">
QCTOOL's <code>-g</code> and <code>-s</code> options can be specified several times, with the effect of combining
data into one larger dataset. E.g.:
</div>
<div class="task_command_line">
$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample
</div>
<div class="task_notes">
<p>
QCTOOL will produce both a combined genotype file and a combined sample file.
The output files
will have <em>N<sub>1</sub>+N<sub>2</sub></em> samples, where <em>N<sub>1</sub></em> and <em>N<sub>2</sub></em>
are the numbers of samples in the two input datasets, and it will contain all variants that can be matched between
the input datasets.
</p>
<p>
QCTOOL attempts to merge the columns of the sample files based on column name and type.
Columns are dropped if they have the same name but different types. Otherwise, each column
in the input sample files appears in the output sample file, possibly with missing values for those cohorts
for which that column is not present.
</p>
<p>
QCTOOL will match variants by position and alleles (variants will match even if ID fields differ).
As in the example, <code>position</code> (the genomic position) must always be the first field matched on.
QCTOOL attempts to combine data at each variant that is present in all input datasets. To do this,
it makes the assumption that <em>data is sorted uniquely and in the same way in all input datasets</em>.
If this is not the case then QCTOOL may be unable to match some variants. Any variant that does not match
between datasets will be omitted from the output.
</p>
<p><b>Note:</b> careful use of these options is especially important when datasets contain multiple variants
sharing the same genomic position. The recommendation is to ensure all datasets are encoded
uniformly before combining them. (See the pages on <a href="sorting.html">sorting</a>, <a href="aligning.html">aligning alleles</a>, and
<a href="manipulating.html">manipulating metadata<a> for options that can help with this.)
</div>
<div class="task_name">
Controlling how variants are matched
</div>
<div class="task_notes">
<p>
The <code>-compare-variants-by</code> option can be used to
control what fields QCTOOL compares when matching variants. (The default behaviour is to match variants by the
genomic position, alleles, and ID fields). For example, in the command:
</p>
</div>
<div class="task_name">
Allowing for allele mismatches
</div>
<div class="task_notes">
<div class="task_command_line">
$ qctool -g cohort1.bgen -s cohort1.sample -g cohort2.bgen -s cohort2.sample -og joined.gen -os joined.sample -compare-variants-by position,alleles
</div>
<div class="task_notes">
<p>
QCTOOL will match variants by position and alleles (variants will match even if ID fields differ).
As in the example, <code>position</code> (the genomic position) must always be the first field matched on.
</p>
<p><b>Note:</b> careful use of these options is especially important when datasets contain multiple variants
sharing the same genomic position. The recommendation is to ensure all datasets are encoded
uniformly before combining them. (See the pages on <a href="sorting.html">sorting</a>,
<a href="aligning.html">aligning alleles</a>, and
<a href="manipulating.html">manipulating metadata</a> for options that can help with this.)
</p>
</div>
<div class="task_name">
Allowing for allele mismatches
</div>
<div class="task_notes">
<p>
It is sometimes the case that data is sorted in each dataset but that alleles are mixed up.
The <code>-match-alleles-to-cohort1</code> option tells QCTOOL to attempt to match data allowing for this
type of mismatch:
</p>
</div>
<div class="task_command_line">
$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample
-og joined.gen -os joined.sample -compare-variants-by position,alleles -match-alleles-to-cohort1
</div>
<div class="task_notes">
<p>
It is sometimes the case that data is sorted in each dataset but that alleles are mixed up.
The <code>-match-alleles-to-cohort1</code> option tells QCTOOL to attempt to match data allowing for this
type of mismatch:
Here, if the variant is biallelic and the alleles in the second dataset are the same as those
in the first cohort, but coded the other way round (e.g. cohort1 = A/G, cohort2 = G/A), then the
alleles and genotypes are flipped accordingly. No other transformation (e.g. strand flips, or matching multiallelics)
is performed by this operation. Thus, although this option can be convenient, the general recommendation
is to arrange each dataset to be encoded and sorted unformly before combining.
</p>
</div>
<div class="task_command_line">
$ qctool -g example.bgen -s example.sample -g second_cohort_#.gen -s second_cohort.sample
-og joined.gen -os joined.sample -compare-variants-by position,alleles -match-alleles-to-cohort1
</div>
<div class="task_notes">
<p>
Here, if the variant is biallelic and the alleles in the second dataset are the same as those
in the first cohort, but coded the other way round (e.g. cohort1 = A/G, cohort2 = G/A), then the
alleles and genotypes are flipped accordingly. No other transformation (e.g. strand flips, or matching multiallelics)
is performed by this operation. Thus, although this option can be convenient, the general recommendation
is to arrange each dataset to be encoded and sorted unformly before combining.
</p>
</div>
</section>
</div>
</section>

<nav class="button_bar">
<div class="nav_button" name="overview">
Expand Down
106 changes: 60 additions & 46 deletions doc/html/qctool/documentation/examples/computing_pcs.html
Original file line number Diff line number Diff line change
Expand Up @@ -32,71 +32,78 @@

<section class="main_text">
<h2>Computing principal components</h2>
<p>
Suppose <em>X</em> is a matrix of genotypes at a set of <em>L</em> biallelic
variants (rows) for a collection of <em>N</em> samples (columns).
We are interested in finding the eigenvectors of <em>1/N X<sup>t</sup>X</em>
(the principal components)
and of <em>1/L X X<sup>t</sup></em> (the 'loadings').
The two are related by the singular value decomposition, which says that
<em>X = U D V<sup>t</sup></em>
where the columns of V<sup>t</sup> and the rows of U are the required eigenvectors, and <em>D</em>
is a diagonal matrix.
</p>
<p>
In practice principal components analysis (PCA) is usually carried out on the
matrix of genotypes scaled by frequency, i.e.
<em>x<sub>ij</sub> = g<sub>ij</sub> / &radic; f(1-f)</em>. One reason for this is that
(in simple theoretical situations) the rate of genetic drift is proportional
to <em>&radic;f(1-f)</em>, where <em>f</em> is the ancestral allele frequency.
If PCA is carried out on a sample from a set of diverged populations, the overall
allele frequency can be though of as approximating the ancestral frequency, and the
principal components will measure genetic drift between samples.
</p>
<p>
The matrix <em>Z = 1/N X<sup>t</sup>X</em> can be thought of as a 'relatedness' matrix and is
useful in its own right. In simple theoretical scenarious, with the scaling <em>&radic;f(1-f)</em>,
the matrix <em>Z</em> is an estimate of the
matrix of <a href="https://en.wikipedia.org/wiki/Coefficient_of_relationship">kinship
coefficients</a>. In practical situations <em>Z</em> is best thought of as providing a measure
of relatedness relative to the sample; it can nevertheless be useful in identifying
close relationships.
</p>
<p>
QCTOOL computes principal components by solving for eigenvectors of <em>1/N X<sup>t</sup>X</em>.
It computes loadings by projecting genotypes at each variant onto
those eigenvectors, namely as <em>V<sup>t</sup></em> = U<sup>-1</sup> D<sup>-1</sup> X</em>.
</p>

<div class="task">
<div class="task_name">
Computing PCs
</div>
<div class="task_notes">
QCTOOL computes PCs by first estimating a relatedness or 'kinship' matrix, and then forming the eigendecomposition.
In short you use the <code>-kinship</code> option to compute a
relatedness matrix, the <code>-UDUT</code> option to eigendecompose it, and the <code>-PCs</code> option
to output PCs. A complete example would look like this:
</div>
<div class="task_command_line">
$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv -PCs 20 -osample PCs.csv
</div>
<div class="task_notes">
<p>
This outputs the first 20 PCs to the file <code>PCs.csv</code>, in addition to the estimated kinship
matrix and its eigendecomposition. The following sections show the use of these options in more detail.
</p>
</div>
</div>
<div class="task">
<div class="task_name">
Computing a kinship matrix
</div>
<div class="task_notes">
The <code>-kinship</code> option can be used to compute a kinship matrix, e.g.:
The <code>-kinship</code> option can be used to estimate a kinship matrix, as in:
</div>
<div class="task_command_line">
$ qctool -g example.bgen -s example.sample -kinship kinship.csv
</div>
<div class="task_notes">
This outputs pairwise kinship values to the file <code>kinship.csv</code>, which is stored in a 'long' format
with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes,
and the estimated kinship value.
<p>
This outputs pairwise kinship values to the file <code>kinship.csv</code>, which is stored in a 'long' format
with columns holding the first sample id, second sample id, the number of pairwise non-missing genotypes,
and the estimated kinship value. (Only the upper triangle of this matrix is output).
</p>
<p>
More precisely, Suppose <em>X</em> is the <em>L&times;N</em> matrix of
genotypes, with variants indexed by row.
Let <em>f<sub>i</sub></em> be an estimate of the frequency of the <em>i</em>th variant.
We write <em>Z</em> for the matrix <em>X</em> after centring and rescaling each row based on the allele frequency,
</p>
<p>
<em>Z<sub>i&middot;</sub> = (X<sub>i&middot;</sub> - mean(X<sub>i&middot;</sub>)) / &Sqrt; (2 f<sub>i</sub> (1-f<sub>i</sub>))</em>
</p>
<p>
QCTOOL estimates the kinship matrix as <em>1/L Z^t Z</em>.
In forming <em>Z</em>, QCTOOL uses a posterior estimate of allele
frequency <em>f<sub>i</sub></em> under a <em>Beta(2,2)</em>
distribution, i.e. <em>f<sub>i</sub> = (1+N<sub>b</sub>)/(2+2N))</em> where <em>N<sub>b</sub></em> is
the count number of 'b' alleles in the data. This can be understood as implicitly adding
a single haplotype of each allelic type to the data before
computing the frequency, which in turn ensures that the frequency estimate is not zero or 1.
</p>
</div>
</div>
<div class="task">
<div class="task_name">
Eigendecomposing a kinship matrix and computing PCs
</div>
<div class="task_notes">
The <code>-UDUT</code> option can be used to compute a UDUT decomposition of the computed kinship matrix.
The <code>-UDUT</code> option can be used to compute a UDUT decomposition (i.e. an eigendecomposition)
of the computed kinship matrix.
E.g.
</div>
<div class="task_command_line">
$ qctool -g example.bgen -s example.sample -kinship kinship.csv -UDUT udut.csv
</div>
<div class="task_notes">
This computes a UDUT decomposition of the kinship matrix. To additionally output principal components (PCs),
The output is an <em>N&times;(N+1)</em> matrix in which the first column represents the diagonal elements
of <em>D</em>, i.e. the eigenvalues, and the following <em>N</em> columns are the right eigenvectors
(i.e. the columns of <em>U</em>). To additionally output principal components (PCs),
additionally add the <code>-PCs</code> option:
</div>
<div class="task_command_line">
Expand All @@ -106,9 +113,15 @@ <h2>Computing principal components</h2>
<p>
The argument is the number of PCs to output.
</p>
<p>
<b>Note</b>: the PCs computed are simply rescaled entries of the right eigenvectors;
they are computed as <em>PC<sub>i</sub></em> = &Sqrt;(1/L) &times; U<sub>&middot;i</sub> D<sup>-1/2</sup></em>.
This scaling ensures the PCs do not grow with the number of variants.
</p>
<p>
<b>Note:</b> PCs are output to the file specified by <code>-osample</code>.
Specifying both <code>-sample-stats</code> and <code>-PCs</code> leads to a file that contains both
Depending on the command line, other values might also be output to this file. For example,
if you specify both <code>-sample-stats</code> and <code>-PCs</code>, the output file will contain both
per-sample summary statistics and PCs.
See the <a href="../../documentation/summary_file_formats.html">page
on summary statistic file formats</a> for more information on the format of the output.
Expand Down Expand Up @@ -136,9 +149,9 @@ <h2>Computing principal components</h2>
$ qctool -g example.bgen -s example.sample -load-udut udut.csv -loadings loadings.csv
</div>
<div class="task_notes">
The <code>-nPCs</code> option can again be used to adjust how many loadings are computed.
(You should ensure the same set of variants is used to compute
loadings as were used in constructing the kinship matrix).
The <code>-PCs</code> option can again be used to adjust how many loadings are computed.
<b>Note</b>: you should ensure the same set of variants is used to compute
loadings as were used in constructing the kinship matrix.
</div>
</div>

Expand All @@ -155,6 +168,7 @@ <h2>Computing principal components</h2>
</div>
<div class="task_notes">
The first argument is a set of loadings to load, while the second argument is the output file to compute.
Again, we advise using the same set of variants that were used to compute the PCs.
</div>
</div>
</section>
Expand Down
8 changes: 5 additions & 3 deletions doc/html/qctool/documentation/examples/pipelining.html
Original file line number Diff line number Diff line change
Expand Up @@ -77,15 +77,17 @@ <h2>Using qctool in a pipeline</h2>
</div>
<div class="task_notes">
As an example, the following command uses <a href="http://bitbucket.org/gavinband/bgen">bgenix</a> with QCTOOL to
compute snp summary statistics from a subset of a BGEN file, and view the result on the fly using <a href="https://en.wikipedia.org/wiki/Less_(Unix)"><code>less</code><a>.
compute snp summary statistics from a subset of a BGEN file, and view the result on the fly using
<a href="https://en.wikipedia.org/wiki/Less_(Unix)"><code>less</code></a>.
</div>
<div class="task_command_line">
$ bgenix -g file.bgen -range 11:3500000-6500000 | qctool -g - -filetype bgen -snp-stats -osnp - | less -S
</div>
</div>
</section>

<nav class="button_bar"> <div class="nav_button" name="overview">
<nav class="button_bar">
<div class="nav_button" name="overview">
<a href="../../index.html">overview</a>
</div>
<div class="nav_button" name="documentation">
Expand Down Expand Up @@ -125,7 +127,7 @@ <h2>Using qctool in a pipeline</h2>
<div class="nav_button" name="download">
<a href="../../documentation/download.html">download</a>
</div>
</nav>
</nav>

</body>
</html>

0 comments on commit 9852ae9

Please sign in to comment.