% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/otu.R
\name{otu}
\alias{otu}
\title{Cluster sequences into operational taxonomic units.}
\usage{
otu(x, k = 5, threshold = 0.97, method = "central",
residues = NULL, gap = "-", ...)
}
\arguments{
\item{x}{a "DNAbin" object.}
\item{k}{integer giving the k-mer size used to generate the input matrix
for k-means clustering.}
\item{threshold}{numeric between 0 and 1 giving the OTU identity cutoff.
Defaults to 0.97.}
\item{method}{the maximum distance criterion to use for terminating the
recursive partitioning procedure. Accepted options are "central" (splitting
stops if the similarity between the central sequence
and its farthest neighbor within the cluster is greater than the threshold),
"centroid" (splitting stops if the similarity between the centroid
and its farthest neighbor within the cluster is greater than the threshold),
and "farthest" (splitting
stops if the similarity between the two farthest sequences within the cluster
is greater than the threshold). Defaults to "central".}
\item{residues}{either NULL (default; emitted residues are automatically
detected from the sequences), a case sensitive character vector
specifying the residue alphabet, or one of the character strings
"RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for
large lists of character vectors. Specifying the residue alphabet is therefore
recommended unless the sequence list is a "DNAbin" or "AAbin" object.}
\item{gap}{the character used to represent gaps in the alignment matrix
(if applicable). Ignored for \code{"DNAbin"} or \code{"AAbin"} objects.
Defaults to "-" otherwise.}
\item{...}{further arguments to be passed to \code{kmeans} (not including
\code{centers}).}
}
\value{
a named integer vector of cluster membership with values ranging from 1 to
the total number of OTUs. Asterisks indicate the representative sequence within
each cluster.
}
\description{
This function performs divisive hierarchical clustering on a set of
DNA sequences using sequential k-means partitioning,
returning an integer vector of OTU membership.
}
\details{
This function clusters sequences into OTUs by first
generating a matrix of k-mer counts, and then splitting the matrix
into two subsets (row-wise) using the k-means algorithm (\emph{k} = 2).
The splitting continues recursively until the farthest k-mer distance
in every cluster is below the threshold value.
This is a divisive, or "top-down" approach to OTU clustering,
as opposed to agglomerative "bottom-up" methods.
It is particularly useful for large datasets with many sequences
(\emph{n} > 10,000) since the need to compute a large \emph{n} x \emph{n}
distance matrix is circumvented.
This effectively reduces the time and memory complexity from quadratic to linear,
while generally maintaining comparable accuracy.
It is recommended to increase the value
of \code{nstart} passed to \code{kmeans} \emph{via} the \code{...} argument
to at least 20.
While this can increase computation time, it can improve clustering accuracy
considerably.
DNA and amino acid sequences can be passed to the function either as
a list of non-aligned sequences or a matrix of aligned sequences,
preferably in the "DNAbin" or "AAbin" raw-byte format
(Paradis et al. 2004; Paradis 2012; see the \code{\link[ape]{ape}} package
documentation for more information on these S3 classes).
Character sequences are supported; however, ambiguity codes may
not be recognized or treated appropriately. Only in the raw-byte formats
are ambiguity codes counted according to their underlying residue frequencies
(e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT"
and 0.5 to that of "ACGGT").
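As a rough illustration of this weighting (a sketch only, not the package's
internal code; \code{iupac} and \code{expand_kmer} are hypothetical names
and only two ambiguity codes are covered), a k-mer containing ambiguity
codes can be expanded into equally weighted unambiguous k-mers:
\preformatted{
# hypothetical lookup table covering two IUPAC ambiguity codes
iupac <- list(R = c("A", "G"), Y = c("C", "T"))
expand_kmer <- function(kmer) {
  chars <- strsplit(kmer, "")[[1]]
  # residues without an entry in the table stand for themselves
  opts <- lapply(chars, function(ch) if (is.null(iupac[[ch]])) ch else iupac[[ch]])
  grid <- expand.grid(opts, stringsAsFactors = FALSE)
  words <- do.call(paste0, unname(as.list(grid)))
  # each unambiguous k-mer receives an equal share of one count
  setNames(rep(1 / length(words), length(words)), words)
}
expand_kmer("ACRGT")  # ACAGT and ACGGT each receive a weight of 0.5
}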
To minimize computation time when counting longer k-mers (k > 3),
amino acid sequences in the raw "AAbin" format are automatically
compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004).
Note that amino acid sequences will not be compressed if they
are supplied as a list of character vectors rather than an "AAbin"
object, in which case the k-mer length should be reduced
(k < 4) to avoid excessive memory use and computation time.
}
\examples{
\dontrun{
## Cluster the woodmouse dataset (from the ape package) into OTUs
library(ape)
data(woodmouse)
## trim gappy ends to subset global alignment
woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]
## cluster sequences into OTUs at 0.97 threshold with kmer size = 5
suppressWarnings(RNGversion("3.5.0"))
set.seed(999)
woodmouse.OTUs <- otu(woodmouse, k = 5, threshold = 0.97, nstart = 20)
woodmouse.OTUs
}
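## A rough base-R sketch of the divisive procedure described in Details.
## This is not the implementation used by otu(): count_kmers() and
## split_recursive() are hypothetical helpers, Euclidean distance between
## rows of k-mer proportions stands in for the k-mer distance, and max_dist
## is an arbitrary illustrative cutoff rather than an identity threshold.
count_kmers <- function(seqs, k = 2, alphabet = c("A", "C", "G", "T")) {
  ## enumerate every possible k-mer over the alphabet
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  kmers <- do.call(paste0, unname(as.list(grid)))
  counts <- vapply(seqs, function(s) {
    starts <- seq_len(nchar(s) - k + 1)
    words <- substring(s, starts, starts + k - 1)
    as.numeric(table(factor(words, levels = kmers)))
  }, numeric(length(kmers)))
  counts <- t(counts)
  colnames(counts) <- kmers
  ## convert counts to proportions so that rows are comparable
  counts / rowSums(counts)
}
split_recursive <- function(x, ids = seq_len(nrow(x)), max_dist = 0.3, ...) {
  sub <- x[ids, , drop = FALSE]
  ## stopping rule in the spirit of method = "centroid": distance from the
  ## centroid to its farthest row, which avoids a pairwise distance matrix
  centre <- colMeans(sub)
  far <- max(sqrt(rowSums(sweep(sub, 2, centre)^2)))
  if (length(ids) < 2 || far <= max_dist) return(list(ids))
  ## otherwise split the rows in two with k-means and recurse on each half
  km <- kmeans(sub, centers = 2, ...)
  c(split_recursive(x, ids[km$cluster == 1], max_dist, ...),
    split_recursive(x, ids[km$cluster == 2], max_dist, ...))
}
## toy sequences forming two well separated groups
toy <- c(a1 = "ACACACACACACACAC", a2 = "ACACACACACACACAA",
         b1 = "GTGTGTGTGTGTGTGT", b2 = "GTGTGTGTGTGTGTGG")
set.seed(999)
kmat <- count_kmers(toy, k = 2)
clusters <- split_recursive(kmat, max_dist = 0.3, nstart = 5)
## flatten the list of clusters into a named integer membership vector
membership <- integer(nrow(kmat))
for (i in seq_along(clusters)) membership[clusters[[i]]] <- i
names(membership) <- rownames(kmat)
membership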
}
\references{
Edgar RC (2004) Local homology recognition and distance measures in
linear time using compressed amino acid alphabets.
\emph{Nucleic Acids Research}, \strong{32}, 380-385.

Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics
and evolution in R language. \emph{Bioinformatics}, \strong{20}, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R
(Second Edition). Springer, New York.
}
\author{
Shaun Wilkinson
}