This adds the LDA user guide from jkbradley, with Java and Scala code examples.
Author: Xiangrui Meng <[email protected]>
Author: Joseph K. Bradley <[email protected]>
Closes apache#4465 from mengxr/lda-guide and squashes the following commits:
6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
docs/mllib-clustering.md (+128 -1)
@@ -55,7 +55,7 @@ has the following parameters:
Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
-* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
+* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
* calculates the principal eigenvalue and eigenvector
* clusters each of the input points according to their principal eigenvector component value
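The steps above can be exercised end to end with a toy, hand-built affinity graph. This is a hypothetical local-mode sketch: the similarity values, the `k = 2` choice, and the two-triangle layout are illustrative assumptions, not from this guide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.PowerIterationClustering

val sc = new SparkContext(new SparkConf().setAppName("PICSketch").setMaster("local"))

// Hand-built pairwise similarities (srcId, dstId, similarity):
// two tight triangles joined by a single weak edge.
val similarities = sc.parallelize(Seq(
  (0L, 1L, 1.0), (1L, 2L, 1.0), (0L, 2L, 1.0),
  (3L, 4L, 1.0), (4L, 5L, 1.0), (3L, 5L, 1.0),
  (2L, 3L, 0.01)
))

val model = new PowerIterationClustering()
  .setK(2)               // number of clusters
  .setMaxIterations(20)  // cap on power iterations
  .run(similarities)

// Each input point receives a cluster assignment.
val assigned = model.assignments.collect()
assigned.foreach(println)

sc.stop()
```

With `k = 2` and only a weak bridge edge between the triangles, the two triangles should land in different clusters.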
@@ -71,6 +71,35 @@ Example outputs for a dataset inspired by the paper - but with five clusters ins
<!-- Images are downsized intentionally to improve quality on retina displays -->
Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:

* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
* Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
LDA takes in a collection of documents as vectors of word counts.
It learns a clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function. After fitting on the documents, LDA provides:

* Topics: Inferred topics, each of which is a probability distribution over terms (words).
* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
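Both outputs can be inspected on a toy corpus. This is a minimal local-mode sketch under illustrative assumptions: a 4-word vocabulary, 3 tiny documents, and a cast to `DistributedLDAModel` (the model type produced by EM-based training).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("LDAOutputs").setMaster("local"))

// Toy corpus: (document id, word-count vector over a 4-word vocabulary)
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(4.0, 3.0, 0.0, 0.0)),
  (1L, Vectors.dense(0.0, 0.0, 5.0, 2.0)),
  (2L, Vectors.dense(3.0, 4.0, 1.0, 0.0))
))

val model = new LDA().setK(2).run(corpus).asInstanceOf[DistributedLDAModel]

// Output 1 -- topics: a vocabSize x k matrix of term weights.
println(model.topicsMatrix)

// Output 2 -- per-document topic distributions for the training set.
val docTopics = model.topicDistributions.collect()
docTopics.foreach { case (docId, dist) => println(s"doc $docId: $dist") }

sc.stop()
```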
LDA takes the following parameters:

* `k`: Number of topics (i.e., cluster centers)
* `maxIterations`: Limit on the number of iterations of EM used for learning
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
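These parameters are set via builder-style setters on `LDA`; a minimal sketch (the values below are arbitrary placeholders, not recommendations):

```scala
import org.apache.spark.mllib.clustering.LDA

val lda = new LDA()
  .setK(10)                    // number of topics
  .setMaxIterations(50)        // cap on EM iterations
  .setDocConcentration(2.0)    // must be > 1; larger => smoother doc-topic distributions
  .setTopicConcentration(2.0)  // must be > 1; larger => smoother topic-term distributions
  .setCheckpointInterval(10)   // only takes effect if checkpointing is configured

println(s"k=${lda.getK}, maxIterations=${lda.getMaxIterations}")
```

The configured instance is then passed a corpus via `run`, as in the example below the parameter list.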
*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet support prediction on new documents, and it does not have a Python API. These will be added in the future.
### Examples
#### k-means
@@ -293,6 +322,104 @@ for i in range(2):
</div>
#### Latent Dirichlet Allocation (LDA) Example
In the following example, we load word count vectors representing a corpus of documents.
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
to infer three topics from the documents. The number of desired clusters is passed
to the algorithm. We then output the topics, represented as probability distributions over words.
<div class="codetabs">
<div data-lang="scala" markdown="1">

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

// Output topics. Each is a distribution over words (matching word count vectors)
println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")