# KMeans

> KMeans is a method that aims to cluster data into K groups of equal variance. The conventional KMeans algorithm has a performance bottleneck; when implemented with a parameter server (PS), however, KMeans achieves the same level of accuracy with better performance.

## 1. Introduction

The KMeans algorithm assigns each data point to its *nearest* cluster, where the *distance* is measured between the data point and the cluster's *centroid*. In general, the KMeans algorithm is implemented in an iterative way as shown below:

![kmeans](img/kmeans.png)

where ![xi](img/xi.png) is the i-th sample, ![center](img/center.png) is its nearest cluster, and ![miu](img/miu.png) is the centroid of the i-th cluster.

### Mini-batch KMeans

"Web-Scale K-Means Clustering" proposes an improved KMeans algorithm that uses mini-batch optimization for training, to address the latency, scalability and sparsity requirements of user-facing web applications, as shown below:

![mini_batch_kmeans](img/mini_batch_kmeans.png)

## 2. Distributed Implementation on Angel

### Model Storage

KMeans on Angel stores the K centroids on the parameter server, using a K×N matrix representation, where K is the number of clusters and N is the data dimension, i.e., the number of features.

### Model Updating

KMeans on Angel is trained in an iterative fashion; during each iteration, the centroids are updated from a mini-batch of samples.

### Algorithm

KMeans on Angel algorithm:

![kmeans_on_angel](img/kmeans_on_angel.png)

## 3. Execution & Performance

### Input Format

* The data format is set in "ml.data.type", which supports the "libsvm" and "dummy" formats. For details, see [Angel Data Format](data_format_en.md).
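
For illustration, here is a minimal parser for the two formats. The exact layouts assumed below (libsvm as `label index:value ...`, dummy as a label followed by the 0-based indices of non-zero features) are assumptions made for this sketch; the linked data-format page is authoritative:

```python
def parse_libsvm(line, n_features):
    # Assumed layout: "label idx:value idx:value ...", indices 0-based.
    parts = line.split()
    label = float(parts[0])
    x = [0.0] * n_features
    for item in parts[1:]:
        idx, val = item.split(":")
        x[int(idx)] = float(val)
    return label, x

def parse_dummy(line, n_features):
    # Assumed layout: "label idx idx ...", each index marks a 1.0 feature.
    parts = line.split()
    label = float(parts[0])
    x = [0.0] * n_features
    for idx in parts[1:]:
        x[int(idx)] = 1.0
    return label, x
```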

### Parameters

* IO Parameters
  * angel.train.data.path: input path
  * ml.feature.num: number of features
  * ml.data.type: [Angel Data Format](data_format_en.md); can be "dummy" or "libsvm"
  * angel.save.modelPath: save path for the trained model
  * angel.log.path: save path for the log
* Algorithm Parameters
  * ml.kmeans.center.num: K, the number of clusters
  * ml.kmeans.sample.ratio.perbath: sample ratio for each mini-batch
  * ml.kmeans.c: learning rate

### Performance
# LDA (Latent Dirichlet Allocation)

---

> LDA is a widely-used topic-modeling technique: a Bayesian generative model for discovering hidden topical patterns that helps with dimension reduction and text analysis.

## 1. Introduction

### Overview

A text corpus ``$ C $`` contains a set of documents ``$ \{D_1, \cdots, D_{M}\} $``, and each document ``$ D_i $`` contains a set of words, ``$ D_i = (t_1, t_2, \cdots, t_{N_i}) $``. A word is a basic unit of the vocabulary, denoted by ``$ V $``. The number of topics in LDA, ``$ K $``, needs to be specified. In LDA, each document is modeled as a random mixture over ``$ K $`` latent topics, ``$ \theta_d $``, whereas each topic is modeled as a ``$ V $``-dimensional distribution over words, ``$ \phi_k $``.

LDA models the generative process for each document in the corpus. It draws a ``$ K $``-dimensional topic distribution, ``$ \theta_d $``, from a Dirichlet distribution, ``$ Dir(\alpha) $``, where ``$ \alpha $`` is the parameter vector of the Dirichlet (a hyperparameter of the LDA model). To generate each word ``$ t_{dn} $`` in document ``$ d $``, LDA first draws the topic of the word, ``$ z_{dn} $``, from a multinomial distribution ``$ Mult(\theta_d) $``, and then draws the word ``$ w_{dn} \in V $`` from a multinomial distribution ``$ Mult(\phi_{z_{dn}}) $``.
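
This generative process can be simulated directly with NumPy; the function below is an illustrative sketch (the names are not Angel's), with `phi` taken as a precomputed ``$ K \times V $`` matrix of per-topic word distributions:

```python
import numpy as np

def generate_document(alpha, phi, n_words, rng):
    """Simulate LDA's generative story for a single document."""
    theta_d = rng.dirichlet(alpha)               # theta_d ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta_d), p=theta_d)  # z_dn ~ Mult(theta_d)
        w = rng.choice(phi.shape[1], p=phi[z])   # w_dn ~ Mult(phi_z)
        words.append(int(w))
    return words
```

Training inverts this story: given only the words, infer the latent `theta_d`, `phi`, and topic assignments.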

### Gibbs Sampling

A common inference technique for LDA is Gibbs sampling, an MCMC method for sampling from the posterior distribution of ``$ z_{dn} $`` and inferring the per-document distribution over topics and the per-topic distribution over words. Commonly used Gibbs sampling variants include Collapsed Gibbs Sampling (CGS), SparseLDA, AliasLDA, F+LDA, LightLDA and WarpLDA, to name a few; our experimental results suggest that F+LDA is the most suitable for training LDA on Angel.

### Collapsed Gibbs Sampling (CGS)

We use ``$ Z=\{z_d\}_{d=1}^D $`` to represent the set of topics for all words, ``$ \Phi = [\phi_1 \cdots \phi_{V}] $`` to represent the ``$ V \times K $`` topic-word matrix, and ``$ \Theta = [\theta_1 \cdots \theta_D] $`` to represent the matrix whose columns are the topic distributions for all documents. Training LDA then requires inferring the posterior of the latent variables ``$ (\Theta, \Phi, Z) $`` given the observed corpus ``$ C $`` and the hyperparameters. Using conjugate priors, CGS gives a closed-form expression for the posterior of ``$ Z $``, resulting in simple iterations for sampling ``$ z_{dn} $`` following the conditional probability below:

```math
p(z_{dn} = k \mid t_{dn} = w, Z_{\neg dn}, C_{\neg dn}) \propto \frac{C_{wk}^{\neg dn} + \beta}{C_{k}^{\neg dn} + V\beta}\,(C_{dk}^{\neg dn} + \alpha)
```
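
One step of this update can be sketched with dense count arrays (an illustrative sketch, not Angel's implementation): remove word ``$ (d, n) $`` from the counts, sample a new topic from the conditional above, and add it back:

```python
import numpy as np

def sample_topic(d, w, old_k, C_wk, C_k, C_dk, alpha, beta, rng):
    """Resample the topic of word w in document d (currently old_k)."""
    V = C_wk.shape[0]
    # Exclude the current assignment (the "neg dn" counts).
    C_wk[w, old_k] -= 1; C_k[old_k] -= 1; C_dk[d, old_k] -= 1
    # p(z = k | ...) proportional to (C_wk + beta)/(C_k + V*beta) * (C_dk + alpha)
    p = (C_wk[w] + beta) / (C_k + V * beta) * (C_dk[d] + alpha)
    new_k = rng.choice(len(p), p=p / p.sum())
    # Add the word back under its new topic.
    C_wk[w, new_k] += 1; C_k[new_k] += 1; C_dk[d, new_k] += 1
    return new_k
```

A full CGS sweep simply applies this update to every word occurrence in the corpus; the sparse variants below differ only in how the distribution `p` is evaluated and sampled.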

### F+LDA

F+LDA factorizes the probability into two parts, ``$ C_{dk} \frac{C_{wk} + \beta}{C_k + V\beta} $`` and ``$ \alpha \frac{C_{wk} + \beta}{C_k + V\beta} $``. Because ``$ C_d $`` is sparse, sampling is only done over its non-zero elements; for the rest, F+LDA uses an F+ tree for searching, reducing the complexity to O(log K). Overall, F+LDA's complexity is ``$ O(K_d) $``, where ``$ K_d $`` is the number of non-zero elements in the document's row of the document-topic matrix.
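
The two-bucket structure of this factorization can be sketched as follows. This is an illustrative sketch only: a plain cumulative-sum search stands in for the F+ tree over the dense prior bucket, and all names are hypothetical:

```python
import numpy as np

def sample_fplda(doc_topics, C_wk_w, C_k, alpha, beta, V, rng):
    """doc_topics: dict {topic: count}, the sparse C_d row for this document.
    C_wk_w: counts of word w per topic; C_k: total counts per topic."""
    q = (C_wk_w + beta) / (C_k + V * beta)
    # Sparse bucket: iterate only the non-zero document-topic counts.
    sparse_topics = list(doc_topics)
    sparse_mass = np.array([doc_topics[k] * q[k] for k in sparse_topics])
    # Dense bucket: the alpha * q_k prior part (what the F+ tree serves).
    dense_mass = alpha * q
    u = rng.uniform(0.0, sparse_mass.sum() + dense_mass.sum())
    if u < sparse_mass.sum():
        idx = np.searchsorted(np.cumsum(sparse_mass), u)
        return sparse_topics[idx]
    u -= sparse_mass.sum()
    return int(np.searchsorted(np.cumsum(dense_mass), u))
```

Replacing the linear `cumsum` search on the dense bucket with an F+ tree is what yields the O(log K) bound quoted above; the sparse bucket already costs only ``$ O(K_d) $``.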

## 2. Distributed Implementation on Angel

The overall framework for training LDA on Angel is shown in the figure below. There are two comparatively large matrices in LDA, ``$ C_w $`` and ``$ C_d $``; we partition ``$ C_d $`` across the workers and ``$ C_w $`` across the servers. In each iteration, workers pull ``$ C_w $`` from the servers for drawing topics, and send the updates to ``$ C_w $`` back to the servers.

![lda_ps](img/lda_ps.png)

## 3. Execution & Performance

### Input Format

* Each line is a document, and each document consists of a set of word ids separated by `,`:

```
wid_0, wid_1, ..., wid_n
```
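
A minimal reader for this format (illustrative; `read_corpus` is not part of Angel) just splits each line on commas and skips blanks:

```python
def read_corpus(lines):
    """Parse lines of comma-separated word ids into a list of documents."""
    docs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        docs.append([int(w) for w in line.split(",")])
    return docs
```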

### Parameters

* Data Parameters
  * angel.train.data.path: input path
  * angel.save.model.path: save path for the trained model
* Algorithm Parameters
  * ml.epoch.num: number of iterations
  * ml.lda.word.num: number of words
  * ml.lda.topic.num: number of topics
  * ml.worker.thread.num: number of threads within each worker
  * ml.lda.alpha: alpha
  * ml.lda.beta: beta

### Performance

* **Data**
  * PubMED

* **Resource**
  * worker: 20
  * ps: 20

* **Angel vs Spark**: training time for 100 iterations
  * Angel: 15 min
  * Spark: >300 min