This package is implementation of Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling (Kim et al., 2020).
Cosine is more effective than Euclidean when measuring the distance between two high-dimensional (or sometimes, sparse) documents vectors. However, scikit-learn k-means package provides only Euclidean based k-means. Additionally, labeling clusters can be very helpful for interpreting the clustering results.
Spherical k-means works well both with sparse vector representation such as Bag-of-Words model or distributed representation such as Doc2Vec or other document embedding methods.
In lower dimensional vector space, Silhouette score method is useful to define the number of clusters (k
).
However Silhouette score method does not work well in a high-dimensional vector space such as Bag-of-Words and Doc2Vec model space.
One of the heuristic methods to define the number of clusters is to train k-means with large k
first and subsequently merge similar ones.
This method is also useful for preventing the Uniform Effect, which causes the size of all clusters to be similar.
soyclustering
provides Spherical k-means (k-means which uses Cosine distance as a distance metric) and keyword extraction-based clustering labeling function.
Furthermore, the function for visualizing pairwise distances between centroids will help you to check whether redundant clusters exist, allowing you to remove redundant clusters by merging them.
soyclustering
also provides a fast initialization method that is effective in a high-dimensional vector space.
When the size of the input data is large, the initialization method sets k to be 1000.
The below table decsribes relative speed-up of proposed initialization method with alpha=3 compared to kmeans++ (x times faster)
Dataset name | num rows | num features | num of nonzero (percent) | k=10 | k=20 | k=50 | k=100 |
---|---|---|---|---|---|---|---|
A6 blogs | 63,153 | 91,302 | 18,051,341 (0.313 %) | x 265 | x 257 | x 213 | x 150 |
Tucson blogs | 105,755 | 81,497 | 29,192,999 (0.339 %) | x 388 | x 440 | x 306 | x 244 |
Sonata blogs | 229,253 | 85,129 | 60,861,803 (0.312 $) | x 785 | x 841 | x 614 | x 495 |
IMDb reviews | 1,228,348 | 68,049 | 181,411,713 (0.217 %) | x 803 | x 714 | x 1988 | x 1787 |
Reuter RCV1 | 804,414 | 47,236 | 60,915,113 (0.160 %) | x 439 | x 713 | x 850 | x 772 |
MovieLens 20M | 138,493 | 131,262 | 20,000,263 (0.110 %) | x 202 | x 213 | x 214 | x 184 |
Yelp reviews | 5,261,669 | 27,247 | 365,341,887 (0.255 %) | x 368 | x 908 | x 1508 | x 2917 |
Using pip
pip install soyclustering
You can read a matrix market file as follows. Note that the file must include tokenized outputs. Although the spherical k-means function can be used for inputs in distributed representation such as Doc2Vec, our cluster labeling algorithm works only for Bag-of-Words model.
from scipy.io import mmread
x = mmread(mm_file).tocsr()
Sperical k-means can be used as follows. init='similar_cut' indicates our initializer that is effective in a high-dimensional vector space. If you want to preserve the sparsity of the centroid vector, you can set minimum_df_factor. Other interfaces are similar to those of scikit-learn k-means function. With fit_predict, you can retrieve the labels from the clustering result.
from soyclustering import SphericalKMeans
spherical_kmeans = SphericalKMeans(
n_clusters=1000,
max_iter=10,
verbose=1,
init='similar_cut',
sparsity='minimum_df',
minimum_df_factor=0.05
)
labels = spherical_kmeans.fit_predict(x)
When the verbose mode is set, computation speed and the level of sparsity during the initizalition and each iteration is printed.
initialization_time=1.218108 sec, sparsity=0.00796
n_iter=1, changed=29969, inertia=15323.440, iter_time=4.435 sec, sparsity=0.116
n_iter=2, changed=5062, inertia=11127.620, iter_time=4.466 sec, sparsity=0.108
n_iter=3, changed=2179, inertia=10675.314, iter_time=4.463 sec, sparsity=0.105
n_iter=4, changed=1040, inertia=10491.637, iter_time=4.449 sec, sparsity=0.103
n_iter=5, changed=487, inertia=10423.503, iter_time=4.437 sec, sparsity=0.103
n_iter=6, changed=297, inertia=10392.490, iter_time=4.483 sec, sparsity=0.102
n_iter=7, changed=178, inertia=10373.646, iter_time=4.442 sec, sparsity=0.102
n_iter=8, changed=119, inertia=10362.625, iter_time=4.449 sec, sparsity=0.102
n_iter=9, changed=78, inertia=10355.905, iter_time=4.438 sec, sparsity=0.102
n_iter=10, changed=80, inertia=10350.703, iter_time=4.452 sec, sparsity=0.102
Cluster labeling can be used to intrepret the clustering results. The proportion_keywords
function of soyclustering
uses a keyword extraction-based method to return keywords describing each cluster. For its input arguments, you need to provide cluster centroid vectors, a list of vocabularies (as str) and labels.
from soyclustering import proportion_keywords
centers = spherical_kmeans.cluster_centers_
idx2vocab = ['list', 'of', 'str', 'vocab']
keywords = proportion_keywords(centers, labels, index2word=idx2vocab)
The table in below is the results of cluster labels from a trained k-means with k=1,000 and 1.2M documents.
The meaning of cluster (Human label) |
Algorithm based extracted clustering labels |
---|---|
The movie “Titanic" | iceberg, zane, sinking, titanic, rose, winslet, camerons, 1997, leonardo, leo, ship, cameron, dicaprio, kate, tragedy, jack, di saster, james, romance, love, effects, special, story, people, best, ever, made |
Heros in Marvle comics (Avengers) | zemo, chadwick, boseman, bucky, panther, holland, cap, infinity, mcu, russo, civil, bvs, antman, winter, ultron, airport, ave ngers, marvel, captain, superheroes, soldier, stark, evans, america, iron, spiderman, downey, tony, superhero, heroes |
Alien movies such as Cover-field or District 9 | skyline, jarrod, balfour, strause, invasion, independence, cloverfield, angeles, district, los, worlds, aliens, alien, la, budget, scifi, battle, cgi, day, effects, war, special, ending, bad, better, why, they, characters, their, people |
Horror movies | gayheart, loretta, candyman, legends, urban, witt, campus, tara, reid, legend, alicia, englund, leto, rebecca, jared, scream, murders, slasher, helen, killer, student, college, students, teen, summer, cut, horror, final, sequel, scary |
The movie “The Matrix" | neo, morpheus, neos, oracle, trinity, zion, architect, hacker, reloaded, revolutions, wachowski, fishburne, machines, agents, matrix, keanu, smith, reeves, agent, jesus, machine, computer, humans, fighting, fight, world, cool, real, special, effects |
Setting a large k
leads to redundant clusters. You can identify these redundant clusters by carefully examining the pairwise distance between the clusters.
from soyclustering import visualize_pairwise_distance
# visualize pairwise distance matrix
fig = visualize_pairwise_distance(centers, max_dist=.7, sort=True)
If you find redundant clusters, you can easily merge them into a single cluster.
from soyclustering import merge_close_clusters
group_centers, groups = merge_close_clusters(centers, labels, max_dist=.5)
fig = visualize_pairwise_distance(group_centers, max_dist=.7, sort=True)
After merging, you can check the size of dark squares in the diagonal entries of the pairwise distance matrix. If the redundant clusters are indeed successfully merged, the number of dark sqaures in the diagonal entries should have been reduced.
The function merge_close_clusters
groups similar clusters, in which the distance between them is smaller than max_dist
.
The centroid of the new cluster is a weighted average of original centroid vectors.
From the variable groups
, you can return the cluster indices prior and after merging.
for group in groups:
print(group)
[0, 19, 57, 68, 88, 115, 202, 223, 229, 237]
[1]
[2]
[3, 4, 5, 8, 12, 14, 16, 18, 20, 22, 26, 28, ...]
[6, 25, 29, 32, 37, 43, 45, 48, 53, 56, 65, ...]
[7, 17, 34, 41, 52, 59, 76, 79, 84, 87, 93, ...]
[9, 15, 24, 47, 51, 97]
[10, 100, 139]
[11, 23, 251]
...
In addition, the repository kmeans_to_pyLDAvis
provides k-means visualization using PyLDAVis
- Kim, H., Kim, H. K., & Cho, S. (2020). Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling. Expert Systems with Applications, 150, 113288.