Skip to content

rozyangno/sce

Repository files navigation

Stochastic Cluster Embedding (SCE)

The code implements the algorithm in the SCE paper.

@article{sce,
    title={Stochastic Cluster Embedding},
    author={Zhirong Yang and Yuwei Chen and Denis Sedov and Samuel Kaski and Jukka Corander}
    journal={Statistics and Computing},
    year={2023},
    volume={33},
    number={12},
}

SCE is a data visualization method which shows the clusters in well-clusterable data. The input is a pairwise similarity matrix, which is often obtained by using Entropic Affinity or K-nearest-neighbor graph.

The SCE program is a standalone excutable which has been tested in Ubuntu. We also provide wrappers in Python and Matlab.

Prerequisite

Install GSL (GNU Scientific Library) before compilation by sudo apt-get install libgsl-dev

Compilation

Modify the CUDA architecture in compile.sh to fit your GPU. Here I use Quadro RTX 4000, so it is "-gencode arch=compute_75,code=sm_75". Then run compile.sh.

Demo usage

For Matlab, see demo.m. For Python, see demo_vectors.py or demo_similarities.py.

Demonstration

You can download the data here to repeat the following demos. Note that some similarity matrices may be non-normalized (SCE will normalize them internally).

SHUTTLE data set

58,000 data points. Data source. Pre-computed similarity matrix (entropic affinity with perplexity 30).

SCE (69 seconds) t-SNE LargeVis UMAP
shuttle sce shuttle tsne shuttle largevis shuttle umap

MNIST data set

70,000 data points, Data source, Pre-computed similarity matrix (10-nearst-neighbor graph).

SCE (47 seconds) t-SNE LargeVis UMAP
mnist sce mnist tsne mnist largevis mnist umap

IJCNN data set

126,701 data points. Data source, see ijcnn1. Pre-computed similarity matrix (entropic affinity with perplexity 30), where class labels stand for ten engines instead of the original binary.

SCE (75 seconds) t-SNE LargeVis UMAP
ijcnn sce ijcnn tsne ijcnn largevis ijcnn umap

TOMORADAR data set

120,024 data points. The data was collected via a helicopter-borne microwave profiling radar termed FGI-Tomoradar to investigate the vertical topography structure of forests. The original vectorial data (NB! 5.6GB). Pre-computed similarity matrix (50-nearest-neighbor graph).

SCE (660 seconds) t-SNE LargeVis UMAP
tomoradar sce tomoradar tsne tomoradar largevis tomoradar umap

FLOW-CYTOMETRY data set

1,000,000 data points. Data source. Preprocessed vectorial data. Pre-computed similarity matrix (15-nearest-neighbor graph).

SCE (7713 seconds) t-SNE LargeVis UMAP
flow-cytometry sce flow-cytometry tsne flow-cytometry largevis flow-cytometry umap

HIGGS data set

11,000,000 points. Data source. Pre-computed similarity matrix (5-nearest-neighbor graph).

SCE (8969 seconds) t-SNE LargeVis UMAP
higgs sce higgs tsne higgs largevis higgs umap

FAQs

  • Q: SCE cannot find clusters for my data. Why?
  • A: SCE is designed for well-clustered data. SCE may not identify meaningful clusters if your data is not well-clustered.

  • Q: The clusters found by SCE do not match the ground truth classes in my data. Why?
  • A: SCE is an unsupervised method. There is no guarantee that the clusters must be aligned with the supervised class labels.

  • Q: The clusters found by SCE looked too small. How can I improve it?
  • A: You can tune the parameter alpha to adjust the attraction-repulsion trade-off. By default alpha=0.5. When alpha approaches 0, it behaves similarly to SNE. When alpha approaches 1, the clusters would be more shattered or more compact.

About

Stochastic Cluster Embedding

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published