Application for clustering sets of letters (stored as images). Hyperparameters are optimized with the Optuna framework using multithreaded computation.
© Mateusz Boruń
Main program for clustering: `cluster.py`

The project directory contains much more than just the clustering program itself. Here is a description of the additional contents:
- `optimization/` - documents the hyperparameter optimization process; it is only part of the method description, and you do not need it to run the main program. It lets you see exactly how the hyperparameters were optimized.
- `optimization/data/` - contains the dataset of 7618 images of characters used for hyperparameter optimization.
- `optimization/data/@0CLUSTERING.csv` - contains a manual clustering of the whole dataset, used for evaluating different clusterings (clustering the dataset by hand took about 12 hours).
- `optimization/Clustering.ipynb` - Jupyter notebook containing all commands invoked during the hyperparameter optimization process, together with their results.
- `optimization/clustering.db` - database handling the multithreaded processing.
- `optimization/plots/` - contains visualizations of the optimization process.
- `optimization/logs/` - contains logs from the optimization process.
- `install.sh` - script that creates the virtual environment and installs the libraries.
- `requirements.txt` - contains all requirements needed to run the `cluster.py` app.
To install the application (on a Unix OS):
- Some prerequisites are required (otherwise the install script won't work properly):
  `python3`
  `python3-pip`
  `python3-virtualenv`
- The `python3-virtualenv` package should be installed not with `pip`, but with the OS package manager. For example, Debian/Ubuntu users should run: `sudo apt install python3-virtualenv`
- Run the `install.sh` script. It will automatically create a virtual environment and install all required libraries.
To run the application, you first need to activate the virtual environment.
Inside application directory run:
source ./venv/bin/activate
Then, to run the application, simply type:
python3 cluster.py [filename]
As `[filename]` you have to provide a path to a file listing the paths of the images to be clustered. The image filenames should be unique.
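For illustration, an input file might look like this (this sketch assumes one image path per line, which is not confirmed by the script itself; the filenames are hypothetical):

```
images/letter_a_001.png
images/letter_a_002.png
images/letter_b_001.png
```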
You can also launch a test run by executing:
python3 cluster.py random
In this case the input file is randomly generated from the test dataset; of course, this requires the dataset (which is otherwise not needed).
As output, 2 files are generated:
- `result.txt` - file with the clustering results; each line contains the filenames that belong to a single cluster.
- `result.html` - file with the clustered images; every cluster is separated by a single line (it is not recommended to open the results in Google Chrome due to rendering problems).
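For illustration, `result.txt` might then look like this (hypothetical filenames; the separator between filenames within a line is assumed here to be a space):

```
letter_a_001.png letter_a_002.png
letter_b_001.png
```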
Due to the distance calculations in (non-Euclidean) Minkowski space, the whole process can take a while. For an input of 5000 images it should take about 7-8 minutes.
The algorithm has 2 hyperparameters:
- P - the order of the Minkowski distance
- EPS - the eps parameter of the DBSCAN algorithm
The algorithm consists of a few simple steps:
- Load images of characters from files
- Convert images to grayscale
- Compute the center of mass of each image
- Add white padding to each image, so that its center of mass is in the center of the image
- Add minimal symmetric white padding to each image so that they all have the same width and height
- Cluster them with the DBSCAN algorithm, with parameters `min_samples=1` and `eps=EPS`, using the Minkowski P-distance as the dissimilarity measure
The reason we "shift" each image based on center of mass is: two images with the same character placed in different places
should be considered similar (and should be clustered together).
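Here is a minimal sketch of this pipeline, not the author's exact code: it assumes pixel values normalized to [0, 1] (a different scaling in `cluster.py` would change the effective EPS), and the helper names are illustrative.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import DBSCAN

P = 3.5779679634436112    # optimized order of the Minkowski distance (see below)
EPS = 1.0694905187389605  # optimized DBSCAN eps (see below)

def load_grayscale(path):
    # Load an image and convert it to grayscale in [0, 1] (assumed scaling).
    return np.asarray(Image.open(path).convert("L"), dtype=float) / 255.0

def pad_to_center_of_mass(img):
    # Add white padding so the center of mass of the "ink" (dark pixels)
    # lands in the geometric center of the image.
    h, w = img.shape
    ink = 1.0 - img
    total = ink.sum() or 1.0
    cy = (np.arange(h) * ink.sum(axis=1)).sum() / total
    cx = (np.arange(w) * ink.sum(axis=0)).sum() / total
    top = max(0, int(round(h - 1 - 2 * cy)))
    bottom = max(0, int(round(2 * cy - (h - 1))))
    left = max(0, int(round(w - 1 - 2 * cx)))
    right = max(0, int(round(2 * cx - (w - 1))))
    return np.pad(img, ((top, bottom), (left, right)), constant_values=1.0)

def pad_to_shape(img, height, width):
    # Minimal symmetric white padding to a common size.
    dh, dw = height - img.shape[0], width - img.shape[1]
    return np.pad(img, ((dh // 2, dh - dh // 2), (dw // 2, dw - dw // 2)),
                  constant_values=1.0)

def cluster_images(paths):
    imgs = [pad_to_center_of_mass(load_grayscale(p)) for p in paths]
    height = max(i.shape[0] for i in imgs)
    width = max(i.shape[1] for i in imgs)
    X = np.stack([pad_to_shape(i, height, width).ravel() for i in imgs])
    # DBSCAN with min_samples=1 and the Minkowski P-distance.
    return DBSCAN(eps=EPS, min_samples=1, metric="minkowski", p=P).fit_predict(X)
```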
I chose the DBSCAN algorithm because it is very convenient for handling an unknown number of clusters, as long as the `eps` parameter and the dissimilarity measure are well chosen. Determining these values is the key part of this solution.
The key to good performance is finding the best values of the hyperparameters EPS and P. To do that, I used the modern optimization framework Optuna (see: Optuna website) and sped it up with multithreaded computation. The example dataset was clustered manually, and then Optuna was used to find the P and EPS that maximize the adjusted Rand index score (the clustering generated by the algorithm was compared against the manual clustering).
As it is necessary to specify a finite interval for each hyperparameter, and the order of the Minkowski distance P can be any number from the interval [1, +inf), we define Q := 1 / P, so that Q lies in the interval [0, 1], and optimize Q instead of P.
To be a bit more formal, let's introduce the following notation: for any values Q, EPS we define
- f(Q, EPS) - the adjusted Rand index score of the clustering of the entire dataset obtained with these hyperparameter values
- g(Q, EPS) - the average adjusted Rand index score of clusterings of randomly chosen 5000-element subsets obtained with these hyperparameter values (we average 50 such values)
The optimization process consists of two steps:
- Find the Q that maximizes max_EPS f(Q, EPS); let BEST_Q be that optimal Q
- Find the EPS that maximizes g(BEST_Q, EPS); let BEST_EPS be that optimal EPS
In other words:
- Take the Q that allows us (with a suitable EPS) to get the highest score when clustering the entire dataset.
- Take the EPS that gives the highest average score when clustering randomly chosen 5000-element subsets, with Q fixed at the value from the previous step.
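Here is a minimal sketch of these two steps with Optuna, assuming hypothetical helpers `cluster_with()` (runs the clustering pipeline and returns labels) and `score_on_subset()` (clusters one random 5000-element subset and returns its adjusted Rand index score); the search ranges and trial counts are illustrative, not the values actually used.

```python
import numpy as np
import optuna
from sklearn.metrics import adjusted_rand_score

def objective_q(trial):
    # Step 1: search Q (and EPS alongside it) and score on the entire dataset.
    q = trial.suggest_float("q", 0.0, 1.0)
    eps = trial.suggest_float("eps", 0.1, 5.0)      # illustrative range
    p = float("inf") if q == 0.0 else 1.0 / q       # Q = 1 / P
    labels = cluster_with(p, eps, all_image_paths)  # hypothetical helper
    return adjusted_rand_score(manual_labels, labels)

# Sharing an RDB storage (such as optimization/clustering.db) is what lets
# several Optuna workers optimize the same study in parallel.
study_q = optuna.create_study(direction="maximize",
                              study_name="optimize_q",
                              storage="sqlite:///optimization/clustering.db",
                              load_if_exists=True)
study_q.optimize(objective_q, n_trials=200)
best_q = study_q.best_params["q"]
best_p = float("inf") if best_q == 0.0 else 1.0 / best_q

def objective_eps(trial):
    # Step 2: with Q fixed at BEST_Q, average over 50 random 5000-element subsets.
    eps = trial.suggest_float("eps", 0.1, 5.0)
    return float(np.mean([score_on_subset(best_p, eps) for _ in range(50)]))

study_eps = optuna.create_study(direction="maximize")
study_eps.optimize(objective_eps, n_trials=100)
```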
You can see the optimization plots for step 1 and step 2 in the files `optimization/plots/optimization_q.html` and `optimization/plots/optimization_eps.html` respectively. You can also see the Optuna logs from these steps in the files `optimization/logs/optimization_q.log` and `optimization/logs/optimization_eps.log`.
Someone may be surprised that we evaluate Q on the entire dataset, but EPS on randomly chosen 5000-element subsets. The reason is:
The choice of a dissimilarity measure should result from the nature of the data rather than from the size of the test set, so there is no particular need to restrict the size of the dataset.
But the choice of the eps parameter of the DBSCAN algorithm should depend on the size of the test set (the smaller the test set, the larger the eps value should be).
The reason for this order of optimization (first Q, then EPS) is:
With Q fixed, it is very cheap to test many values of EPS, because we can calculate the distance matrix only once. As computing the distances is the most expensive part of the DBSCAN algorithm, this saves a lot of time.
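A minimal sketch of that reuse, under the assumption that the distance matrix is precomputed with SciPy and fed to DBSCAN (the actual notebook may organize this differently):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

P = 3.5779679634436112
X = np.random.rand(100, 64)  # stand-in for the flattened, padded images

# Expensive part: compute the Minkowski P-distance matrix once per value of P.
D = squareform(pdist(X, metric="minkowski", p=P))

# Cheap part: rerun DBSCAN on the precomputed matrix for many values of EPS.
for eps in np.linspace(0.5, 2.0, 30):
    labels = DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit_predict(D)
```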
The results of the optimization process are:

| hyperparameter | value |
| --- | --- |
| P | 3.5779679634436112 |
| EPS | 1.0694905187389605 |
which leads to the following average adjusted Rand index score on randomly chosen 5000-element subsets of the example dataset:

| metric | value |
| --- | --- |
| average adjusted Rand index score | 0.9567071769677712 |
You can see a visualization of an example outcome for these values of P and EPS on a 5000-element subset in the file `optimization/plots/clustering_example.png`. You can see a visualization of the correct clustering of this subset in the file `optimization/plots/clustering_correct.png`. You can also see a visualization of the manual clustering of the entire dataset in the file `optimization/plots/clustering_all_data.png`.
The Rand index scores in this example are:

| metric | value |
| --- | --- |
| Rand index score | 0.9955691138227646 |
| adjusted Rand index score | 0.9534206920986646 |