This page describes how to reproduce the experimental results reported in Bolt's KDD paper.
To run the experiments, you will first need to obtain the following tools / libraries and datasets.
- Bazel, Google's open-source build system
- Joblib - for caching function output
- Scikit-learn - for k-means
- Kmc2 - for k-means seeding
- Pandas - for storing results and reading in data
- Seaborn - for plotting, if you want to reproduce our figures
- Sift1M - 1 million SIFT descriptors. This page also hosts the Sift10M, Sift100M, Sift1B, and Gist1M datasets (not used in the paper)
- Convnet1M - 1 million convnet image descriptors
- MNIST - 60,000 images of handwritten digits; perhaps the most widely used dataset in machine learning
- LabelMe - 22,000 GIST descriptors of images
Additional datasets not used in the paper:
- Glove - 1 million word embeddings generated using Glove
- Gist - 1 million GIST descriptors. Available from the same source as Sift1M.
- Deep1M - PCA projections of deep descriptors for 1 million images
Note that, in order to avoid the temptation to cherry pick, we did not run experiments on these until after deciding to use only the first four datasets in the paper.
The timing scripts available include:
- `time_bolt.sh`: Computes Bolt's time/throughput when encoding data vectors and queries, scanning through a dataset given an already-encoded query, and answering queries (encode query + scan).
- `time_pq_opq.sh`: The same as `time_bolt.sh`, but the results are for Product Quantization (PQ) and Optimized Product Quantization (OPQ). There is no OPQ scan because the algorithm is identical to PQ once the dataset and query are both encoded.
- `time_popcount.sh`: Computes the time/throughput for Hamming distance computation using the `popcount` instruction (see the sketch after this list).
- `time_matmul_square.sh`: Computes the time/throughput for matrix multiplies using square matrices.
- `time_matmul_tall.sh`: Computes the time/throughput for matrix multiplies using tall, skinny matrices.
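For intuition about what the popcount benchmark measures, here is a minimal Python sketch of Hamming-distance computation over packed binary codes. It is illustrative only: the real benchmark is C++ and uses the hardware popcount instruction, and the code length chosen here is an arbitrary assumption.

```python
import numpy as np

# Binary codes packed into 64-bit words; 2 words = 16-byte codes (arbitrary choice)
rng = np.random.default_rng(0)
ncodes, words_per_code = 100_000, 2
database = rng.integers(0, 2**63, size=(ncodes, words_per_code), dtype=np.uint64)
query = rng.integers(0, 2**63, size=words_per_code, dtype=np.uint64)

# Hamming distance = number of set bits in (code XOR query).
# The C++ benchmark counts bits with the popcount instruction; here we
# emulate it by unpacking the XOR result into individual bits and summing.
xor = database ^ query
dists = np.unpackbits(xor.view(np.uint8), axis=1).sum(axis=1)
print(dists[:5])
```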
To reproduce the timing experiments using these scripts:
- Run the appropriate shell script in this directory. E.g.:

$ bash time_bolt.sh

- Since we want to be certain that the compiler unrolls all the relevant loops and inlines code equally for all algorithms, the number of bytes used in encodings and the lengths of vectors are constants at the tops of the corresponding files. To assess all the combinations of encoding / vector lengths used in the paper, you will have to modify these constants and rerun the tests multiple times. Ideally, this would be automated or handled via template parameters, but it isn't; pull requests welcome.
Test | File | Constants and Experimental Values |
---|---|---|
time_bolt.sh | cpp/test/quantize/profile_bolt.cpp | M = {8,16,32}, ncols={64, 128, 256, 512, 1024} |
time_pq_opq.sh | cpp/test/quantize/profile_pq.cpp | M = {8,16,32}, ncols={64, 128, 256, 512, 1024} |
time_popcount.sh | cpp/test/quantize/profile_multicodebook.cpp | M = {8,16,32} |
time_matmul_square.sh | cpp/test/quantize/profile_bolt.cpp | sizes = [64, 128, 256, 512, 1024, 2048, 4096, 8192] |
time_matmul_tall.sh | cpp/test/quantize/profile_bolt.cpp | sizes = [1, 16, 32, 64, 128, 256, 512, 1024, 2048] |
The `sizes` for the `time_matmul_*` scripts only need to be modified once, since the code iterates over all provided sizes; the tests are currently set to run only a subset of the sizes so that one can verify everything runs without having to wait for minutes.
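Conceptually, these scripts just time dense matrix multiplies across a list of sizes. The following NumPy sketch is only an illustration of that loop, not the C++ benchmark itself:

```python
import time
import numpy as np

sizes = [64, 128, 256, 512, 1024]  # subset of the sizes used in the paper

for n in sizes:
    A = np.random.randn(n, n).astype(np.float32)
    B = np.random.randn(n, n).astype(np.float32)
    t0 = time.perf_counter()
    _ = A @ B
    elapsed = time.perf_counter() - t0
    print(f"n={n}: {elapsed * 1e3:.2f} ms, ~{2 * n**3 / elapsed / 1e9:.1f} GFLOPS")
```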
Unlike the timing experiments, the accuracy experiments depend on the aforementioned datasets. We do not own these datasets, so instead of distributing them ourselves, we provide code to convert them to the necessary formats and precompute the true nearest neighbors.
After downloading any dataset from the above links, start an IPython notebook:
cd python/
ipython notebook &
then open `munge_data.ipynb` and execute the code under the dataset's heading. This code will save a query matrix, a database matrix, and a matrix of ground-truth nearest neighbor indices. For some datasets, it will also save a training database matrix. Computing the true nearest neighbors will take a while for datasets in which these neighbors are not already provided.
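As a rough sketch of what precomputing the true nearest neighbors entails (the notebook's actual code may differ; the file names and neighbor count below are made up):

```python
import numpy as np

def true_neighbors(database, queries, k=1000):
    """Brute-force squared-L2 nearest neighbors: for each query, return the
    indices of the k closest database rows, closest first."""
    # ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2; ||q||^2 is constant per query,
    # so it can be dropped without changing the ranking.
    dists = (database ** 2).sum(axis=1)[np.newaxis, :] - 2 * queries @ database.T
    return np.argsort(dists, axis=1)[:, :k]

database = np.load('sift1m_database.npy')   # hypothetical file names
queries = np.load('sift1m_queries.npy')
np.save('sift1m_truth.npy', true_neighbors(database, queries))
```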
Once you have the above matrices saved for the datasets, open `python/datasets.py` and fill in the appropriate path constants so that the matrices can be loaded on your machine.
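For example, the filled-in constants might look something like the following; the names and locations here are hypothetical, so use whatever constants `datasets.py` actually defines:

```python
import os

# Hypothetical values; point these at wherever you saved the converted datasets.
DATA_DIR = os.path.expanduser('~/datasets/bolt')
SIFT_1M_DIR = os.path.join(DATA_DIR, 'sift1m')
CONVNET_1M_DIR = os.path.join(DATA_DIR, 'convnet1m')
MNIST_DIR = os.path.join(DATA_DIR, 'mnist')
LABELME_DIR = os.path.join(DATA_DIR, 'labelme')
```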
To compute the correlations between the approximate and true dot products:
$ bash correlations_dotprod.sh
To compute the correlations between the approximate and true squared Euclidean distances:
$ bash correlations_l2.sh
To compute the recall@R:
$ bash recall_at_r.sh
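For clarity, here is a small sketch of the two metrics these scripts report, computed from hypothetical arrays; the actual experiment code lives in the python/ scripts invoked above:

```python
import numpy as np

def correlation(true_vals, approx_vals):
    """Pearson correlation between true and approximate dot products
    (or squared L2 distances) for one query."""
    return np.corrcoef(true_vals, approx_vals)[0, 1]

def recall_at_r(true_nn, approx_ranking, r):
    """Fraction of queries whose true nearest neighbor appears among the
    top r candidates returned by the approximate method."""
    hits = [true_nn[i] in approx_ranking[i, :r] for i in range(len(true_nn))]
    return float(np.mean(hits))
```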
All raw results are in the `results/` directory. `summary.csv` files store aggregate metrics, and `all_results.csv` files store metrics for individual queries. The latter are present mostly so that we can easily plot confidence intervals with Seaborn.
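For instance, a per-query results file can be plotted with confidence intervals along these lines; the file path and column names here are assumptions, not the actual schema:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('results/recall_at_r/all_results.csv')  # hypothetical path
# Seaborn aggregates the repeated per-query rows into a mean line with a CI band.
sns.lineplot(data=df, x='R', y='recall', hue='algorithm')
plt.show()
```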
At present, Bolt has only been tested with Clang on OS X. The timing / throughput results in the paper are based on compiling with Xcode, not Bazel, although this shouldn't make a difference.
Feel free to contact us with any and all questions.