Bolt

Bolt is an algorithm for compressing vectors of real-valued data and running mathematical operations directly on the compressed representations.

If you have a large collection of mostly-dense vectors and can tolerate lossy compression, Bolt can probably save you 10-200x space and compute time.

Bolt also has theoretical guarantees bounding the errors in its approximations.

EDIT: this repo now also features the source code for MADDNESS, our shiny new algorithm for approximate matrix multiplication. MADDNESS has no Python wrapper yet, and is referred to as "mithral" in the source code. The name changed because apparently I'm the only one who gets Lord of the Rings references. MADDNESS runs ridiculously fast and, under reasonable assumptions, requires zero multiply-adds. Realistically, it'll be most useful for speeding up neural net inference on CPUs, but it'll take another couple of papers to get it there; we need to generalize it to convolution and write the CUDA kernels to allow GPU training. See also the poster and slides.

EDIT2: Looking for a research project? See our list of ideas.

EDIT3: See Build.md for a working dockerfile that builds and runs Bolt, contributed by @mneilly.

An example Mithral implementation with Python Bindings

Setup

sudo docker build .  -t $USER/extension-script:latest
sudo docker run -v /home/$USER/:/home/$USER --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -dit $USER/extension-script:latest
# -v   to share local files
# --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined   to allow debugging
# On mac use /Users/$USER instead of /home/$USER

# With VS Code, running as a notebook
In the VS Code Docker tab, right-click the container and choose "Attach Visual Studio Code".
  Inside the container, open the bolt folder itself (NOT a devcontainer, code workspace, etc.).
  If you built the container remotely, you'll want to add that Docker context to your local machine.
Now you can run the "build_bazel_mithral_dbg" task; the config is in .vscode/launch.json (you can make a shortcut for it).
Then connect the C++ debugger to the Python process with the debug config "Mithral Debug Attach gdb to py proc", also in .vscode/launch.json.

# As a script
sudo docker exec -it $CONTAINER_ID bash
cd cpp
bazel build :mithral_wrapped --strip=never  # --strip=never is for debugging; the current build options (line 21) are for prod
# Install the Cifar datasets (see experiments/README.md)
cd ../experiments
python graph_mithral.py

Python Binding Notes

An example notebook showing sped-up multiplications on the Cifar dataset is experiments/graph_mithral.py. See experiments/README.md for installing the datasets.

The C++ code does not learn hyperparameters (split indices, split values, centroids, etc.); the Python code does. Using Pybind11, we copy those parameters over, then encode both X and Q inside C++, do the multiplication, and write the result to a memory view Pybind11 can access. Mithral preserves the ordering of the reconstructed outputs better than their magnitudes. The Cifar numbers are based on using Mithral for only the last layer: with 2 codebooks it is 10-15x faster with answers changed 4% of the time on Cifar-10, or 17-20x faster with 0.2% of answers changed on the training data.
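For reference, "answers changed" means the predicted class (argmax of the last-layer outputs) differs between the exact and approximate multiplies. Here is a minimal sketch of that metric, using random stand-in logits rather than the actual Cifar pipeline:

import numpy as np

# stand-in logits; in the real experiment these would be the exact and
# Mithral-approximated outputs of the network's last layer
rng = np.random.default_rng(0)
exact = rng.normal(size=(10000, 10))                      # Cifar-10: 10 classes
approx = exact + rng.normal(scale=0.1, size=exact.shape)  # simulated approximation error

# fraction of examples whose predicted class changes under approximation
changed = np.mean(exact.argmax(axis=1) != approx.argmax(axis=1))
print("answers changed: {:.1%}".format(changed))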

The main C++ files:
cpp/src/quantize/mithral.hpp, mithral.cpp, profile_amm.hpp  # the Mithral implementation
cpp/mithral_wrapped.cpp  # the binding code

The current binding uses floats for scales and offsets; lower precision may work better. See mithral.hpp: mithral_input_type_traits.output_type; // uint8 or uint16 both work

mithral_scan_test_zipped can run by itself and is a simpler, unoptimized version of mithral_scan (line 1243).

NOTE: All code below refers to the Python wrapper for Bolt and has nothing to do with MADDNESS. It also seems to no longer build for many people. If you want to use MADDNESS, see the Python implementation driven by amm_main.py, or the C++ implementation. All the code is ugly, but the Python code should be pretty easy to add new AMM methods/variations to.

Installing

Docker Image

Python

  $ brew install swig  # for wrapping C++; use apt-get, yum, etc., if not on OS X
  $ pip install numpy  # bolt installation needs numpy already present
  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt && python setup.py install
  $ pytest tests/  # optionally, run the tests

If you run into any problems, please don't hesitate to mention it in the Python build problems issue.

C++

Install Bazel, Google's open-source build system. Then

  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt/cpp && bazel run :main

The bazel run command will build the project and run the tests and benchmarks.

If you want to integrate Bolt with another C++ project, include cpp/src/include/public.hpp and add the remaining files under cpp/src to your builds. Let me know if you're interested in doing such an integration; I'm hoping to see Bolt become part of many libraries, so I'd be happy to help you.

Notes

Bolt currently only supports machines with AVX2 instructions, which basically means x86 machines from fall 2013 or later. Contributions for ARM support are welcome. Also note that the Bolt Python wrapper is currently configured to require Clang, since GCC apparently runs into issues.
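A quick way to check for AVX2 support on Linux is to look at the CPU flags (on Intel Macs, `sysctl machdep.cpu.leaf7_features` reports the same information):

# Linux-only: check whether this machine's CPU advertises AVX2
with open("/proc/cpuinfo") as f:
    print("AVX2 supported:", "avx2" in f.read())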

How does it work?

Bolt is based on vector quantization. For details, see the Bolt paper or slides.
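For intuition only, here is a toy numpy sketch of the underlying idea (product quantization plus lookup tables). It picks random centroids for brevity, whereas Bolt learns them with k-means and uses optimized SIMD scans; this is not Bolt's actual implementation.

import numpy as np

# Split each vector into subvectors, quantize each subvector to one of
# K centroids, then answer dot products with table lookups and adds.
rng = np.random.default_rng(0)
N, D, ncodebooks, K = 1000, 64, 4, 16
sublen = D // ncodebooks

X = rng.normal(size=(N, D))

# "fit": pick K centroids per subspace (random rows here; Bolt uses k-means)
centroids = np.stack([X[rng.choice(N, K, replace=False),
                        c * sublen:(c + 1) * sublen] for c in range(ncodebooks)])

# "encode": each row of X becomes ncodebooks small codes
codes = np.empty((N, ncodebooks), dtype=np.uint8)
for c in range(ncodebooks):
    sub = X[:, c * sublen:(c + 1) * sublen]
    dists = ((sub[:, None, :] - centroids[c][None]) ** 2).sum(axis=-1)
    codes[:, c] = dists.argmin(axis=1)

# "transform": build one lookup table per codebook for a query q; each
# approximate dot product is then just ncodebooks lookups and adds
q = rng.normal(size=D)
luts = np.stack([centroids[c] @ q[c * sublen:(c + 1) * sublen]
                 for c in range(ncodebooks)])
approx = luts[np.arange(ncodebooks), codes].sum(axis=1)
print(np.corrcoef(approx, X @ q)[0, 1])  # close to 1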

Benchmarks

Bolt includes a thorough set of speed and accuracy benchmarks. See the experiments/ directory. This is also where to look if you want to reproduce the results in the paper.

Note that all of the timing results use the raw C++ implementation. At present, the Python wrapper is slightly slower due to Python overhead. If you're interested in having a full-speed wrapper, let me know and I'll allocate time to making this happen.

Basic usage

import bolt
import numpy as np

# X is an N x D matrix; queries is an iterable of length-D vectors
X = np.random.randn(1000, 64)
queries = np.random.randn(20, 64)

# these are approximately equal (though the latter are shifted and scaled)
enc = bolt.Encoder(reduction='dot').fit(X)
[np.dot(X, q) for q in queries]
[enc.transform(q) for q in queries]

# same for these
enc = bolt.Encoder(reduction='l2').fit(X)
[np.sum((X - q) * (X - q), axis=1) for q in queries]
[enc.transform(q) for q in queries]

# but enc.transform() is 10x faster or more

Example: Matrix-vector multiplies

import bolt
import numpy as np
from scipy.stats import pearsonr as corr
from sklearn.datasets import load_digits
import timeit

# for simplicity, use the sklearn digits dataset; we'll split
# it into a matrix X and a set of queries Q
X, _ = load_digits(return_X_y=True)
nqueries = 20
X, Q = X[:-nqueries], X[-nqueries:]

enc = bolt.Encoder(reduction='dot', accuracy='lowest') # can tweak acc vs speed
enc.fit(X)

dot_corrs = np.empty(nqueries)
for i, q in enumerate(Q):
    dots_true = np.dot(X, q)
    dots_bolt = enc.transform(q)
    dot_corrs[i] = corr(dots_true, dots_bolt)[0]

# dot products closely preserved despite compression
print("dot product correlation: {} +/- {}".format(
    np.mean(dot_corrs), np.std(dot_corrs)))  # > .97

# massive space savings
print(X.nbytes)  # 1777 rows * 64 cols * 8B = 909KB
print(enc.nbytes)  # 1777 * 2B = 3.55KB

# massive time savings (~10x here, but often >100x on larger
# datasets with less Python overhead; see the paper)
t_np = timeit.Timer(
    lambda: [np.dot(X, q) for q in Q]).timeit(5)        # ~9ms
t_bolt = timeit.Timer(
    lambda: [enc.transform(q) for q in Q]).timeit(5)    # ~800us
print "Numpy / BLAS time, Bolt time: {:.3f}ms, {:.3f}ms".format(
    t_np * 1000, t_bolt * 1000)

# can get output without offset/scaling if needed
dots_bolt = [enc.transform(q, unquantize=True) for q in Q]

Example: K-Nearest Neighbor / Maximum Inner Product Search

# search using squared Euclidean distances
# (still using the Digits dataset from above)
k_bolt = 10  # number of neighbors to return
enc = bolt.Encoder('l2', accuracy='high').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

# search using dot product (maximum inner product search)
enc = bolt.Encoder('dot', accuracy='medium').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

Miscellaneous

Bolt stands for "Based On Lookup Tables". Feel free to use this exciting fact at parties.
