Proximity Map implementation with support for incremental edits. #8686

Open
wants to merge 37 commits into
base: main
from

nicktobey
Contributor

@nicktobey nicktobey commented Dec 17, 2024

Based on #8408, now with additional functionality for incremental changes to indexes.

This is a large-scale PR merging several features into main, all designed for supporting vector indexes.

Vector Index Nodes

1defec9 adds a new message/node type: the vector index node. This message stores a node in a Merkle tree index whose structure is based on a distance measure in a multi-dimensional space: at each level, keys are arranged such that each key is closer to its parent key than to any other key in the parent node.
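
As a rough illustration of the arrangement rule above, the sketch below assigns each key to whichever candidate parent key is nearest. The names `squaredL2` and `closestParent` and the choice of squared Euclidean distance are assumptions made for this example; they are not taken from the PR.

```go
package main

import "fmt"

// squaredL2 returns the squared Euclidean distance between two vectors of
// equal length. The distance measure is an assumption for this sketch.
func squaredL2(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// closestParent returns the index of the parent key nearest to key.
// Ties go to the lowest index. It returns -1 if parents is empty.
func closestParent(key []float64, parents [][]float64) int {
	best := -1
	var bestDist float64
	for i, p := range parents {
		d := squaredL2(key, p)
		if best == -1 || d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

func main() {
	parents := [][]float64{{0, 0}, {10, 10}}
	keys := [][]float64{{1, 2}, {9, 8}, {4, 4}}
	for _, k := range keys {
		fmt.Printf("key %v -> parent %v\n", k, parents[closestParent(k, parents)])
	}
}
```

Because a key's placement depends only on distances, nothing in this rule bounds how many keys end up under a single parent.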

One consequence of this design is that it's not possible to put a hard limit on the number of keys contained in each node. We can control the mean node size, but there's always a non-zero chance that a node will be large enough to break our usual encoding scheme (which uses 16-bit ints to store message offsets). To address this, the vector index node uses 32-bit ints to store message offsets instead of the 16 bits used by other node types.

Proximity Map

A ProximityMap is a new implementation of Dolt's Map, a data structure built on Merkle trees that maps key bytestrings to value bytestrings. The ProximityMap is backed by a tree of vector index nodes, allowing it to perform an approximate nearest neighbor search.
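
One way to picture an approximate nearest neighbor search over such a tree is a greedy descent that follows the closest key at each level. The sketch below is only that: the `node` type, `nearestApprox`, `exampleTree`, and the squared Euclidean distance are all hypothetical and are not the search implemented in this PR.

```go
package main

import "fmt"

// node is a hypothetical in-memory tree node: a key vector plus children.
// Leaves have no children.
type node struct {
	key      []float64
	children []*node
}

// squaredL2 returns the squared Euclidean distance between two vectors.
func squaredL2(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// nearestApprox descends from root, always following the child whose key
// is closest to query, and returns the key of the leaf it reaches.
// Only one branch per level is explored, so the result is approximate.
func nearestApprox(root *node, query []float64) []float64 {
	cur := root
	for len(cur.children) > 0 {
		best := cur.children[0]
		bestDist := squaredL2(query, best.key)
		for _, c := range cur.children[1:] {
			if d := squaredL2(query, c.key); d < bestDist {
				best, bestDist = c, d
			}
		}
		cur = best
	}
	return cur.key
}

// exampleTree builds a two-level tree with two internal keys.
func exampleTree() *node {
	leaf := func(x, y float64) *node { return &node{key: []float64{x, y}} }
	a := &node{key: []float64{0, 0}, children: []*node{leaf(0, 0), leaf(1, 1), leaf(4.9, 4.9)}}
	b := &node{key: []float64{10, 10}, children: []*node{leaf(10, 10), leaf(6, 6)}}
	return &node{children: []*node{a, b}}
}

func main() {
	root := exampleTree()
	fmt.Println(nearestApprox(root, []float64{1, 2}))
	// The query below is slightly closer to {10, 10} than to {0, 0}, so the
	// search never visits the leaf {4.9, 4.9} even though it is the true
	// nearest neighbor. This is the sense in which the search is approximate.
	fmt.Println(nearestApprox(root, []float64{5.1, 5.1}))
}
```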

Proximity Maps resemble other Prolly Maps, but have the following invariants:

  • Each key must be convertible to a vector. Typically, the key is a val.Tuple, and the vector is the first value in that tuple.
  • The keys are arranged in the tree such that, for each of a key's parent keys (the keys that appear on the path from the root to the key), the key is closer to that parent key than to any of the parent key's siblings.
  • The keys in a node are sorted lexicographically (note that this is not necessarily the same ordering as the tuple that the key represents), except for the first key, which matches its direct parent.
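
The third invariant can be sketched as a small check over a node's keys. This assumes "lexicographically" means byte-wise comparison of the serialized keys; the function name `keysSortedAfterFirst` is hypothetical and does not appear in the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// keysSortedAfterFirst reports whether every key except the first is in
// ascending byte order. The first key is exempt because, per the invariant,
// it mirrors the node's direct parent key rather than taking part in the sort.
func keysSortedAfterFirst(keys [][]byte) bool {
	if len(keys) < 2 {
		return true
	}
	rest := keys[1:]
	return sort.SliceIsSorted(rest, func(i, j int) bool {
		return bytes.Compare(rest[i], rest[j]) < 0
	})
}

func main() {
	ok := [][]byte{[]byte("m"), []byte("a"), []byte("c"), []byte("x")}
	bad := [][]byte{[]byte("m"), []byte("c"), []byte("a")}
	fmt.Println(keysSortedAfterFirst(ok), keysSortedAfterFirst(bad))
}
```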

Notably, while the keys of an individual node are sorted, walking all of a vector index's keys in standard iteration order does not visit them in sorted order.

28b7065 and 6b91635 contain the bulk of the ProximityMap implementation.

The bulk of the changes are in the three commits named above (1defec9, 28b7065, and 6b91635). Each of the other commits is a smaller, self-contained change necessary to support vector indexes.

@nicktobey nicktobey force-pushed the nicktobey/proximity-map2 branch from 43fd1e4 to 08f51c5 Compare December 17, 2024 20:27
@nicktobey nicktobey requested review from zachmu and reltuk December 17, 2024 20:48
@nicktobey nicktobey force-pushed the nicktobey/proximity-map2 branch 3 times, most recently from bff6950 to a189e19 Compare January 3, 2025 20:16
@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
a189e19 ok 5937457
version total_tests
a189e19 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
dc05656 ok 5937457
version total_tests
dc05656 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
ce3b325 ok 5937457
version total_tests
ce3b325 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
7f6b0fc ok 5937457
version total_tests
7f6b0fc 5937457
correctness_percentage
100.0

@nicktobey nicktobey force-pushed the nicktobey/proximity-map2 branch from 7f6b0fc to e712abf Compare January 3, 2025 23:05
@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
e712abf ok 5937457
version total_tests
e712abf 5937457
correctness_percentage
100.0

@nicktobey nicktobey force-pushed the nicktobey/proximity-map2 branch from e712abf to 3d20dd6 Compare January 4, 2025 01:01
@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
3d20dd6 ok 5937457
version total_tests
3d20dd6 5937457
correctness_percentage
100.0

@nicktobey nicktobey force-pushed the nicktobey/proximity-map2 branch from 3d20dd6 to eea16a4 Compare January 6, 2025 01:22
@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
eea16a4 ok 5937457
version total_tests
eea16a4 5937457
correctness_percentage
100.0

@nicktobey nicktobey force-pushed the nicktobey/proximity-map2 branch from eea16a4 to 3f75e65 Compare January 6, 2025 05:37
@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
3f75e65 ok 5937457
version total_tests
3f75e65 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
a73a2ce ok 5937457
version total_tests
a73a2ce 5937457
correctness_percentage
100.0

Member

@zachmu zachmu left a comment

LGTM, although the new types in prolly could use a little more documentation on how they fit together; it was initially confusing.

@reltuk should take a look at the files in that package as well; it's not my area of expertise.

go/serial/schema.fbs (outdated, resolved)
go/serial/vectorindexnode.fbs (outdated, resolved)
go/serial/vectorindexnode.fbs (outdated, resolved)
go/libraries/doltcore/sqle/tables.go (outdated, resolved)
go/store/prolly/proximity_map_test.go (outdated, resolved)
go/store/prolly/proximity_map_test.go (outdated, resolved)
go/store/prolly/proximity_map_test.go (resolved)
mustRebuild bool
}

func (f ProximityFlusher) visitNode(
Member

This probably deserves a comment describing the algorithm at a high level and documenting the params.

Contributor Author

Done.

go/store/prolly/proximity_mutable_map.go (outdated, resolved)
@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
d844f69 ok 5937457
version total_tests
d844f69 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
84f0b04 ok 5937457
version total_tests
84f0b04 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
81d82e9 ok 5937457
version total_tests
81d82e9 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
3fb884f ok 5937457
version total_tests
3fb884f 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
ae76256 ok 5937457
version total_tests
ae76256 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
e6c433e ok 5937457
version total_tests
e6c433e 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
acbf25b ok 5937457
version total_tests
acbf25b 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
d4041dd ok 5937457
version total_tests
d4041dd 5937457
correctness_percentage
100.0

3 participants