docs/metadata-design.md (+14 -14)
@@ -1,6 +1,6 @@
# Design Complex Structure On Rocksdb
-
kvrocks use the rocksdb as storage, it's developed by facebook which built on LevelDB with many extra features supports, like column family, transaction, backup, see the rocksdb wiki: [Features Not In LevelDB](https://github.com/facebook/rocksdb/wiki/Features-Not-in-LevelDB). the basic operations in rocksdb are `Put(key, value)`, `Get(key)`, `Delete(key)`, other complex structures weren't supported. the main goal of this doc would explain how we built the Redis hash/list/set/zset/bitmap on rocksdb. most of the design was derived from Qihoo360 `Blackwidow`, but with little modified, like the bitmap design, it's really interesting part.
+
kvrocks use the rocksdb as storage, it's developed by facebook which built on LevelDB with many extra features supports, like column family, transaction, backup, see the rocksdb wiki: [Features Not In LevelDB](https://github.com/facebook/rocksdb/wiki/Features-Not-in-LevelDB). the basic operations in rocksdb are `Put(key, value)`, `Get(key)`, `Delete(key)`, other complex structures weren't supported. the main goal of this doc would explain how we built the Redis hash/list/set/zset/bitmap on rocksdb. most of the design was derived from Qihoo360 `Blackwidow`, but with little modified, like the bitmap design, it's a really interesting part.
we call the value of the key the metadata here; for a hash key, the metadata includes:
-`flags` like the string, the field was used to tell which type of this key
-
-`expire ` is same with string type, record the expire time
+
-`expire ` is same as the string type, record the expire time
-`version` is used to accomplish fast delete when the number of sub keys/values grew bigger
-`size` records the number of sub keys/values in this hash key
@@ -49,23 +49,23 @@ key|version|field => | value |
+---------------+
```
-
we prepend the hash `key` and `version` before the hash field, the value of `version` was from the metdata. for exmple, when request `hget h1 f1` was received, kvrocks would fetch the metadata by hash key(here is `h1`) first, and concat the hash key, version, field as new key, then fetch the value with new key.
+
we prepend the hash `key` and `version` before the hash field, the value of `version` was from the metadata. for example, when the request `hget h1 f1` was received, kvrocks would fetch the metadata by hash key(here is `h1`) first, and concat the hash key, version, field as new key, then fetch the value with new key.
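
the lookup path described here can be sketched roughly as follows. this is only an illustration, not kvrocks code: the two dictionaries stand in for the metadata and sub key column families, and the `|` separator, the `Metadata` class and the helper names are assumptions made for the example. a real encoding would need something like length prefixes so that a key containing the separator cannot collide with another key.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Metadata:
    """Hypothetical in-memory form of the hash metadata value."""
    version: int
    size: int = 0


# Two dicts stand in for the metadata and sub key column families.
metadata_cf: Dict[bytes, Metadata] = {}
subkey_cf: Dict[bytes, bytes] = {}


def compose_subkey(key: bytes, version: int, field: bytes) -> bytes:
    # Assumed layout: key|version|field. The separator is illustrative only.
    return key + b"|" + str(version).encode() + b"|" + field


def hset(key: bytes, field: bytes, value: bytes) -> None:
    # Create the metadata on first write, then store the field under the composed key.
    meta = metadata_cf.setdefault(key, Metadata(version=1))
    subkey = compose_subkey(key, meta.version, field)
    if subkey not in subkey_cf:
        meta.size += 1
    subkey_cf[subkey] = value


def hget(key: bytes, field: bytes) -> Optional[bytes]:
    meta = metadata_cf.get(key)  # 1. fetch the metadata by hash key
    if meta is None:
        return None
    subkey = compose_subkey(key, meta.version, field)  # 2. build the new key
    return subkey_cf.get(subkey)  # 3. fetch the value with the new key
```

with this sketch, `hset(b"h1", b"f1", b"v1")` followed by `hget(b"h1", b"f1")` goes through the same two reads as the `hget h1 f1` request in the text.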
-
***Question1: why store version in metadata***
+
***Question1: why store version in the metadata***
> we store each hash sub key/value as a single key-value pair, so one hash key may hold millions of sub keys/values. if the user deletes this key, kvrocks must iterate over millions of sub keys/values and delete them, which would cause a performance problem. with the version we can delete just the metadata quickly and then recycle the other keys/values in background compaction threads. the cost is that those tombstone keys take up some disk storage. you can regard the version as an atomically incremented number, but it's combined with a timestamp.
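
a minimal sketch of this idea, under stated assumptions: sub keys are modelled as `(key, version, field)` tuples, the version is a plain counter (the timestamp part mentioned above is left out), and `compact()` is a hypothetical stand-in for the background compaction pass. none of these names come from kvrocks.

```python
import itertools
from typing import Dict, Optional, Tuple

metadata_cf: Dict[bytes, int] = {}  # hash key -> current version
subkey_cf: Dict[Tuple[bytes, int, bytes], bytes] = {}

_counter = itertools.count(1)


def new_version() -> int:
    # Assumption: a plain increasing counter; the timestamp part is omitted here.
    return next(_counter)


def hset(key: bytes, field: bytes, value: bytes) -> None:
    if key not in metadata_cf:
        metadata_cf[key] = new_version()
    subkey_cf[(key, metadata_cf[key], field)] = value


def hget(key: bytes, field: bytes) -> Optional[bytes]:
    version = metadata_cf.get(key)
    if version is None:
        return None
    return subkey_cf.get((key, version, field))


def delete(key: bytes) -> None:
    # Constant-time delete: only the metadata is removed. The sub keys stay
    # on disk but can no longer be reached through any live version.
    metadata_cf.pop(key, None)


def compact() -> int:
    # Drop sub keys whose version no longer matches the live metadata.
    stale = [k for k in subkey_cf if metadata_cf.get(k[0]) != k[1]]
    for k in stale:
        del subkey_cf[k]
    return len(stale)
```

if the key is created again after `delete`, it gets a new version, so the leftover sub keys from the old version are never visible and only wait for `compact()` to reclaim their space.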
-
***Question2: what can we do if the user key was conflicted with composed key?***
+
***Question2: what can we do if the user key was conflicted with the composed key?***
> we store the metadata key and composed key in different column families, so it wouldn't happen
## Set
-
Redis set can be regared as hash with value of sub-key always be null, the metadata was same with the hash:
+
Redis set can be regarded as a hash, with the value of sub-key always be null, the metadata was same with the hash:
```shell
+----------+------------+-----------+-----------+
@@ -86,7 +86,7 @@ key|version|member => | NULL |
#### list metadata
-
Redis list also organized by metadata and sub keys-values, and sub key is index instead of user key. metadata like below:
+
Redis list also organized by metadata and sub keys-values, and sub key is index instead of the user key. metadata like below:
-`head` was used to indicate the start position of list head
-
-`tail` was used to indicate the stop position of list tail
+
-`head` was used to indicate the start position of the list head
+
-`tail` was used to indicate the stop position of the list tail
-
the meaning of other fields were same with other types, just add extra head/tail to record the boundary of list.
+
the meaning of other fields was the same as other types, just add extra head/tail to record the boundary of the list.
#### list sub keys-values
-
the sub key in list was compsed by list key,version,index, and index was calculated from metadata's head or tail. for example, when user request the `rpush list elem`, kvrocks would fetch fetch the metadata with list key first, and generate the sub key with list key, version, and tail, simply increase the tail, then write the medata and sub key value back to rocksdb.
+
the subkey in list was composed by list key,version,index, index was calculated from metadata's head or tail. for example, when the user requests the `rpush list elem`, kvrocks would fetch the metadata with list key first, and generate the subkey with list key, version, and tail, simply increase the tail, then write the metadata and subkey's value back to rocksdb.
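
the `rpush` flow can be sketched as below. this is an illustration under assumptions, not the kvrocks implementation: the starting index `START`, the tuple form of the sub key and the `lindex` helper (non-negative indexes only) are hypothetical, and `tail` is treated as the next free position as the text describes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ListMetadata:
    version: int
    head: int  # index of the first element
    tail: int  # index where the next rpush will write
    size: int = 0


START = 1 << 63  # assumed midpoint, so head could also grow to the left

metadata_cf: Dict[bytes, ListMetadata] = {}
subkey_cf: Dict[Tuple[bytes, int, int], bytes] = {}


def rpush(key: bytes, elem: bytes) -> int:
    meta = metadata_cf.get(key)  # fetch the metadata with the list key
    if meta is None:
        meta = ListMetadata(version=1, head=START, tail=START)
    subkey_cf[(key, meta.version, meta.tail)] = elem  # sub key: key, version, tail
    meta.tail += 1  # simply increase the tail
    meta.size += 1
    metadata_cf[key] = meta  # write the metadata back
    return meta.size


def lindex(key: bytes, i: int) -> Optional[bytes]:
    meta = metadata_cf.get(key)
    if meta is None or not 0 <= i < meta.size:
        return None
    return subkey_cf.get((key, meta.version, meta.head + i))
```

because the index is derived from `head` or `tail`, a push touches only the metadata and one sub key, however long the list is.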
```shell
+---------------+
@@ -112,7 +112,7 @@ key|version|index => | value |
## ZSet
-
Redis zset was set with sorted property, so it's a little different with other types. it must be able to search with member, as well as retrieve members with score range.
+
Redis zset was set with sorted property, so it's a little different with other types. it must be able to search with the member, as well as retrieve members with score range.
if user want to get the score of member or check the member exists or not, it would try first one.
+
if the user wants to get the score of the member or check the member exists or not, it would try first one.
## Bitmap
-
Redis bitmap is the most interesting part in kvrocks design, while unlike other types, it's not subkey and the value would be very large if the user treats it as a sparse array. it's apparent that the things would break down if store the bitmap into a single value, so we should break the bitmap value into multi fragments. another behavior of bitmap is the position would write always arbitrary, it's very similar to access model of Linux virtual memory, so the idea of the bitmap design came from that.
+
Redis bitmap is the most interesting part in kvrocks design, while unlike other types, it's not subkey and the value would be very large if the user treats it as a sparse array. it's apparent that the things would break down if store the bitmap into a single value, so we should break the bitmap value into multi fragments. another behavior of bitmap is the position would write always arbitrary, it's very similar to the access model of Linux virtual memory, so the idea of the bitmap design came from that.
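
to make the virtual memory analogy concrete, here is a rough sketch of splitting a bitmap into fixed-size fragments that are only allocated when a bit inside them is written, much like pages. everything specific in it is an assumption for illustration: the fragment size `FRAGMENT_BYTES`, the `(key, fragment index)` form of the sub key, and the bit order inside a byte are not taken from the text above.

```python
from typing import Dict, Tuple

FRAGMENT_BYTES = 1024  # assumed fragment size, chosen only for this sketch
FRAGMENT_BITS = FRAGMENT_BYTES * 8

# (key, fragment index) -> fragment bytes; a missing fragment reads as all zeros.
fragments: Dict[Tuple[bytes, int], bytearray] = {}


def locate(offset: int) -> Tuple[int, int, int]:
    # Split a bit offset into (fragment index, byte in fragment, bit in byte),
    # the same way an address splits into a page number and a page offset.
    frag_index, bit_in_frag = divmod(offset, FRAGMENT_BITS)
    byte_in_frag, bit_in_byte = divmod(bit_in_frag, 8)
    return frag_index, byte_in_frag, bit_in_byte


def setbit(key: bytes, offset: int, value: int) -> int:
    frag_index, byte_i, bit_i = locate(offset)
    # Allocate the fragment on first write only.
    frag = fragments.setdefault((key, frag_index), bytearray(FRAGMENT_BYTES))
    old = (frag[byte_i] >> bit_i) & 1
    if value:
        frag[byte_i] |= 1 << bit_i
    else:
        frag[byte_i] &= ~(1 << bit_i) & 0xFF
    return old  # previous bit value


def getbit(key: bytes, offset: int) -> int:
    frag_index, byte_i, bit_i = locate(offset)
    frag = fragments.get((key, frag_index))
    return 0 if frag is None else (frag[byte_i] >> bit_i) & 1
```

in this sketch, setting a single bit at a very large offset stores one fragment instead of one huge value, which is the point of breaking the bitmap up.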