Skip to content

Commit

Permalink
Detect (new) Bloom/Ribbon Filter construction corruption (facebook#9342)
Browse files Browse the repository at this point in the history
Summary:
Note: rebase on and merge after facebook#9349, facebook#9345, (optional) facebook#9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.

Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](facebook#9073) are released properly despite corruption
   - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
   -  Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.

Pull Request resolved: facebook#9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
   - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true  -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
   -  FastLocalBloom
       - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
       - After change:
          -  `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
    -  Standard128Ribbon
       - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
       - After change:
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.

Reviewed By: pdillinger

Differential Revision: D33746928

Pulled By: hx235

fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
  • Loading branch information
hx235 authored and facebook-github-bot committed Feb 2, 2022
1 parent 7cd5763 commit 920386f
Show file tree
Hide file tree
Showing 21 changed files with 740 additions and 133 deletions.
3 changes: 3 additions & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,9 @@
* Disallow the combination of DBOptions.use_direct_io_for_flush_and_compaction == true and DBOptions.writable_file_max_buffer_size == 0. This combination can cause WritableFileWriter::Append() to loop forever, and it does not make much sense in direct IO.
* `ReadOptions::total_order_seek` no longer affects `DB::Get()`. The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably RocksDB 5.14.0.

## New Features
* Introduced an option `BlockBasedTableBuilder::detect_filter_construct_corruption` for detecting corruption during Bloom Filter (format_version >= 5) and Ribbon Filter construction.

## 6.29.0 (01/21/2022)
Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.
### Public API change
Expand Down
312 changes: 281 additions & 31 deletions db/db_bloom_filter_test.cc

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions db_stress_tool/db_stress_common.h
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,7 @@ DECLARE_bool(use_block_based_filter);
DECLARE_int32(ribbon_starting_level);
DECLARE_bool(partition_filters);
DECLARE_bool(optimize_filters_for_memory);
DECLARE_bool(detect_filter_construct_corruption);
DECLARE_int32(index_type);
DECLARE_string(db);
DECLARE_string(secondaries_base);
Expand Down
6 changes: 6 additions & 0 deletions db_stress_tool/db_stress_gflags.cc
Original file line number Diff line number Diff line change
Expand Up @@ -453,6 +453,12 @@ DEFINE_bool(
ROCKSDB_NAMESPACE::BlockBasedTableOptions().optimize_filters_for_memory,
"Minimize memory footprint of filters");

DEFINE_bool(
detect_filter_construct_corruption,
ROCKSDB_NAMESPACE::BlockBasedTableOptions()
.detect_filter_construct_corruption,
"Detect corruption during new Bloom Filter and Ribbon Filter construction");

DEFINE_int32(
index_type,
static_cast<int32_t>(
Expand Down
2 changes: 2 additions & 0 deletions db_stress_tool/db_stress_test_base.cc
Original file line number Diff line number Diff line change
Expand Up @@ -2272,6 +2272,8 @@ void StressTest::Open() {
block_based_options.partition_filters = FLAGS_partition_filters;
block_based_options.optimize_filters_for_memory =
FLAGS_optimize_filters_for_memory;
block_based_options.detect_filter_construct_corruption =
FLAGS_detect_filter_construct_corruption;
block_based_options.index_type =
static_cast<BlockBasedTableOptions::IndexType>(FLAGS_index_type);
block_based_options.prepopulate_block_cache =
Expand Down
31 changes: 31 additions & 0 deletions include/rocksdb/filter_policy.h
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,37 @@ class FilterBitsBuilder {
// The ownership of actual data is set to buf
virtual Slice Finish(std::unique_ptr<const char[]>* buf) = 0;

// Similar to Finish(std::unique_ptr<const char[]>* buf), except that
// for a non-null status pointer argument, it will point to
// Status::Corruption() when there is any corruption during filter
// construction or Status::OK() otherwise.
//
// WARNING: do not use a filter resulted from a corrupted construction
virtual Slice Finish(std::unique_ptr<const char[]>* buf,
Status* /* status */) {
return Finish(buf);
}

// Verify the filter returned from calling FilterBitsBuilder::Finish.
// The function returns Status::Corruption() if there is any corruption in the
// constructed filter or Status::OK() otherwise.
//
// Implementations should normally consult
// FilterBuildingContext::table_options.detect_filter_construct_corruption
// to determine whether to perform verification or to skip by returning
// Status::OK(). The decision is left to the FilterBitsBuilder so that
// verification prerequisites before PostVerify can be skipped when not
// configured.
//
// RocksDB internal will always call MaybePostVerify() on the filter after
// it is returned from calling FilterBitsBuilder::Finish
// except for FilterBitsBuilder::Finish resulting a corruption
// status, which indicates the filter is already in a corrupted state and
// there is no need to post-verify
virtual Status MaybePostVerify(const Slice& /* filter_content */) {
return Status::OK();
}

// Approximate the number of keys that can be added and generate a filter
// <= the specified number of bytes. Callers (including RocksDB) should
// only use this result for optimizing performance and not as a guarantee.
Expand Down
16 changes: 13 additions & 3 deletions include/rocksdb/table.h
Original file line number Diff line number Diff line change
Expand Up @@ -293,16 +293,16 @@ struct BlockBasedTableOptions {
// the memory, if block cache available.
//
// Charged memory usage includes:
// 1. (new) Bloom Filter and Ribbon Filter construction
// 1. Bloom Filter (format_version >= 5) and Ribbon Filter construction
// 2. More to come...
//
// Note:
// 1. (new) Bloom Filter and Ribbon Filter construction
// 1. Bloom Filter (format_version >= 5) and Ribbon Filter construction
//
// If additional temporary memory of Ribbon Filter uses up too much memory
// relative to the avaible space left in the block cache
// at some point (i.e, causing a cache full when strict_capacity_limit =
// true), construction will fall back to (new) Bloom Filter.
// true), construction will fall back to Bloom Filter.
//
// Default: false
bool reserve_table_builder_memory = false;
Expand Down Expand Up @@ -365,6 +365,16 @@ struct BlockBasedTableOptions {
// This must generally be true for gets to be efficient.
bool whole_key_filtering = true;

// If true, detect corruption during Bloom Filter (format_version >= 5)
// and Ribbon Filter construction.
//
// This is an extra check that is only
// useful in detecting software bugs or CPU+memory malfunction.
// Turning on this feature increases filter construction time by 30%.
//
// TODO: optimize this performance
bool detect_filter_construct_corruption = false;

// Verify that decompressing the compressed block gives back the input. This
// is a verification mode that we use to detect bugs in compression
// algorithms.
Expand Down
3 changes: 2 additions & 1 deletion options/options_settable_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,8 @@ TEST_F(OptionsSettableTest, BlockBasedTableOptionsAllFieldsSettable) {
"partition_filters=false;"
"optimize_filters_for_memory=true;"
"index_block_restart_interval=4;"
"filter_policy=bloomfilter:4:true;whole_key_filtering=1;"
"filter_policy=bloomfilter:4:true;whole_key_filtering=1;detect_filter_"
"construct_corruption=false;"
"reserve_table_builder_memory=false;"
"format_version=1;"
"hash_index_allow_collision=false;"
Expand Down
4 changes: 3 additions & 1 deletion options/options_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -856,7 +856,8 @@ TEST_F(OptionsTest, GetBlockBasedTableOptionsFromString) {
"block_size_deviation=8;block_restart_interval=4;"
"format_version=5;whole_key_filtering=1;"
"reserve_table_builder_memory=true;"
"filter_policy=bloomfilter:4.567:false;"
"filter_policy=bloomfilter:4.567:false;detect_filter_construct_"
"corruption=true;"
// A bug caused read_amp_bytes_per_bit to be a large integer in OPTIONS
// file generated by 6.10 to 6.14. Though bug is fixed in these releases,
// we need to handle the case of loading OPTIONS file generated before the
Expand All @@ -876,6 +877,7 @@ TEST_F(OptionsTest, GetBlockBasedTableOptionsFromString) {
ASSERT_EQ(new_opt.block_restart_interval, 4);
ASSERT_EQ(new_opt.format_version, 5U);
ASSERT_EQ(new_opt.whole_key_filtering, true);
ASSERT_EQ(new_opt.detect_filter_construct_corruption, true);
ASSERT_EQ(new_opt.reserve_table_builder_memory, true);
ASSERT_TRUE(new_opt.filter_policy != nullptr);
const BloomFilterPolicy* bfp =
Expand Down
17 changes: 16 additions & 1 deletion table/block_based/block_based_table_builder.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1252,6 +1252,15 @@ void BlockBasedTableBuilder::WriteRawBlock(const Slice& block_contents,
uint32_t checksum = ComputeBuiltinChecksumWithLastByte(
r->table_options.checksum, block_contents.data(), block_contents.size(),
/*last_byte*/ type);

if (block_type == BlockType::kFilter) {
Status s = r->filter_builder->MaybePostVerifyFilter(block_contents);
if (!s.ok()) {
r->SetStatus(s);
return;
}
}

EncodeFixed32(trailer.data() + 1, checksum);
TEST_SYNC_POINT_CALLBACK(
"BlockBasedTableBuilder::WriteRawBlock:TamperWithChecksum",
Expand Down Expand Up @@ -1552,7 +1561,13 @@ void BlockBasedTableBuilder::WriteFilterBlock(
std::unique_ptr<const char[]> filter_data;
Slice filter_content =
rep_->filter_builder->Finish(filter_block_handle, &s, &filter_data);
assert(s.ok() || s.IsIncomplete());

assert(s.ok() || s.IsIncomplete() || s.IsCorruption());
if (s.IsCorruption()) {
rep_->SetStatus(s);
break;
}

rep_->props.filter_size += filter_content.size();

// TODO: Refactor code so that BlockType can determine both the C++ type
Expand Down
5 changes: 5 additions & 0 deletions table/block_based/block_based_table_factory.cc
Original file line number Diff line number Diff line change
Expand Up @@ -356,6 +356,11 @@ static std::unordered_map<std::string, OptionTypeInfo>
{offsetof(struct BlockBasedTableOptions, whole_key_filtering),
OptionType::kBoolean, OptionVerificationType::kNormal,
OptionTypeFlags::kNone}},
{"detect_filter_construct_corruption",
{offsetof(struct BlockBasedTableOptions,
detect_filter_construct_corruption),
OptionType::kBoolean, OptionVerificationType::kNormal,
OptionTypeFlags::kNone}},
{"reserve_table_builder_memory",
{offsetof(struct BlockBasedTableOptions, reserve_table_builder_memory),
OptionType::kBoolean, OptionVerificationType::kNormal,
Expand Down
12 changes: 10 additions & 2 deletions table/block_based/filter_block.h
Original file line number Diff line number Diff line change
Expand Up @@ -83,9 +83,17 @@ class FilterBlockBuilder {
,
Status* status, std::unique_ptr<const char[]>* filter_data = nullptr) = 0;

// It is for releasing the memory usage and cache reservation of filter bits
// builder in FullFilter and PartitionedFilter
// This is called when finishes using the FilterBitsBuilder
// in order to release memory usage and cache reservation
// associated with it timely
virtual void ResetFilterBitsBuilder() {}

// To optionally post-verify the filter returned from
// FilterBlockBuilder::Finish.
// Return Status::OK() if skipped.
virtual Status MaybePostVerifyFilter(const Slice& /* filter_content */) {
return Status::OK();
}
};

// A FilterBlockReader is used to parse filter from SST table.
Expand Down
Loading

0 comments on commit 920386f

Please sign in to comment.