swiss: optimize bit clear #29

thepudds · 2024-02-27T19:06:57Z

When clearing a bit in our bitset in the inner loop of Get, Put, and Delete, we don't need to pass the index of the least significant bit we just found -- we can instead just clear the least significant bit.

Using the idiom of b = b & (b - 1) is simpler and faster than the current bit clearing code. This compiles down to a BLSRQ when building with GOAMD64=v3. For GOARM=7, it looks like it compiles to a LEAQ -1(AX), CX followed by ANDQ CX, AX.

Running the benchmarks seemed to show a geomean improvement of -1.80% sec/op on a GCE n1-standard-16 (with GOAMD64=v3 for both old and new), but it looked a bit noisy so I don't currently entirely trust that number.

When clearing a bit in our bitset in the inner loop of Get, Put, and Delete, we don't need to pass the index of the least significant bit we just found -- we can instead just clear the least significant bit. Using the idiom of `b = b & (b - 1)` is simpler and faster than the current bit clearing code. This compiles down to a `BLSRQ` when building with `GOAMD64=v3`. For `GOARM=7`, it looks like it compiles to a `LEAQ -1(AX), CX` followed by `ANDQ CX, AX`. Running the benchmarks seemed to show a geomean improvement of -1.80% sec/op on a GCE n1-standard-16 (with `GOAMD64=v3` for both old and new), but it looked a bit noisy so I don't currently entirely trust that number.

cockroach-teamcity · 2024-02-27T19:07:04Z

This change is

thepudds · 2024-02-27T19:18:31Z

geomean improvement of -1.80% sec/op [...] but it looked a bit noisy so I don't currently entirely trust that number.

Hi @petermattis, I can re-run the benchmarks maybe with a higher -count in the hopes of a less noisy result, but it takes a while to do that, so I was first curious if you are interested in targeted optimizations like this and my earlier #28.

No worries of course if you are not ready for contributions from random people.

To introduce myself briefly -- I was one of the people commenting on golang/go#54766, and I had an earlier SwissTable prototype implementation. These two PRs are things I had noticed in my own prototype. (I had mostly set my version aside when I had concluded I would need to do some more experimentations with different layouts like placing the control bytes with the KVs in each group to perhaps better align with performance expectations people might have built up based on the current runtime map layout... which is part of what was prompting me to ask those questions in golang/go#54766 (comment) last weekend).

Finally, thank you for taking the time to comment your implementation so clearly, and I'm more than thrilled to see your excellent work here looks like it might be en route to the Go runtime!

petermattis

Thanks for the contributions. I'm good with targeted optimization PRs like this. Ack that running with a sufficiently high -count to eliminate noise is onerous.

Reviewable status: 0 of 2 files reviewed, all discussions resolved

petermattis · 2024-02-28T12:31:37Z

Thanks for the contribution.

thepudds · 2024-02-28T12:49:40Z

Hi @petermattis, thanks!

One more quick suggestion is to consider using GOAMD64=v3 go test -c or similar for whenever you next update the benchmarks on the README. (In the inner loop, it also gives a more optimized bits.TrailingZeros64 in bitset.first).

petermattis · 2024-02-28T12:56:31Z

One more quick suggestion is to consider using GOAMD64=v3 go test -c or similar for whenever you next update the benchmarks on the README. (In the inner loop, it also gives a more optimized bits.TrailingZeros64 in bitset.first).

Thanks for the suggestion. TIL about the GOAMD64 env var.

An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.

An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.

petermattis approved these changes Feb 27, 2024

View reviewed changes

petermattis merged commit e7c549e into cockroachdb:main Feb 28, 2024
8 of 9 checks passed

thepudds mentioned this pull request Feb 28, 2024

bench: use GOAMD64=v3, increase benchmark count, interleave runs thepudds/cockroachdb-swiss#3

Open

thepudds mentioned this pull request Feb 28, 2024

bench: use GOAMD64=v3, increase benchmark count, interleave runs #31

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

swiss: optimize bit clear #29

swiss: optimize bit clear #29

thepudds commented Feb 27, 2024

cockroach-teamcity commented Feb 27, 2024

thepudds commented Feb 27, 2024

petermattis left a comment

petermattis commented Feb 28, 2024

thepudds commented Feb 28, 2024 •

edited

Loading

petermattis commented Feb 28, 2024

swiss: optimize bit clear #29

swiss: optimize bit clear #29

Conversation

thepudds commented Feb 27, 2024

cockroach-teamcity commented Feb 27, 2024

thepudds commented Feb 27, 2024

petermattis left a comment

Choose a reason for hiding this comment

petermattis commented Feb 28, 2024

thepudds commented Feb 28, 2024 • edited Loading

petermattis commented Feb 28, 2024

thepudds commented Feb 28, 2024 •

edited

Loading