-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
swiss: optimize bit clear #29
Conversation
When clearing a bit in our bitset in the inner loop of Get, Put, and Delete, we don't need to pass the index of the least significant bit we just found -- we can instead just clear the least significant bit. Using the idiom of `b = b & (b - 1)` is simpler and faster than the current bit clearing code. This compiles down to a `BLSRQ` when building with `GOAMD64=v3`. For `GOARM=7`, it looks like it compiles to a `LEAQ -1(AX), CX` followed by `ANDQ CX, AX`. Running the benchmarks seemed to show a geomean improvement of -1.80% sec/op on a GCE n1-standard-16 (with `GOAMD64=v3` for both old and new), but it looked a bit noisy so I don't currently entirely trust that number.
Hi @petermattis, I can re-run the benchmarks maybe with a higher No worries of course if you are not ready for contributions from random people. To introduce myself briefly -- I was one of the people commenting on golang/go#54766, and I had an earlier SwissTable prototype implementation. These two PRs are things I had noticed in my own prototype. (I had mostly set my version aside when I had concluded I would need to do some more experimentations with different layouts like placing the control bytes with the KVs in each group to perhaps better align with performance expectations people might have built up based on the current runtime map layout... which is part of what was prompting me to ask those questions in golang/go#54766 (comment) last weekend). Finally, thank you for taking the time to comment your implementation so clearly, and I'm more than thrilled to see your excellent work here looks like it might be en route to the Go runtime! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contributions. I'm good with targeted optimization PRs like this. Ack that running with a sufficiently high -count
to eliminate noise is onerous.
Reviewable status: 0 of 2 files reviewed, all discussions resolved
Thanks for the contribution. |
Hi @petermattis, thanks! One more quick suggestion is to consider using |
Thanks for the suggestion. TIL about the |
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment variables based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
An individual benchmark run can be "unlucky" in terms of how many overflow buckets there are for the runtime map, or how skewed the occupancy is for groups in swiss.Map, and so on. To help with noise and hopefully make it easier to compare results across different commits, we update bench.sh to do 20 runs per benchmark rather than 10, but reduce the time per benchmark from 1 second to 0.5 seconds to keep the total time roughly the same. More samples also gives more grist to the statistical mill for benchstat to identify outliers, and of course halves the impact of each sample. We also interleave the swiss.Map and runtime map runs. This can help reduce the impact of background processes, or "noisy neighbors" in a public cloud environment, etc. We run swiss.Map first given it is usually the more interesting data (e.g., in case someone wants to spot check initial results or ^C early, etc.). As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable based on the platform to help use more modern CPU instructions. While we are here, we also fix the munging of the results to account for the new benchmark name format that was introduced in cockroachdb#22.
When clearing a bit in our bitset in the inner loop of Get, Put, and Delete, we don't need to pass the index of the least significant bit we just found -- we can instead just clear the least significant bit.
Using the idiom of
b = b & (b - 1)
is simpler and faster than the current bit clearing code. This compiles down to aBLSRQ
when building withGOAMD64=v3
. ForGOARM=7
, it looks like it compiles to aLEAQ -1(AX), CX
followed byANDQ CX, AX
.Running the benchmarks seemed to show a geomean improvement of -1.80% sec/op on a GCE n1-standard-16 (with
GOAMD64=v3
for both old and new), but it looked a bit noisy so I don't currently entirely trust that number.