Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swiss: optimize bit clear #29

Merged
merged 1 commit into from
Feb 28, 2024

Conversation

thepudds
Copy link
Contributor

When clearing a bit in our bitset in the inner loop of Get, Put, and Delete, we don't need to pass the index of the least significant bit we just found -- we can instead just clear the least significant bit.

Using the idiom of b = b & (b - 1) is simpler and faster than the current bit clearing code. This compiles down to a BLSRQ when building with GOAMD64=v3. For GOARM=7, it looks like it compiles to a LEAQ -1(AX), CX followed by ANDQ CX, AX.

Running the benchmarks seemed to show a geomean improvement of -1.80% sec/op on a GCE n1-standard-16 (with GOAMD64=v3 for both old and new), but it looked a bit noisy so I don't currently entirely trust that number.

When clearing a bit in our bitset in the inner loop of Get, Put, and
Delete, we don't need to pass the index of the least significant bit we
just found -- we can instead just clear the least significant bit.

Using the idiom of `b = b & (b - 1)` is simpler and faster than the
current bit clearing code. This compiles down to a `BLSRQ` when building
with `GOAMD64=v3`. For `GOARM=7`, it looks like it compiles to a `LEAQ
-1(AX), CX` followed by `ANDQ CX, AX`.

Running the benchmarks seemed to show a geomean improvement of -1.80%
sec/op on a GCE n1-standard-16 (with `GOAMD64=v3` for both old and new),
but it looked a bit noisy so I don't currently entirely trust that
number.
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@thepudds
Copy link
Contributor Author

geomean improvement of -1.80% sec/op [...] but it looked a bit noisy so I don't currently entirely trust that number.

Hi @petermattis, I can re-run the benchmarks maybe with a higher -count in the hopes of a less noisy result, but it takes a while to do that, so I was first curious if you are interested in targeted optimizations like this and my earlier #28.

No worries of course if you are not ready for contributions from random people.

To introduce myself briefly -- I was one of the people commenting on golang/go#54766, and I had an earlier SwissTable prototype implementation. These two PRs are things I had noticed in my own prototype. (I had mostly set my version aside when I had concluded I would need to do some more experimentations with different layouts like placing the control bytes with the KVs in each group to perhaps better align with performance expectations people might have built up based on the current runtime map layout... which is part of what was prompting me to ask those questions in golang/go#54766 (comment) last weekend).

Finally, thank you for taking the time to comment your implementation so clearly, and I'm more than thrilled to see your excellent work here looks like it might be en route to the Go runtime!

Copy link
Contributor

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for the contributions. I'm good with targeted optimization PRs like this. Ack that running with a sufficiently high -count to eliminate noise is onerous.

Reviewable status: 0 of 2 files reviewed, all discussions resolved

@petermattis
Copy link
Contributor

Thanks for the contribution.

@petermattis petermattis merged commit e7c549e into cockroachdb:main Feb 28, 2024
8 of 9 checks passed
@thepudds
Copy link
Contributor Author

thepudds commented Feb 28, 2024

Hi @petermattis, thanks!

One more quick suggestion is to consider using GOAMD64=v3 go test -c or similar for whenever you next update the benchmarks on the README. (In the inner loop, it also gives a more optimized bits.TrailingZeros64 in bitset.first).

@petermattis
Copy link
Contributor

One more quick suggestion is to consider using GOAMD64=v3 go test -c or similar for whenever you next update the benchmarks on the README. (In the inner loop, it also gives a more optimized bits.TrailingZeros64 in bitset.first).

Thanks for the suggestion. TIL about the GOAMD64 env var.

thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set GOAMD64 or GOARM environment
variables based on the platform to help use more modern CPU
instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable
based on the platform to help use more modern CPU instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable
based on the platform to help use more modern CPU instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable
based on the platform to help use more modern CPU instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
thepudds added a commit to thepudds/cockroachdb-swiss that referenced this pull request Feb 28, 2024
An individual benchmark run can be "unlucky" in terms of how many
overflow buckets there are for the runtime map, or how skewed the
occupancy is for groups in swiss.Map, and so on.

To help with noise and hopefully make it easier to compare results
across different commits, we update bench.sh to do 20 runs per benchmark
rather than 10, but reduce the time per benchmark from 1 second to 0.5
seconds to keep the total time roughly the same. More samples also gives
more grist to the statistical mill for benchstat to identify outliers,
and of course halves the impact of each sample.

We also interleave the swiss.Map and runtime map runs. This can help
reduce the impact of background processes, or "noisy neighbors" in a
public cloud environment, etc. We run swiss.Map first given it is
usually the more interesting data (e.g., in case someone wants to
spot check initial results or ^C early, etc.).

As a follow up to cockroachdb#29, we also set the GOAMD64 environment variable
based on the platform to help use more modern CPU instructions.

While we are here, we also fix the munging of the results to account for
the new benchmark name format that was introduced in cockroachdb#22.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants