Description
(The title is intentionally rather general even though the report seems fairly specific, since I'm not 100% sure whether this is something that could be fixed on the rustc side or the LLVM side, or whether my analysis is even correct; regardless, it was a bit unexpected to me that a function taking a 16-byte enum has more overhead than one taking other 16-byte types, so I filed it as a bug.)
While I was profiling a program that makes somewhat heavy use of indirect function calls that can't be inlined, I noticed that a non-trivial amount of time was spent on `movups`/`movaps` instructions copying a 16-byte enum argument that was being passed by pointer.
I believe I've reduced the slowdown to a function that takes a 16-byte enum as an argument (which rustc passes by pointer), plus its caller, which constructs the value on the stack:
```rust
#[inline(never)]
pub fn byptr(val: Result<u64, u32>) {
    std::hint::black_box(val); // force a load of the parameter, which is passed by pointer
}

pub fn test() {
    byptr(Ok(1));
}
```
Here's a full reproducer that wraps the call in a tight loop and can be run to show the slowdown, compared against another function that takes a different 16-byte type, also passed by pointer, which doesn't have this issue. Running that locally, the first version (taking the `Result`) consistently performs about 5x worse than the second (375ms vs 70ms) on an AMD Ryzen 3 PRO 3200G.
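In case the reproducer isn't handy, here is a rough sketch of what it looks like (my reconstruction from the description above, not the original code; `byptr_tuple`, the iteration count, and the tuple values are placeholders):

```rust
use std::time::Instant;

#[inline(never)]
pub fn byptr(val: Result<u64, u32>) {
    std::hint::black_box(val); // 16-byte enum argument, passed by pointer
}

#[inline(never)]
pub fn byptr_tuple(val: (u32, u32, u32, u32)) {
    std::hint::black_box(val); // a different 16-byte type, also passed by pointer
}

fn main() {
    const ITERS: u64 = 200_000_000; // arbitrary iteration count

    let start = Instant::now();
    for _ in 0..ITERS {
        byptr(Ok(1)); // the caller re-materializes the argument on the stack each iteration
    }
    println!("Result<u64, u32>:     {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..ITERS {
        byptr_tuple((1, 2, 3, 4));
    }
    println!("(u32, u32, u32, u32): {:?}", start.elapsed());
}
```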
Now, this is more or less an educated guess as to the cause (I don't have a machine at hand that can run `perf` with hardware counters), but looking at the asm: as mentioned above, the callee uses a `movups` to load the 16-byte enum into `xmm0`:
```asm
example::byptr::hf156bd35b01c5a6e:
        movups  xmm0, xmmword ptr [rdi]
```
At the call site, however, to initialize that `Ok(1)` value on the stack, it does a 32-bit store at `[rdi]` for the discriminant and a separate 64-bit store at `[rdi + 8]` for the payload:
```asm
example::test::h8fec37f4e1b51032:
        ...
        mov     qword ptr [rsp + 16], 1
        mov     dword ptr [rsp + 8], 0
        lea     rdi, [rsp + 8]
```
Could the slowdown come from a failed store-to-load forward? As far as I know, this pattern of building a large value with two smaller stores and then loading the whole value at once is a case that store-to-load forwarding can't handle, since there's no entry in the store buffer with a matching start address and a greater-or-equal size, so the load stalls and performance degrades.
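To make the access pattern concrete, here is the same shape written out with raw pointers (purely my own illustration of the memory traffic shown in the asm above; the optimizer is of course free to rewrite it):

```rust
// Illustration only: two narrow stores build up a 16-byte slot, then a single
// wide load reads the whole slot. No single store-buffer entry covers the
// entire load, so store-to-load forwarding presumably can't service it and the
// load has to wait for the stores to be committed.
pub fn split_stores_then_wide_load() -> u128 {
    let mut slot = [0u8; 16];
    let p = slot.as_mut_ptr();
    unsafe {
        (p as *mut u32).write_unaligned(0); // 4-byte store: the Ok discriminant
        (p.add(8) as *mut u64).write_unaligned(1); // 8-byte store: the payload
        (p as *const u128).read_unaligned() // 16-byte load spanning both stores
    }
}
```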
If I use `(u32, u32, u32, u32)` instead of `Result<u64, u32>`, both sides use a 128-bit store/load via `movups`/`movaps`, where store-to-load forwarding presumably succeeds, and as the repro above shows this is significantly faster.
Likewise, changing the function to explicitly take a `&Result<u64, u32>` and initializing the value manually with a `movups` makes it fast again, as does changing the function to load the value with two `mov`s. So basically, making sure the stores and loads match up in size makes the perf degradation go away.
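For illustration, here is roughly what I mean by the two-`mov` variant (a sketch rather than the exact code I tried; it relies on the current, unstable layout of `Result<u64, u32>`, i.e. a 4-byte tag at offset 0 and the payload at offset 8, and assumes an `Ok` value so the payload bytes are initialized):

```rust
#[inline(never)]
pub fn byptr_two_loads(val: Result<u64, u32>) {
    // Read the tag and the payload with two loads whose sizes match the two
    // stores on the caller side, instead of one 16-byte load.
    // Layout-dependent; illustration only.
    let p = &val as *const Result<u64, u32> as *const u8;
    let tag = unsafe { (p as *const u32).read_unaligned() }; // matches the 4-byte store
    let payload = unsafe { (p.add(8) as *const u64).read_unaligned() }; // matches the 8-byte store
    std::hint::black_box((tag, payload));
}
```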
So my question would be: is there something preventing `Result<u64, u32>` from just being passed by value as an `i128` instead of by memory? A very quick bisection on godbolt shows that it did do that up until 1.60 for the code above; in 1.61 it started passing it by pointer. Interestingly, `Result<u64, u64>` does get passed by value.
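For comparison, this is the by-value case (same shape as `byptr` above, just with a `u64` error type; the name is mine):

```rust
#[inline(never)]
pub fn byptr_u64(val: Result<u64, u64>) {
    std::hint::black_box(val); // Result<u64, u64> is currently passed by value, not through memory
}
```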
Or is this something that would be better fixed on the LLVM side, e.g. by not splitting up the stores/loads like that? Assuming that this actually is the issue.
Meta
```
rustc 1.89.0-nightly (cdd545be1 2025-06-07)
binary: rustc
commit-hash: cdd545be1b4f024d38360aa9f000dcb782fbc81b
commit-date: 2025-06-07
host: x86_64-unknown-linux-gnu
release: 1.89.0-nightly
LLVM version: 20.1.5
```
(but it reproduces with any rustc since 1.61)