Description
(The title is intentionally rather general even though the report seems fairly specific, since I'm not 100% sure whether this is something that could be fixed on the rustc side or the LLVM side, or whether my analysis is even correct; regardless, it was a bit unexpected to me that a function taking a 16-byte enum has more overhead than one taking other 16-byte types, so I filed it as a bug.)
While I was profiling a program that makes somewhat heavy use of indirect function calls that can't be inlined, I noticed that a non-trivial amount of time was spent on `movups`/`movaps` instructions copying a 16-byte enum argument that was being passed by pointer.
I believe I've reduced the slowdown to a function that takes a 16-byte enum as an argument (which rustc passes by pointer), plus its caller, which constructs the value on the stack:
```rust
#[inline(never)]
pub fn byptr(val: Result<u64, u32>) {
    std::hint::black_box(val); // force a load of the parameter, which is passed by pointer
}

pub fn test() {
    byptr(Ok(1));
}
```
Here's a full reproducer that wraps the call in a tight loop and can be run to show the slowdown, compared against another function that takes a different 16-byte type, also passed by pointer, which doesn't have this issue. Running that locally, the first version (taking the `Result`) consistently performs about 5x worse than the second (375ms vs 70ms) on an AMD Ryzen 3 PRO 3200G.
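In case the reproducer isn't handy, here is a rough sketch of what it looks like (my reconstruction from the description above, not the original code; `byptr_tuple`, the iteration count, and the tuple values are placeholders):

```rust
use std::time::Instant;

#[inline(never)]
pub fn byptr(val: Result<u64, u32>) {
    std::hint::black_box(val); // 16-byte enum argument, passed by pointer
}

#[inline(never)]
pub fn byptr_tuple(val: (u32, u32, u32, u32)) {
    std::hint::black_box(val); // a different 16-byte type, also passed by pointer
}

fn main() {
    const ITERS: u64 = 200_000_000; // arbitrary iteration count

    let start = Instant::now();
    for _ in 0..ITERS {
        byptr(Ok(1)); // the caller re-materializes the argument on the stack each iteration
    }
    println!("Result<u64, u32>:     {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..ITERS {
        byptr_tuple((1, 2, 3, 4));
    }
    println!("(u32, u32, u32, u32): {:?}", start.elapsed());
}
```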
Now, this is more or less an educated guess as to the cause (I don't have a machine at hand that can run `perf` with hardware counters), but looking at the asm: as mentioned above, the callee uses a `movups` to load the 16-byte enum into `xmm0`:
```asm
example::byptr::hf156bd35b01c5a6e:
        movups  xmm0, xmmword ptr [rdi]
```
At the call site, however, to initialize that `Ok(1)` value on the stack, it does a 32-bit store at `[rdi]` for the discriminant and a separate 64-bit store at `[rdi + 8]` for the payload:
```asm
example::test::h8fec37f4e1b51032:
        ...
        mov     qword ptr [rsp + 16], 1
        mov     dword ptr [rsp + 8], 0
        lea     rdi, [rsp + 8]
```
Could the slowdown come from a failed store-to-load forward? As far as I know, this pattern of building a large value with two smaller stores and then loading the whole value at once is a case that store-to-load forwarding can't handle, since there's no entry in the store buffer with a matching start address and a greater-or-equal size, so the load stalls and performance degrades.
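To make the access pattern concrete, here is the same shape written out with raw pointers (purely my own illustration of the memory traffic shown in the asm above; the optimizer is of course free to rewrite it):

```rust
// Illustration only: two narrow stores build up a 16-byte slot, then a single
// wide load reads the whole slot. No single store-buffer entry covers the
// entire load, so store-to-load forwarding presumably can't service it and the
// load has to wait for the stores to be committed.
pub fn split_stores_then_wide_load() -> u128 {
    let mut slot = [0u8; 16];
    let p = slot.as_mut_ptr();
    unsafe {
        (p as *mut u32).write_unaligned(0); // 4-byte store: the Ok discriminant
        (p.add(8) as *mut u64).write_unaligned(1); // 8-byte store: the payload
        (p as *const u128).read_unaligned() // 16-byte load spanning both stores
    }
}
```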
If I use `(u32, u32, u32, u32)` instead of `Result<u64, u32>`, both sides use a 128-bit store/load via `movups`/`movaps`, where store-to-load forwarding presumably succeeds, and as the repro above shows this is significantly faster.
Likewise, changing the function to explicitly take a `&Result<u64, u32>` and initializing the value manually with a `movups` makes it fast again, as does changing the function to load the value with two `mov`s. So basically, making sure the stores and loads match up in size makes the perf degradation go away.
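For illustration, here is roughly what I mean by the two-`mov` variant (a sketch rather than the exact code I tried; it relies on the current, unstable layout of `Result<u64, u32>`, i.e. a 4-byte tag at offset 0 and the payload at offset 8, and assumes an `Ok` value so the payload bytes are initialized):

```rust
#[inline(never)]
pub fn byptr_two_loads(val: Result<u64, u32>) {
    // Read the tag and the payload with two loads whose sizes match the two
    // stores on the caller side, instead of one 16-byte load.
    // Layout-dependent; illustration only.
    let p = &val as *const Result<u64, u32> as *const u8;
    let tag = unsafe { (p as *const u32).read_unaligned() }; // matches the 4-byte store
    let payload = unsafe { (p.add(8) as *const u64).read_unaligned() }; // matches the 8-byte store
    std::hint::black_box((tag, payload));
}
```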
So my question would be: is there something preventing `Result<u64, u32>` from just being passed by value as an `i128` instead of by memory? A very quick bisection on godbolt shows that it did do that up until 1.60 for the code above; in 1.61 it started passing it by pointer. Interestingly, `Result<u64, u64>` does get passed by value.
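For comparison, this is the by-value case (same shape as `byptr` above, just with a `u64` error type; the name is mine):

```rust
#[inline(never)]
pub fn byptr_u64(val: Result<u64, u64>) {
    std::hint::black_box(val); // Result<u64, u64> is currently passed by value, not through memory
}
```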
Or is this something that would be better fixed on the LLVM side, e.g. by not splitting up the stores/loads like that? Assuming that this actually is the issue.
Meta
```
rustc 1.89.0-nightly (cdd545be1 2025-06-07)
binary: rustc
commit-hash: cdd545be1b4f024d38360aa9f000dcb782fbc81b
commit-date: 2025-06-07
host: x86_64-unknown-linux-gnu
release: 1.89.0-nightly
LLVM version: 20.1.5
```
(but it reproduces with any rustc since 1.61)