Open
Description
Just want to point out that dot4 can be improved with the dpps
instruction (which I just discovered). It requires SSE4.1 (99.84% of CPUs in the Steam Hardware Survey, April 2025).
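pub fn dot4(v0: Vec, v1: Vec) Vec {
    // imm8 = 0xff: multiply all four lanes, sum the products,
    // and write the sum to every lane of xmm0.
    return asm (
        \\dpps $0xff, %xmm1, %xmm0
        : [ret] "={xmm0}" (-> Vec), // output
        : [v0] "{xmm0}" (v0), // inputs
          [v1] "{xmm1}" (v1),
    );
}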
I haven't tested how much faster it is, if at all...
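For anyone who wants to sanity-check the asm version before benchmarking it, here is a rough portable sketch of what dpps with the 0xff mask should leave in the destination register: the four lane products summed, with the sum copied to every lane. The name dot4Reference is made up for this example, Vec is assumed to be @Vector(4, f32), and the single-argument @splat assumes Zig 0.11 or later.

```zig
const std = @import("std");

// Assumption: Vec is a 4-lane f32 vector, matching the xmm registers used above.
const Vec = @Vector(4, f32);

// Hypothetical reference implementation, not part of the library.
fn dot4Reference(v0: Vec, v1: Vec) Vec {
    const products = v0 * v1; // lane-wise multiply
    const sum = @reduce(.Add, products); // horizontal add of the four products
    return @splat(sum); // broadcast the scalar to all four lanes
}

test "reference dot4 broadcasts the dot product" {
    const r = dot4Reference(Vec{ 1, 2, 3, 4 }, Vec{ 5, 6, 7, 8 });
    // 1*5 + 2*6 + 3*7 + 4*8 = 70, exactly representable in f32
    try std.testing.expectEqual(@as(f32, 70), r[0]);
    try std.testing.expectEqual(@as(f32, 70), r[3]);
}
```

Running this with zig test and comparing its output against the dpps version on the same inputs would be one way to confirm both agree. For arbitrary floats the two may differ in the last bits because the summation order is not guaranteed to match, so a tolerance-based comparison is probably safer than exact equality.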