Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1 #5407

nakagawa-fj · 2025-07-29T10:09:15Z

This pull request improves multi-threaded GEMM performance on Neoverse V1, addressing issue #5347.
It differs from the A64FX fix in pull request #5353 in that it targets matrices with N=2.
Although the change primarily improves performance for N=2, some architectures may see further gains for N up to 6. To support this, a new macro, GEMM_DIVIDE_LIMIT, has been introduced to control the DIVIDE_RATE threshold.
With this change, GEMM performance on AWS Graviton3E (Neoverse V1) improves for N=2, as shown in the graph attached to this pull request.
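
To make the role of a limit macro concrete, here is a minimal sketch of one way such a threshold could work. It is not the code from this pull request. The default values, the helpers `effective_divide_rate` and `chunk_width`, and the assumption that the column range is left unsplit (rate 1) when N is at or below the limit are all hypothetical and chosen only for illustration.

```c
/* Illustrative sketch only: the values and helpers below are assumptions,
 * not taken from the OpenBLAS sources or from this pull request. */
#ifndef DIVIDE_RATE
#define DIVIDE_RATE 2            /* assumed default split factor */
#endif
#ifndef GEMM_DIVIDE_LIMIT
#define GEMM_DIVIDE_LIMIT 2      /* assumed threshold on N */
#endif

/* Hypothetical helper: use a split factor of 1 for small N,
 * otherwise fall back to the configured DIVIDE_RATE. */
static int effective_divide_rate(long n) {
    return (n <= GEMM_DIVIDE_LIMIT) ? 1 : DIVIDE_RATE;
}

/* Hypothetical helper: width of each column chunk after splitting N
 * by the effective rate, rounded up so that all columns are covered. */
static long chunk_width(long n) {
    int rate = effective_divide_rate(n);
    return (n + rate - 1) / rate;
}
```

Under these assumed values, `chunk_width(2)` keeps both columns in one chunk, while `chunk_width(7)` splits the range into chunks of at most 4 columns. Raising `GEMM_DIVIDE_LIMIT` to 6 at compile time would extend the unsplit behaviour to the larger N values mentioned above.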

Multi-thread GEMM Performance Improvement on NeoverseV1 (DIVIDE_RATE=1)

7e29f11

martin-frbg added this to the 0.3.31 milestone Jul 30, 2025

martin-frbg merged commit d23680b into OpenMathLib:develop Jul 30, 2025
77 of 88 checks passed
