Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1 #5407
+18
−6
This pull request provides a performance improvement for Neoverse V1, addressing Issue #5347.
It differs from the fix in pull request #5353 for A64FX, focusing on matrix size N=2.
While this change primarily enhances performance for N=2, there's potential for further gains up to N=6 on certain architectures. To support this, a new macro, `GEMM_DIVIDE_LIMIT`, has been introduced to manage the `DIVIDE_RATE` threshold.
This modification has shown performance improvements for GEMM operations on AWS Graviton3E (Neoverse V1) when N=2, as illustrated in the graph below.
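For readers unfamiliar with the idea, here is a minimal sketch of how a limit macro could gate the divide rate. It is not the code from this pull request: the helper `effective_divide_rate`, the default values, and the direction of the comparison are assumptions made for illustration, based only on the description above (N at or below the limit uses a divide rate of 1).

```c
/* Illustrative defaults only (assumptions); real values would come from
   the target's build configuration. */
#ifndef DIVIDE_RATE
#define DIVIDE_RATE 2
#endif

#ifndef GEMM_DIVIDE_LIMIT
#define GEMM_DIVIDE_LIMIT 2   /* could be raised, e.g. up to 6, per target */
#endif

/* Hypothetical helper: pick the divide rate for a GEMM with n columns.
   Small n (at or below the limit) uses a rate of 1, so the work is not
   split further; larger n keeps the compile-time DIVIDE_RATE. */
int effective_divide_rate(long n)
{
    return (n <= GEMM_DIVIDE_LIMIT) ? 1 : DIVIDE_RATE;
}
```

With the defaults assumed here, `effective_divide_rate(2)` yields 1 and `effective_divide_rate(3)` yields 2; building with `-DGEMM_DIVIDE_LIMIT=6` would extend the rate-1 path to N=6.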