[Blackwell] Enable MMA pipelining for scaled dot when TMEM copy is used #5812

Merged: 43 commits merged into triton-lang:main on Feb 5, 2025

Conversation

@masahi (Collaborator) commented Feb 4, 2025

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between TMEM copy rewriting and software pipelining (SWP): currently, TMEM copy rewriting relies on SWP to put the loading of scales into SMEM, while applying MMA pipelining during SWP requires TMEM copy rewriting to have already happened. I propose to break the cycle by having the loading of scales go through `local_alloc` and `local_load` in `AccelerateMatmul`. This way, TMEM copy rewriting happens during [the first call to `OptimizeDotOperands`](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260), before SWP, and the local alloc and load added in `AccelerateMatmul` are eliminated during SWP. Adding a local alloc for the scales there is a bit ad hoc, since the scales do not need to be in SMEM, but other solutions, such as decoupling MMA pipelining from SWP, are more difficult.
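
To make the shape of the workaround concrete, here is a rough Python-style sketch of the rewrite. The builder helpers (`make_local_alloc`, `make_local_load`) are hypothetical stand-ins for the actual MLIR C++ rewrite code in `AccelerateMatmul`:

```python
# Hypothetical Python-style sketch of the AccelerateMatmul change; the real
# implementation is an MLIR C++ rewrite, and these builder helpers are
# illustrative only.

def route_scale_through_smem(builder, scale):
    """Wrap a scaled-dot scale operand in local_alloc + local_load.

    Before:  scale (registers) --------------------> scaled MMA
    After:   scale -> local_alloc -> local_load ---> scaled MMA

    The alloc/load pair is a placeholder: it lets TMEM copy rewriting fire
    in the first OptimizeDotOperands run (before SWP), and SWP later
    eliminates the pair when it pipelines the scale load.
    """
    smem = builder.make_local_alloc(scale)  # hypothetical: registers -> SMEM
    return builder.make_local_load(smem)    # hypothetical: SMEM -> registers
```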

The other changes in this PR make SWP recognize the loading of scales when there is a TMEM copy between the scale load and the MMA.
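
For context, the pass ordering this relies on looks roughly like the sketch below, modeled on the TTGIR pipeline in the NVIDIA backend's `compiler.py`; the pass names and signatures are approximations and vary across Triton versions, so treat this as an assumed shape rather than an exact excerpt:

```python
# Approximate sketch of the relevant TTGIR pass ordering; not an exact
# excerpt from third_party/nvidia/backend/compiler.py.

def make_ttgir_sketch(pm, passes, opt, capability):
    # AccelerateMatmul now routes scale loads through local_alloc/local_load.
    passes.ttgpuir.add_accelerate_matmul(pm)
    # First OptimizeDotOperands run: TMEM copy rewriting can already happen
    # here, before SWP, because the scales now appear to come from SMEM.
    passes.ttgpuir.add_optimize_dot_operands(pm, capability >= 80)
    # ... other TTGIR passes ...
    # SWP: applies MMA pipelining and eliminates the placeholder
    # local_alloc/local_load pair added in AccelerateMatmul.
    passes.ttgpuir.add_pipeline(pm, opt.num_stages)
```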

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

@masahi masahi requested a review from ptillet as a code owner February 4, 2025 22:44
@masahi masahi marked this pull request as draft February 4, 2025 23:34
@masahi (Collaborator, Author) commented Feb 4, 2025

Marking as draft for now since one of the lit tests is hanging after the last main merge. Debugging.

@masahi masahi marked this pull request as ready for review February 5, 2025 00:37
@masahi (Collaborator, Author) commented Feb 5, 2025

Fixed; ready for review.

@ThomasRaoux (Collaborator) left a comment

LGTM, added a small nit

@ThomasRaoux ThomasRaoux merged commit ac9574c into triton-lang:main Feb 5, 2025
7 checks passed
AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Feb 6, 2025
…ed (triton-lang#5812)

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>