[Blackwell] Enable MMA pipelining for scaled dot when TMEM copy is used #5812

masahi · 2025-02-04T22:44:10Z

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is a dependency cycle between TMEM copy rewriting and software pipelining (SWP): TMEM copy rewriting currently relies on SWP to put the loaded scales into SMEM, while applying MMA pipelining during SWP requires TMEM copy rewriting to have already happened. I propose to break the cycle by having the scale loads go through local_alloc and local_load in AccelerateMatmul. This way, TMEM copy rewriting happens during the first call to OptimizeDotOperands (https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260), before SWP, and the local_alloc and local_load added in AccelerateMatmul are eliminated during SWP. Adding a local alloc for scales there is a bit ad hoc, since scales do not need to be in SMEM, but other solutions, such as decoupling MMA pipelining from SWP, are more difficult.
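To make the ordering argument concrete, here is a minimal toy model of the three steps. It is a sketch under stated assumptions, not Triton code: the op strings, the function names, and the `stage_scales_in_smem` flag are all hypothetical stand-ins, and the toy does not model the later elimination of the staging ops during SWP.

```python
# Toy model only: op names and pass functions are hypothetical stand-ins,
# not Triton's actual IR or API.

def accelerate_matmul(ops, stage_scales_in_smem):
    """Lower the scaled dot to an MMA; optionally stage scales through SMEM."""
    out = []
    for op in ops:
        if op == "scaled_dot":
            if stage_scales_in_smem:
                # The change proposed in this PR, in toy form.
                out += ["local_alloc(scales)", "local_load(scales)"]
            out.append("mma_scaled")
        else:
            out.append(op)
    return out


def optimize_dot_operands(ops):
    """Rewrite a local_load of scales that feeds the MMA into a TMEM copy."""
    if "local_load(scales)" in ops and "mma_scaled" in ops:
        return ["tmem_copy(scales)" if op == "local_load(scales)" else op
                for op in ops]
    return ops  # nothing in SMEM yet, so there is nothing to rewrite


def software_pipeline(ops):
    """Pipeline the MMA only if the TMEM copy rewrite has already happened."""
    if "tmem_copy(scales)" in ops:
        return [op.replace("mma_scaled", "mma_scaled[pipelined]") for op in ops]
    return ops


def run(stage_scales_in_smem):
    ops = ["load(scales)", "scaled_dot"]
    ops = accelerate_matmul(ops, stage_scales_in_smem)
    ops = optimize_dot_operands(ops)  # first call, before SWP
    return software_pipeline(ops)


print(run(False))  # scales not staged: the MMA stays unpipelined
print(run(True))   # scales staged early: the MMA is marked as pipelined
```

In this toy, only the run that stages the scales in the first step gives the rewrite something to match before pipelining starts, which is the cycle-breaking idea described above.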

The other changes in this PR make SWP recognize the loading of scales when there is a TMEM copy between the scale load and the MMA.
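As a rough illustration of what "recognizing" such a load can mean, the sketch below walks a toy use chain from a load to an MMA and only looks through ops whose kind is in a caller-supplied set. The graph encoding, the op names, and the `feeds_mma` helper are assumptions made for this example and do not reflect the actual pipeliner data structures.

```python
# Toy def-use walk; op names, kinds, and the helper are hypothetical.

def feeds_mma(op, users, kind, pass_through):
    """Return True if `op` reaches an MMA through pass-through ops only."""
    stack, seen = [op], set()
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for user in users.get(cur, []):
            if kind[user] == "mma":
                return True
            if kind[user] in pass_through:
                stack.append(user)  # keep following the chain
    return False


# load_scale -> alloc -> copy -> mma
users = {"load_scale": ["alloc"], "alloc": ["copy"], "copy": ["mma"]}
kind = {"load_scale": "load", "alloc": "local_alloc",
        "copy": "tmem_copy", "mma": "mma"}

print(feeds_mma("load_scale", users, kind, {"local_alloc"}))
print(feeds_mma("load_scale", users, kind, {"local_alloc", "tmem_copy"}))
```

With only `local_alloc` treated as pass-through, the walk stops at the copy and the load is not associated with the MMA; adding `tmem_copy` to the set lets the walk reach it.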

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

Co-authored-by: Jason Knight <[email protected]>

masahi · 2025-02-04T23:59:56Z

Marking as draft for now since one of the lit tests hangs after the last merge from main. Debugging.

masahi · 2025-02-05T00:38:27Z

Fixed, ready for review

ThomasRaoux

LGTM, added a small nit

lib/Dialect/TritonGPU/Transforms/Pipeliner/TC05MMAPipeline.cpp

Co-authored-by: Thomas Raoux <[email protected]>

@ThomasRaoux

masahi and others added 30 commits January 30, 2025 17:26

load scales in lit test

62b253e

stub

a131bb6

wip

f0c4a78

use 5d scale

c6e45f7

working?

581e7e6

make lit test utccp-compatible

9fda44f

add back 2d scale test

f263972

reenable MMA pipe for scaled dot

ebee5a6

update test

fa1b451

working for swp

d6709e1

Support tmem copy op in transitive use chain

a555e54

minor improv in SWP

3565565

add proper logic to decide when scaled dot is safe to pipeline

50c3e07

format

8baf909

wip

c8aca61

attempt adding explicit barrier wait after UTCCP

293b65d

restore test

7d989b5

Merge branch 'main' into reenable-mma-pipe-bw-mxfp

4d43bea

merge fix

e437762

all tests pass by adding monkey patch for ptxas disable opt

7627f87

fixed BW pipeline test

42b8a8b

add SWP test for utccp

f28471f

move sync lowering pass to ttgir pipeline

3b911e4

wip

4d05667

fix accel matmul test

aeb3be4

Merge branch 'main' into reenable-mma-pipe-bw-mxfp

ff49757

update accel matmul lit test

e171a7c

revert

d7bf456

add test for MMA pipeline with utccp

33d3f6e

precommit

fd5a219

Masahiro Masuda and others added 10 commits February 4, 2025 00:30

add comment

8e73cff

minor

e55e130

improve the note on the workaround in test

b033724

Co-authored-by: Jason Knight <[email protected]>

simplify the workaround comment

c8e5fb9

address feedback

9fe1ce9

Merge branch 'main' into reenable-mma-pipe-bw-mxfp

236b110

fix

00c7db3

precommit

b339461

more comment polish

7830dfc

Merge branch 'main' into reenable-mma-pipe-bw-mxfp

9930e95

masahi requested a review from ptillet as a code owner February 4, 2025 22:44

masahi marked this pull request as draft February 4, 2025 23:34

csullivan mentioned this pull request Feb 5, 2025

[Blackwell][TUTORIALS] Add tutorial 10-block-scaled-matmul.py #5813

Merged

Masahiro Masuda added 2 commits February 5, 2025 00:25

fix in lit test

8c7d071

workaround in accel matmul for lit test having no load on scale

5c737c8

masahi marked this pull request as ready for review February 5, 2025 00:37

ThomasRaoux approved these changes Feb 5, 2025

lib/Dialect/TritonGPU/Transforms/Pipeliner/TC05MMAPipeline.cpp (outdated, resolved)

Update lib/Dialect/TritonGPU/Transforms/Pipeliner/TC05MMAPipeline.cpp

a80e014

Co-authored-by: Thomas Raoux <[email protected]>

ThomasRaoux merged commit ac9574c into triton-lang:main Feb 5, 2025
7 checks passed
