
Conversation


@danielvegamyhre (Contributor) commented Aug 28, 2025

Summary

  • Add conditional logic to benchmarks and tests so they only run if the SM arch supports the underlying GEMMs being exercised (when applicable). This addresses "float8 rowwise scaled grouped mm doesn't support B200" (#2904), since torch._scaled_grouped_mm is not yet supported on B200.
  • Make all MoE training tests have identical comprehensive parameterization where applicable, for robustness + consistency. All tests now have cases for all combinations of:
    • recipes: mxfp8, fp8 rowwise
    • execution: eager, compiled
    • target fqns: routed experts only, routed experts + shared experts
    • distributed tests cover FSDP, TP, and FSDP+TP. EP tests coming next.
  • Add TFLOPs and speedup calculations to grouped gemms bench script
  • In benchmark_scaled_grouped_mm_dq.py, split the measurements into (1) forward time and (2) end-to-end forward+backward time. The backward 2D-2D GEMM is not ready yet, so I want to measure and optimize the forward pass first.
  • Add more tests to test_everything.sh and ensure everything passes.
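
The arch-gating idea in the first bullet can be sketched as below. The helper name is hypothetical, not the actual torchao utility; in practice the capability tuple would come from torch.cuda.get_device_capability().

```python
# Hedged sketch of SM-arch gating for tests/benchmarks. The helper name
# is illustrative, not torchao's actual API.

def supports_fp8_rowwise_grouped_mm(capability: tuple[int, int]) -> bool:
    # torch._scaled_grouped_mm with float8 rowwise scales currently targets
    # Hopper (SM 9.0); Blackwell (SM 10.0, e.g. B200) is not supported yet,
    # which is what #2904 tracks.
    major, minor = capability
    return (major, minor) == (9, 0)

# A test could then be gated like:
#   @pytest.mark.skipif(
#       not supports_fp8_rowwise_grouped_mm(torch.cuda.get_device_capability()),
#       reason="fp8 rowwise scaled grouped mm requires SM 9.0 (Hopper)",
#   )
```

Gating on the capability tuple (rather than the device name) keeps the check stable across different SKUs of the same architecture.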

Test plan

  • ./test/prototype/moe_training/test_everything.sh
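
The TFLOPs and speedup calculations mentioned in the summary can be sketched as follows; function names are illustrative, not the bench script's actual code.

```python
# Hedged sketch of the TFLOPs / speedup math for a grouped-GEMM benchmark.

def grouped_mm_tflops(group_sizes: list[int], n: int, k: int,
                      time_seconds: float) -> float:
    """TFLOP/s for a grouped GEMM where group i computes (m_i, k) @ (k, n).

    Each (m, k) @ (k, n) matmul costs ~2*m*n*k FLOPs (one multiply and
    one add per inner-product element).
    """
    total_flops = sum(2 * m * n * k for m in group_sizes)
    return total_flops / time_seconds / 1e12

def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate kernel is than the baseline."""
    return baseline_seconds / candidate_seconds
```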

@danielvegamyhre added the `topic: not user facing` label (use this tag if you don't want this PR to show up in release notes) Aug 28, 2025

pytorch-bot bot commented Aug 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2905

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⏳ No Failures, 7 Pending

As of commit 7eb6bfd with merge base 083d0c3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the `CLA Signed` label (authors need to sign the CLA before a PR can be reviewed) Aug 28, 2025
@danielvegamyhre force-pushed the danielvegamyhre/stack/64 branch from 843448d to 02e246a on August 29, 2025 17:16
@vkuzo (Contributor) left a comment:


stamping prototype code

@danielvegamyhre force-pushed the danielvegamyhre/stack/64 branch from 02e246a to e64d4b5 on August 29, 2025 18:53
@danielvegamyhre changed the base branch from danielvegamyhre/stack/64 to main on August 29, 2025 20:02
@danielvegamyhre merged commit 568c193 into main on Aug 29, 2025
18 checks passed