[JAX] Collective GEMM custom op with nvte_cublas_gemm (no comm. overlap) #1307
base: main
Conversation
Why? Normal JAX behavior is to do some gathering.
It seems that the batch size is currently not handled in the C++ code. Since JAX uses row-major storage for tensors by default, the batch dimension should probably be combined with the …
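One possible reading of this (truncated) suggestion, shown only as an assumption rather than the commenter's exact intent: with row-major storage, the leading batch dimension of the LHS can be folded into its rows so a single 2D GEMM covers the batch.

    import jax.numpy as jnp

    lhs = jnp.zeros((4, 128, 64), dtype=jnp.bfloat16)   # ([B], M, K)
    lhs_2d = jnp.reshape(lhs, (-1, lhs.shape[-1]))      # (B * M, K); a cheap reshape for row-major data
    assert lhs_2d.shape == (4 * 128, 64)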
@denera I have some questions about the PR.
    # Validate operand layouts
    lhs_inner_dim, rhs_inner_dim = map(
        lambda inner_dim, ndims: (ndims - inner_dim) if inner_dim < 0 else inner_dim,
@denera It should be ndims + inner_dim when inner_dim is negative, right?
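For reference, a quick check of the sign convention being pointed out here (a hypothetical helper, not the PR's code):

    def normalize_dim(inner_dim, ndims):
        # A negative axis like -1 should resolve to the last axis, i.e. ndims + inner_dim.
        return ndims + inner_dim if inner_dim < 0 else inner_dim

    assert normalize_dim(-1, 3) == 2   # last axis of a 3D tensor
    assert normalize_dim(1, 3) == 1    # non-negative indices pass through unchanged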
    rhs_trans = contracting_dims[1] == rhs.ndim - 1
    lhs = jnp.matrix_transpose(lhs) if lhs_trans and jax_dtype_is_fp8(lhs.dtype) else lhs
    rhs = jnp.matrix_transpose(rhs) if not rhs_trans and jax_dtype_is_fp8(rhs.dtype) else rhs
    contracting_dims = (1, 1)
@denera is there a need to hard-code this?
cuBlasLt GEMM requires non-transposed LHS and transposed RHS for FP8 GEMM, but the batcher is not the right place to check/force that. Also, leaving contracting_dims=(1, 1)
out of the conditional for FP8 type is a mistake. Thanks for catching it!
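A minimal sketch of keeping the contracting_dims override inside the FP8 conditional, as described above (hand-written here with an assumed FP8 dtype check; not the exact fix pushed to the PR):

    import jax.numpy as jnp

    def _is_fp8(dtype):
        # Assumed stand-in for the PR's jax_dtype_is_fp8 helper.
        return dtype in (jnp.float8_e4m3fn, jnp.float8_e5m2)

    def maybe_force_fp8_layout(lhs, rhs, lhs_trans, rhs_trans, contracting_dims):
        if _is_fp8(lhs.dtype) and _is_fp8(rhs.dtype):
            # cuBlasLt FP8 GEMM requires non-transposed LHS and transposed RHS.
            lhs = jnp.matrix_transpose(lhs) if lhs_trans else lhs
            rhs = jnp.matrix_transpose(rhs) if not rhs_trans else rhs
            contracting_dims = (1, 1)
        return lhs, rhs, contracting_dims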
        grad=grad,
        accumulate=accumulate,
        use_split_accumulator=use_split_accumulator,
    )(lhs_bdims, out_amax_bdims, out_scale_bdims, gelu_input_bdims, bias_bdims)
This gives me an error. The code at https://github.com/NVIDIA/TransformerEngine/pull/1307/files#diff-f5b74ca3c5a70acb3d764e9b8adea40b8bab554fe4d2362f3052b7b932c0464dR187-R194 returns a tuple.
TypeError: 'list' object is not callable
cc @denera
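For context, a generic reproduction of that error class (unrelated to the actual TE code): it appears whenever a list is used where a callable is expected.

    handlers = [print]       # a list, not a function
    handlers[0]("hello")     # works: index first, then call
    handlers("hello")        # raises TypeError: 'list' object is not callable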
    self._amax_list[FP8MetaPackage.OUTPUT_IDX] = output_amax
    self._scale_list[FP8MetaPackage.OUTPUT_IDX] = output_scale
Hi,
For the delayed scaling FP8 recipe, the output amax and scale from GEMM are not used anywhere else afterward, so I think we don't need to output and store them.
The FP8 GEMM+RS overlap needs output amax/scale when the communication buffer type is FP8 -- i.e. the overlap algorithms/kernels communicate FP8 GEMM output and fuse BF16 upcast into the sum-reduce.
This PR does not implement TP overlap, but PR #1337 extends the same operations to support TP overlap, so I'm including the output amax/scale infrastructure here.
    )
    return a, a_q, jnp.reciprocal(a_scale), b, b_q, jnp.reciprocal(b_scale), bias

    @pytest.mark.parametrize("m,n,k", GEMM_CASES)
Need to provide a list for test parameter b (batch size).
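A minimal illustration of this suggestion (names like BATCH_CASES and the shapes below are assumptions, not the PR's constants):

    import pytest

    BATCH_CASES = [1, 4]              # hypothetical batch sizes
    GEMM_CASES = [(256, 256, 512)]    # hypothetical (m, n, k) shapes

    @pytest.mark.parametrize("b", BATCH_CASES)
    @pytest.mark.parametrize("m,n,k", GEMM_CASES)
    def test_shapes(b, m, n, k):
        # Placeholder body; the real test would call the GEMM op with these sizes.
        assert all(dim > 0 for dim in (b, m, n, k))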
Actually, this PR is not supposed to modify test_custom_call_compute.py. These changes are erroneous and need to be removed. Thank you for catching it!
    def test_gemm(self, b, m, n, k, use_bias, do_gelu):
        a, b, bias = self._generate_inputs(b, m, n, k, jnp.bfloat16)

        primitive_out = gemm(a, b, bias=bias if use_bias else None, layout="NT", do_gelu=do_gelu)
Do we really need to provide or use the layout parameter here? On one hand, users or other functions in TE are unlikely to use this argument (I think the C/C++ code would need it, but not the Python code); on the other hand, does it make distributed-memory sharding more complicated?
Changes to this file are erroneous and I just pushed a commit to restore the original. All tests for the new collective GEMM custom op are written in test_distributed_gemm.py instead.
Commits:
- Added XLA FFI custom op for TE GEMM
- Finished GEMM custom op primitive and serial unit test
- Fixed GEMM custom op batcher
- Fixed output dtype error and contracting dimensions options
- AG overlap working but executes scatter to match outer LHS dim
- Both all-gather and all-reduce are now working
- Code style
- Changed kwargs in abstract to be explicit
- Added fwd/bwd implementation for non-fp8 gemm
- … passing test
- …ide the custom op
- …xt-parallel LHS operands
- … and TP-only meshes
- [pre-commit.ci] auto fixes (for more information, see https://pre-commit.ci)

Signed-off-by: Alp Dener <[email protected]>
    resources.update(dict(dp_resource="dp"))
    if parallel_dist == "FSDP_TP":
        fsdp = True
        mesh_shape.update(dict(tp=NUM_DEVICES // 2, dp=1, zp=NUM_DEVICES // 2))
This mesh shape calculation is incorrect. Suggested revision:
    if parallel_dist in ["DP_TP", "FSDP_TP"]:
        batched = True
        tp = NUM_DEVICES // 2
        dp = NUM_DEVICES // tp
        mesh_shape.update(dict(tp=tp, dp=dp))
        resources.update(dict(dp_resource="dp"))
        if parallel_dist == "FSDP_TP":
            fsdp = True
            dp = 1
            zp = NUM_DEVICES // tp
            mesh_shape.update(dict(tp=tp, dp=1, zp=zp))
            resources.update(dict(fsdp_resource="zp"))
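A quick sanity check of the suggested shapes (NUM_DEVICES = 8 is an assumed value for illustration): the product of the mesh axes must equal the device count in both branches.

    NUM_DEVICES = 8
    tp = NUM_DEVICES // 2        # 4
    dp = NUM_DEVICES // tp       # 2 for DP_TP
    zp = NUM_DEVICES // tp       # 2 for FSDP_TP, where dp collapses to 1
    assert tp * dp == NUM_DEVICES            # DP_TP mesh covers all devices
    assert tp * 1 * zp == NUM_DEVICES        # FSDP_TP mesh covers all devices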
Description
Implements both old-style and new FFI-based XLA custom calls in C++, and the corresponding JAX primitive including custom partitioning rules.

Custom partitioning rules for a LHS:([B,] M, K) x RHS:([B,] K, N) = OUT:([B,] M, N) batched mat-mul operation, where [B] is the batch dimension. The rules cover the [B] dimension for all operands, the M dimension, and the K and N dimensions:
- gather the K dimension of LHS to match the partitioning of the K dimension of RHS;
- if the K dimension is partitioned but the M dimension is not, jax.lax.psum (all-reduce) the output over the TP mesh resource;
- if both the M and K dimensions are partitioned, jax.lax.psum_scatter (reduce-scatter) the output over the TP mesh resource.

In practice, the RHS matrix (typically the weight tensor) should be allocated with transposed contracting dimensions ([B,] N, K) for optimal GEMM heuristics in cuBlasLt. This layout is also mandatory for FP8 inputs.

This PR does NOT update fused ops or Flax/Praxis modules to use the new GEMM custom op over the existing XLA pattern-matching approach.
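As a rough illustration of the two output reductions listed above (a hand-written sketch intended to run under jax.experimental.shard_map.shard_map with a "tp" mesh axis, not the PR's partitioning code):

    import jax
    import jax.numpy as jnp

    def partial_gemm_and_reduce(lhs_local, rhs_local, scatter_output):
        # Each device holds a K-shard of LHS and RHS, so the local matmul
        # produces a partial result that must be summed over the "tp" axis.
        partial = jnp.matmul(lhs_local, rhs_local)
        if scatter_output:
            # M is also sharded: reduce-scatter the partial outputs over "tp".
            return jax.lax.psum_scatter(partial, "tp", scatter_dimension=0, tiled=True)
        # M is not sharded: all-reduce the partial outputs over "tp".
        return jax.lax.psum(partial, "tp")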
Type of change

Changes
- Collective GEMM custom op with nvte_cublas_gemm.

Checklist: