Add fast decode plan for flashinfer mla #3987

Merged · 6 commits into sgl-project:main on Mar 3, 2025

Conversation

@Fridge003 (Collaborator) commented Mar 2, 2025

Motivation

When the flashinfer MLA backend and CUDA graph are used together, graph replay hangs because indptr tensors are transferred between CPU and GPU inside BatchMLAPagedAttentionWrapper.plan.

This PR fixes the issue by adding a new decode_seq_lens_cpu field to the forward batch and customizing a faster decode plan for graph replay.

Some issues (#3906, #3917) also point out that the current flashinfer MLA backend performs worse than Triton on long-output workloads. Hopefully this PR mitigates that problem as well.

Modifications

  • Add a new decode_seq_lens_cpu field to the forward batch, which makes the seq_lens information available on the CPU in advance.
  • Write fast_mla_decode_plan, which avoids transferring indptr tensors from GPU to CPU during graph replay (a rough sketch of the idea follows below).
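
The core idea can be sketched roughly as follows. This is an illustrative sketch only, not the code added by the PR; build_decode_indptr_cpu and kv_indptr_buf are made-up names standing in for the CPU-side seq_lens field and the preallocated flashinfer indptr buffer.

import torch

def build_decode_indptr_cpu(seq_lens_cpu: torch.Tensor, kv_indptr_buf: torch.Tensor) -> None:
    """Fill a preallocated GPU indptr buffer from CPU-side sequence lengths.

    seq_lens_cpu:  integer tensor of shape [batch_size], already on the CPU
                   (the role played by decode_seq_lens_cpu in the forward batch).
    kv_indptr_buf: preallocated int32 tensor of shape [batch_size + 1] (or larger) on the GPU.
    """
    bs = seq_lens_cpu.numel()
    indptr_cpu = torch.zeros(bs + 1, dtype=torch.int32)
    indptr_cpu[1:] = torch.cumsum(seq_lens_cpu, dim=0, dtype=torch.int32)
    # Host-to-device copy only; with pinned memory it can even be asynchronous.
    # Nothing here reads GPU data back to the host, so graph replay never has
    # to block on a device-to-host memcpy.
    kv_indptr_buf[: bs + 1].copy_(indptr_cpu, non_blocking=True)

The expensive seq_lens.cpu() copy still happens, but earlier (in get_model_worker_batch, see the diff below) and outside the graph-replay path.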

Accuracy

Launching

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --enable-flashinfer-mla

GSM8K

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
Accuracy: 0.956
Invalid: 0.000
Latency: 101.581 s
Output throughput: 1336.431 token/s

MMLU

bash benchmark/mmlu/download_data.sh
python3 benchmark/mmlu/bench_sglang.py --nsub 100 --ntrain 5 --parallel 2000
Total latency: 182.686
Average accuracy: 0.871

Benchmark

To better expose the improvement from this PR, the benchmarks are run on long-output workloads (so the number of graph replays is increased) with DeepSeek-V2-Lite on an NVIDIA H200. Each benchmark is run five times and the average throughput is reported. With this PR, the throughput of flashinfer MLA on these workloads improves by 1% to 2%.

To Launch:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Lite --tp 8 --trust-remote-code --enable-flashinfer-mla 

Input-4096-Output-2048 (same workload as #3917)

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4096 --random-output 2048 --num-prompt 60
Throughput (tok/s)    Flashinfer (this PR)    Flashinfer (before PR)    Triton
Prefill               7757.64                 7540.28                   6563.27
Decode                4038.21                 3925.06                   3416.69

Input-180-Output-400 (same workload as #3906)

python3 -m sglang.bench_serving  --dataset-name=random --num-prompts=600    --random-range-ratio 0.9 --seed 42  --random-input 180 --random-output 400  --request-rate 40 --max-concurrency 40
Throughput (tok/s)    Flashinfer (this PR)    Flashinfer (before PR)    Triton
Prefill               1585.48                 1558.35                   1532.36
Decode                3532.19                 3480.68                   3413.85

Input-100-Output-2000

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 100 --random-output 2000 --request-rate 2 --num-prompt 120
Throughput (tok/s)    Flashinfer (this PR)    Flashinfer (before PR)    Triton
Prefill               80.63                   80.22                     80.00
Decode                1751.71                 1742.90                   1737.88

Profiler Result

Profiling with the torch profiler shows that the time spent waiting on memcpyAsync is removed. Since MLA with weight absorption is compute-bound and the GPU is already fully utilized, the influence on end-to-end throughput is not obvious.
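
For reference, such a trace can be collected with torch.profiler roughly as follows. This is a minimal sketch and not part of the PR; run_decode_step is a dummy stand-in for one decode iteration (a model forward / CUDA graph replay in sglang).

import torch
from torch.profiler import profile, ProfilerActivity

def run_decode_step() -> None:
    # Stand-in for one decode iteration.
    x = torch.randn(1024, 1024, device="cuda")
    (x @ x).sum().item()  # .item() forces a device-to-host copy, visible as Memcpy DtoH

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        run_decode_step()
    torch.cuda.synchronize()

# Look for cudaMemcpyAsync / "Memcpy DtoH" entries on the decode path;
# after this PR they should no longer appear during graph replay.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("decode_trace.json")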

Before this PR: [profiler trace screenshot, 2025-03-02 14:51:45]

After this PR: [profiler trace screenshot, 2025-03-02 14:52:06]

@zhyncs merged commit fa56106 into sgl-project:main on Mar 3, 2025
1 of 16 checks passed
spec_info: Optional[SpecInfo],
**kwargs,
Contributor:
Do not use kwargs. It makes the code harder to read because we do not know what the exact arguments are.
Is it possible to specify them more explicitly?

spec_info: Optional[SpecInfo],
**kwargs,
Contributor:

This should be removed.

@@ -1168,8 +1171,10 @@ def merge_batch(self, other: "ScheduleBatch"):

def get_model_worker_batch(self):
if self.forward_mode.is_decode_or_idle():
decode_seq_lens = self.seq_lens.cpu()
Contributor:

This will slow down other things (e.g., speculative decoding, where the overlap scheduler is turned off). Can we only do this when needed?
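
A minimal sketch of what gating the copy could look like (hypothetical; backend_needs_seq_lens_cpu is an invented predicate, not an actual sglang attribute):

import torch
from typing import Optional

def maybe_copy_seq_lens_to_cpu(
    seq_lens: torch.Tensor, backend_needs_seq_lens_cpu: bool
) -> Optional[torch.Tensor]:
    # Only pay the blocking device-to-host copy when the attention backend
    # (here: flashinfer MLA's fast decode plan) will actually read the
    # lengths on the host; otherwise skip it entirely.
    return seq_lens.cpu() if backend_needs_seq_lens_cpu else None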


# Common inputs
self.input_ids[:raw_num_token].copy_(forward_batch.input_ids)
self.req_pool_indices[:raw_bs].copy_(forward_batch.req_pool_indices)
self.seq_lens[:raw_bs].copy_(forward_batch.seq_lens)
self.out_cache_loc[:raw_num_token].copy_(forward_batch.out_cache_loc)
self.positions[:raw_num_token].copy_(forward_batch.positions)
if forward_batch.decode_seq_lens_cpu is not None:
self.seq_lens_cpu[:raw_bs].copy_(forward_batch.decode_seq_lens_cpu)
Contributor:

If it is a CPU tensor, it does not need to go through this CUDA graph buffer handling.
