
Support cuda graph for LoRA #4115

Draft · wants to merge 9 commits into base: main
Conversation

@Qiaolin-Yu (Contributor) commented Mar 6, 2025

Motivation

Closes #3282

Modifications

There appear to be no inherent compatibility issues between CUDA Graph and LoRA, so this PR simply removes the restriction and enables the configuration.
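As a rough illustration of what "removing the restriction" amounts to (the class and field names below are a hypothetical sketch, not the actual sglang code), the change boils down to no longer forcing CUDA Graph off whenever LoRA adapters are configured:

```python
# Hypothetical sketch of a server-args compatibility guard.
# ServerArgs, lora_paths and disable_cuda_graph are illustrative
# names, not a claim about sglang's real internals.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerArgs:
    lora_paths: List[str] = field(default_factory=list)
    disable_cuda_graph: bool = False

    def apply_compat_guards(self) -> None:
        # Before a change like this PR, a guard of roughly this shape
        # would have forced CUDA Graph off when LoRA was active:
        #
        #     if self.lora_paths:
        #         self.disable_cuda_graph = True
        #
        # The PR's premise is that no such guard is needed, so the
        # method intentionally leaves disable_cuda_graph untouched.
        pass


args = ServerArgs(lora_paths=["winddude/wizardLM-LlaMA-LoRA-7B"])
args.apply_compat_guards()
print(args.disable_cuda_graph)  # False: CUDA Graph stays enabled with LoRA
```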

Benchmarking Result

I ran the following commands to launch the server.

# Triton backend
python benchmark/lora/launch_server.py --max-loras-per-batch 4 --lora-backend triton

# Base model without lora
# python benchmark/lora/launch_server.py --base-only

Then I ran this command to send requests from the client:

python benchmark/lora/lora_bench.py

Benchmark Configs

  • base model: meta-llama/Llama-2-7b-hf
  • lora adapter: winddude/wizardLM-LlaMA-LoRA-7B
  • GPU: Nvidia A100
  • maximum number of serving loras: 4
  • number of requests: 50
  • input length: uniform random distribution on [1, 1024]
  • output length: uniform random distribution on [1, 128]

Here are the results.

| Backend | Enable CUDA Graph | Total Throughput (tok/s) | Mean E2E Latency (ms) |
|---------|-------------------|--------------------------|-----------------------|
| Triton  | True              | 3518.19                  | 4528.47               |
| Triton  | False             | 2010.20                  | 7590.18               |
| No LoRA | True              | 4275.06                  | 3760.48               |
| No LoRA | False             | 3073.95                  | 5009.22               |
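To make the gains easier to compare, the speedups can be computed directly from the numbers above (pure arithmetic on the reported values, nothing measured anew):

```python
# Speedup from enabling CUDA Graph, using the throughput (tok/s) and
# mean E2E latency (ms) values reported in the table above.
results = {
    # backend: {cuda_graph_state: (throughput_tok_s, latency_ms)}
    "Triton": {"on": (3518.19, 4528.47), "off": (2010.20, 7590.18)},
    "No LoRA": {"on": (4275.06, 3760.48), "off": (3073.95, 5009.22)},
}

for backend, runs in results.items():
    tp_gain = runs["on"][0] / runs["off"][0]   # higher is better
    lat_gain = runs["off"][1] / runs["on"][1]  # higher is better
    print(f"{backend}: {tp_gain:.2f}x throughput, {lat_gain:.2f}x lower latency")
```

Enabling CUDA Graph yields roughly a 1.75x throughput gain with the Triton LoRA backend and about 1.39x for the base model without LoRA, so the LoRA path benefits even more than the base path.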

Checklist

@Qiaolin-Yu Qiaolin-Yu marked this pull request as ready for review March 6, 2025 03:19
@Qiaolin-Yu Qiaolin-Yu marked this pull request as draft March 6, 2025 04:19
@Fridge003 Fridge003 mentioned this pull request Mar 6, 2025
Development

Successfully merging this pull request may close these issues.

[Feature] Support compatibility between Cuda Graph and Lora