
Support cuda graph for LoRA #4115

Draft · wants to merge 9 commits into base: main
Conversation

@Qiaolin-Yu (Contributor) commented Mar 6, 2025

Motivation

Closes #3282

Modifications

There appear to be no inherent compatibility issues between CUDA Graph and LoRA, so this PR simply removes the restriction and enables the configuration.
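As a rough illustration of what "removing the restriction" amounts to (the class and field names below are a hypothetical sketch, not the actual sglang code), the change boils down to no longer forcing CUDA Graph off whenever LoRA adapters are configured:

```python
# Hypothetical sketch of a server-args compatibility guard.
# ServerArgs, lora_paths and disable_cuda_graph are illustrative
# names, not a claim about sglang's real internals.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerArgs:
    lora_paths: List[str] = field(default_factory=list)
    disable_cuda_graph: bool = False

    def apply_compat_guards(self) -> None:
        # Before a change like this PR, a guard of roughly this shape
        # would have forced CUDA Graph off when LoRA was active:
        #
        #     if self.lora_paths:
        #         self.disable_cuda_graph = True
        #
        # The PR's premise is that no such guard is needed, so the
        # method intentionally leaves disable_cuda_graph untouched.
        pass


args = ServerArgs(lora_paths=["winddude/wizardLM-LlaMA-LoRA-7B"])
args.apply_compat_guards()
print(args.disable_cuda_graph)  # False: CUDA Graph stays enabled with LoRA
```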

Benchmarking Result

I ran the following commands to launch the server.

# Triton backend
python benchmark/lora/launch_server.py --max-loras-per-batch 4 --lora-backend triton

# Base model without lora
# python benchmark/lora/launch_server.py --base-only

Then I ran this command to send requests from the client:

python benchmark/lora/lora_bench.py

Benchmark Configs

  • base model: meta-llama/Llama-2-7b-hf
  • lora adapter: winddude/wizardLM-LlaMA-LoRA-7B
  • GPU: Nvidia A100
  • maximum number of serving loras: 4
  • number of requests: 50
  • input length: uniform random distribution on [1, 1024]
  • output length: uniform random distribution on [1, 128]

Here are the results.

| Backend | Enable CUDA Graph | Total Throughput (tok/s) | Mean E2E Latency (ms) |
|---------|-------------------|--------------------------|-----------------------|
| Triton  | True              | 3518.19                  | 4528.47               |
| Triton  | False             | 2010.20                  | 7590.18               |
| No LoRA | True              | 4275.06                  | 3760.48               |
| No LoRA | False             | 3073.95                  | 5009.22               |
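To make the gains easier to compare, the speedups can be computed directly from the numbers above (pure arithmetic on the reported values, nothing measured anew):

```python
# Speedup from enabling CUDA Graph, using the throughput (tok/s) and
# mean E2E latency (ms) values reported in the table above.
results = {
    # backend: {cuda_graph_state: (throughput_tok_s, latency_ms)}
    "Triton": {"on": (3518.19, 4528.47), "off": (2010.20, 7590.18)},
    "No LoRA": {"on": (4275.06, 3760.48), "off": (3073.95, 5009.22)},
}

for backend, runs in results.items():
    tp_gain = runs["on"][0] / runs["off"][0]   # higher is better
    lat_gain = runs["off"][1] / runs["on"][1]  # higher is better
    print(f"{backend}: {tp_gain:.2f}x throughput, {lat_gain:.2f}x lower latency")
```

Enabling CUDA Graph yields roughly a 1.75x throughput gain with the Triton LoRA backend and about 1.39x for the base model without LoRA, so the LoRA path benefits even more than the base path.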

Checklist

@Qiaolin-Yu Qiaolin-Yu marked this pull request as ready for review March 6, 2025 03:19
@Qiaolin-Yu Qiaolin-Yu marked this pull request as draft March 6, 2025 04:19
@Fridge003 Fridge003 mentioned this pull request Mar 6, 2025
Development

Successfully merging this pull request may close these issues.

[Feature] Support compatibility between Cuda Graph and Lora