Performance Comparison Between vLLM and Llumnix Deployment #100
Comments
Hi, can you share the scripts and parameters for the vLLM/Llumnix deployment test? We have observed some performance degradation under high concurrency (mainly due to the overhead of Ray RPC), and we are working on resolving it.
Does it mean Llumnix has better performance (lower latency)? We haven't done comparison tests against direct vLLM deployment since upgrading to vLLM v0.6.3.post1; in previous versions we did not observe any performance issue compared with vLLM.
Sorry for the confusion earlier! I meant to say that the TTFT and TPOT for Llumnix are about 20% higher than for vLLM (direct deployment), indicating that Llumnix performs worse than direct vLLM. For the test, we used the following parameters:
--tensor-parallel-size 1 --block-size 32 --cpu-offload-gb 0 --dtype bfloat16 --enable-prefix-caching --gpu-memory-utilization 0.83 --guided-decoding-backend outlines --kv-cache-dtype auto --load-format auto --max-logprobs 20 --max-model-len 4096 --max-num-batched-tokens 8096 --max-num-seqs 1024 --max-seq-len-to-capture 8192 --num-lookahead-slots 0 --num-scheduler-steps 8 --response-role assistant --scheduler-delay-factor 0 --seed 0 --swap-space 8 --tokenizer-mode auto
For example, in single-instance tests we observed that GPU utilization with vLLM stays around 98%, whereas with Llumnix it is only about 96%. Here are the results for each system:
vLLM (GPU 98%): [screenshot]
Llumnix (GPU 96%): [screenshot]
Additionally, we tried changing the executor_class from LlumnixRayGPUExecutor to GPUExecutorAsync to reduce the impact of RPC, but the performance change was minimal. May I ask what other avenues of investigation are available? If so, we can assist with testing and verification.
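For completeness, here is roughly how those flags map onto a single-instance launch command (a sketch only; the OpenAI-compatible entrypoint and the model path below are placeholders rather than our exact setup):

```bash
# Sketch of the single-instance vLLM launch used for the comparison.
# MODEL_PATH is a placeholder; substitute the local qwen2-7b checkpoint.
MODEL_PATH=/path/to/qwen2-7b

python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --tensor-parallel-size 1 \
    --block-size 32 \
    --cpu-offload-gb 0 \
    --dtype bfloat16 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.83 \
    --guided-decoding-backend outlines \
    --kv-cache-dtype auto \
    --load-format auto \
    --max-logprobs 20 \
    --max-model-len 4096 \
    --max-num-batched-tokens 8096 \
    --max-num-seqs 1024 \
    --max-seq-len-to-capture 8192 \
    --num-lookahead-slots 0 \
    --num-scheduler-steps 8 \
    --response-role assistant \
    --scheduler-delay-factor 0 \
    --seed 0 \
    --swap-space 8 \
    --tokenizer-mode auto
```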
We have done some benchmarks recently and found similar performance degradation compared to direct deployment of vLLM. However, we recently added support for use_ray_spmd_worker and for the scheduler that only sends delta data (both are features of vLLM v0.6.3.post1). When we enable these two features, Llumnix's performance is slightly better than vLLM under a 1-instance deployment (request and decode latency are slightly lower than vLLM, while prefill latency is slightly higher). We are working on figuring out the exact reason and will sync with you once we have conclusions.
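For reference, a minimal sketch of how these features are typically enabled on the vLLM side via environment variables (the Llumnix-specific wiring is omitted here; in vLLM the SPMD worker path is generally used together with the Ray compiled DAG):

```bash
# Sketch: turn on vLLM's SPMD Ray worker and compiled-DAG execution path
# before launching the server. The delta-data scheduler mentioned above is a
# separate v0.6.3 feature; its Llumnix-side toggle is not shown here.
export VLLM_USE_RAY_SPMD_WORKER=1
export VLLM_USE_RAY_COMPILED_DAG=1

# Then launch the serving instance as usual, e.g.:
# python -m vllm.entrypoints.openai.api_server --model /path/to/qwen2-7b ...
```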
We have addressed this issue in a PR.
We encountered an issue during testing with Llumnix:
Comparison of Two Deployment Modes:
1. Direct vLLM deployment (5 instances)
2. Llumnix deployment (5 instances) with only load-based request dispatching enabled (--dispatch-policy load) and migration disabled (--enable-migration not passed); a rough sketch of both launch commands is below
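The sketch below is illustrative only: model paths are placeholders, and the Llumnix entrypoint shown follows the Llumnix documentation, so it may differ across versions.

```bash
# Mode 1: direct vLLM deployment, one OpenAI-compatible server per GPU (x5).
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/qwen2-7b \
    --port 8000   # plus the engine flags used for the test

# Mode 2: Llumnix deployment with load-based dispatch and migration disabled
# (--enable-migration is simply not passed). Entrypoint per the Llumnix docs.
python -m llumnix.entrypoints.vllm.api_server \
    --dispatch-policy load \
    --model /path/to/qwen2-7b \
    --port 8000   # plus the same engine flags
```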
Test Setup:
• Instances: 5
• Model: qwen2-7b
• GPU: 4090 24GB
• Concurrency: 120
• vLLM version: v0.6.3.post1
Findings:
We observed that the TTFT and TPOT for Llumnix are about 20% lower compared to vLLM (direct deployment). Additionally, the GPU usage is lower in Llumnix compared to vLLM.
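As a rough way to reproduce these numbers (a sketch, not our exact harness): TTFT/TPOT can be collected with vLLM's bundled serving benchmark, and GPU utilization sampled with nvidia-smi; the exact benchmark flags vary a bit across vLLM versions.

```bash
# Illustrative measurement setup (our actual load generator holds a fixed
# concurrency of 120; benchmark_serving.py flags vary by vLLM version).
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host 127.0.0.1 --port 8000 \
    --model /path/to/qwen2-7b \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 2000 \
    --request-rate 40   # the report includes mean/median/p99 TTFT and TPOT

# Sample GPU utilization once per second while the benchmark runs.
nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1
```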
Possible Cause:
We suspect the performance difference could be related to the Ray framework being used in Llumnix.
Questions:
• Have you done any performance comparisons like this before?
• Do you have any recommendations or insights to improve performance?