Performance Comparison Between vLLM and Llumnix Deployment #100
Comments
Hi, can you share the scripts and parameters for the vLLM/Llumnix deployment test? We have observed some performance degradation under high concurrency (mainly due to the overhead of Ray RPC), and we are working on resolving it.
Does it mean Llumnix has better performance (lower latency)? We haven't done comparison tests against direct vLLM deployment since upgrading to vLLM v0.6.3.post1; in previous versions we did not observe any performance issue compared with vLLM.
Sorry for the confusion earlier! I meant to say that the TTFT and TPOT for Llumnix are about 20% higher than for vLLM (direct deployment), indicating that Llumnix performs worse than direct vLLM. For the test, we used the following parameters:
--tensor-parallel-size 1 --block-size 32 --cpu-offload-gb 0 --dtype bfloat16 --enable-prefix-caching --gpu-memory-utilization 0.83 --guided-decoding-backend outlines --kv-cache-dtype auto --load-format auto --max-logprobs 20 --max-model-len 4096 --max-num-batched-tokens 8096 --max-num-seqs 1024 --max-seq-len-to-capture 8192 --num-lookahead-slots 0 --num-scheduler-steps 8 --response-role assistant --scheduler-delay-factor 0 --seed 0 --swap-space 8 --tokenizer-mode auto
For example, in single-instance tests we observed that GPU utilization with vLLM stays around 98%, whereas with Llumnix it is only about 96%. Here are the results for each system:
vLLM (GPU 98%): [screenshot]
Llumnix (GPU 96%): [screenshot]
Additionally, we tried changing the executor_class from LlumnixRayGPUExecutor to GPUExecutorAsync to reduce the impact of RPC, but the performance change was minimal. May I ask what other avenues of investigation are available? If so, we can assist with testing and verification.
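For completeness, here is roughly how those flags map onto a single-instance launch command (a sketch only; the OpenAI-compatible entrypoint and the model path below are placeholders rather than our exact setup):

```bash
# Sketch of the single-instance vLLM launch used for the comparison.
# MODEL_PATH is a placeholder; substitute the local qwen2-7b checkpoint.
MODEL_PATH=/path/to/qwen2-7b

python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --tensor-parallel-size 1 \
    --block-size 32 \
    --cpu-offload-gb 0 \
    --dtype bfloat16 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.83 \
    --guided-decoding-backend outlines \
    --kv-cache-dtype auto \
    --load-format auto \
    --max-logprobs 20 \
    --max-model-len 4096 \
    --max-num-batched-tokens 8096 \
    --max-num-seqs 1024 \
    --max-seq-len-to-capture 8192 \
    --num-lookahead-slots 0 \
    --num-scheduler-steps 8 \
    --response-role assistant \
    --scheduler-delay-factor 0 \
    --seed 0 \
    --swap-space 8 \
    --tokenizer-mode auto
```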
We have done some benchmarks recently and found similar performance degradation compared to direct deployment of vLLM. However, we recently added support for use_ray_spmd_worker and for the scheduler that only sends delta data (both are features of vLLM v0.6.3.post1). When we enable these two features, Llumnix's performance is slightly better than vLLM under a 1-instance deployment (request and decode latency are slightly lower than vLLM, while prefill latency is slightly higher). We are working on figuring out the exact reason and will sync with you once we have conclusions.
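For reference, a minimal sketch of how these features are typically enabled on the vLLM side via environment variables (the Llumnix-specific wiring is omitted here; in vLLM the SPMD worker path is generally used together with the Ray compiled DAG):

```bash
# Sketch: turn on vLLM's SPMD Ray worker and compiled-DAG execution path
# before launching the server. The delta-data scheduler mentioned above is a
# separate v0.6.3 feature; its Llumnix-side toggle is not shown here.
export VLLM_USE_RAY_SPMD_WORKER=1
export VLLM_USE_RAY_COMPILED_DAG=1

# Then launch the serving instance as usual, e.g.:
# python -m vllm.entrypoints.openai.api_server --model /path/to/qwen2-7b ...
```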
We have addressed this issue in a PR.
We encountered an issue during testing with Llumnix:
Comparison of Two Deployment Modes:
1. Direct vLLM deployment (5 instances)
2. Llumnix deployment (5 instances) with only load-based request dispatching enabled (--dispatch-policy load) and migration disabled (--enable-migration not passed); a rough sketch of both launch commands is below
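The sketch below is illustrative only: model paths are placeholders, and the Llumnix entrypoint shown follows the Llumnix documentation, so it may differ across versions.

```bash
# Mode 1: direct vLLM deployment, one OpenAI-compatible server per GPU (x5).
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/qwen2-7b \
    --port 8000   # plus the engine flags used for the test

# Mode 2: Llumnix deployment with load-based dispatch and migration disabled
# (--enable-migration is simply not passed). Entrypoint per the Llumnix docs.
python -m llumnix.entrypoints.vllm.api_server \
    --dispatch-policy load \
    --model /path/to/qwen2-7b \
    --port 8000   # plus the same engine flags
```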
Test Setup:
• Instances: 5
• Model: qwen2-7b
• GPU: 4090 24GB
• Concurrency: 120
• vLLM version: v0.6.3.post1
Findings:
We observed that the TTFT and TPOT for Llumnix are about 20% lower compared to vLLM (direct deployment). Additionally, the GPU usage is lower in Llumnix compared to vLLM.
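As a rough way to reproduce these numbers (a sketch, not our exact harness): TTFT/TPOT can be collected with vLLM's bundled serving benchmark, and GPU utilization sampled with nvidia-smi; the exact benchmark flags vary a bit across vLLM versions.

```bash
# Illustrative measurement setup (our actual load generator holds a fixed
# concurrency of 120; benchmark_serving.py flags vary by vLLM version).
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host 127.0.0.1 --port 8000 \
    --model /path/to/qwen2-7b \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 2000 \
    --request-rate 40   # the report includes mean/median/p99 TTFT and TPOT

# Sample GPU utilization once per second while the benchmark runs.
nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1
```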
Possible Cause:
We suspect the performance difference could be related to the Ray framework being used in Llumnix.
Questions:
• Have you done any performance comparisons like this before?
• Do you have any recommendations or insights to improve performance?