
Performance Comparison Between vLLM and Llumnix Deployment #100

Closed
Ronniexie opened this issue Feb 11, 2025 · 7 comments
Labels: enhancement (New feature or request)

Comments

@Ronniexie (Contributor)

We encountered an issue during testing with Llumnix:

Comparison of Two Deployment Modes:
1. Direct vLLM deployment (5 instances)
2. Llumnix deployment (5 instances) with only load-based dispatch enabled (--dispatch-policy load) and migration disabled (--enable-migration not set)

Test Setup:
• Instances: 5
• Model: Qwen2-7B
• GPU: NVIDIA RTX 4090 (24 GB)
• Concurrency: 120
• vLLM version: v0.6.3.post1

Findings:
We observed that the TTFT and TPOT for Llumnix are about 20% lower compared to vLLM (direct deployment). Additionally, the GPU usage is lower in Llumnix compared to vLLM.
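
(For reference on the metrics: TTFT is the time from sending a request to receiving its first output token, and TPOT is the average time between subsequent output tokens. Below is a minimal illustrative sketch of how such numbers are typically aggregated from per-request timestamps; the class and field names are hypothetical and not taken from our benchmark client.)

# Illustrative only: typical TTFT/TPOT aggregation from per-request timestamps.
# The RequestTrace structure is hypothetical, not the actual benchmark client.
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class RequestTrace:
    send_time: float          # when the request was sent
    token_times: list[float]  # arrival time of each output token

def ttft(r: RequestTrace) -> float:
    # Time To First Token: first token arrival minus request send time.
    return r.token_times[0] - r.send_time

def tpot(r: RequestTrace) -> float:
    # Time Per Output Token: average gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(r.token_times, r.token_times[1:])]
    return mean(gaps) if gaps else 0.0

def summarize(traces: list[RequestTrace]) -> dict:
    ttfts = sorted(ttft(r) for r in traces)
    tpots = sorted(tpot(r) for r in traces)
    p90 = lambda xs: xs[int(0.9 * (len(xs) - 1))]  # simple nearest-rank p90
    return {
        "mean_ttft": mean(ttfts),
        "median_tpot": median(tpots),
        "p90_ttft": p90(ttfts),
        "p90_tpot": p90(tpots),
    }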

Possible Cause:
We suspect the performance difference could be related to the Ray framework being used in Llumnix.

Questions:
• Have you done any performance comparisons like this before?
• Do you have any recommendations or insights to improve performance?

@ZeldaHuang (Contributor)

Hi, could you share the scripts and parameters for the vLLM/Llumnix deployment test? We have observed some performance degradation under high concurrency (mainly due to the overhead of Ray RPC) and are working on resolving it.

ZeldaHuang added the enhancement (New feature or request) label on Feb 12, 2025
ZeldaHuang marked this as a duplicate of #99 on Feb 12, 2025
ZeldaHuang marked this as a duplicate of #98 on Feb 12, 2025
@ZeldaHuang (Contributor)

Quoting the findings above:
“We observed that the TTFT and TPOT for Llumnix are about 20% lower compared to vLLM (direct deployment). Additionally, the GPU usage is lower in Llumnix compared to vLLM.”

Does that mean Llumnix has better performance (lower latency)? We haven't done comparison tests against direct vLLM deployment since upgrading to vLLM v0.6.3.post1; with the previous version we did not observe any performance issue compared with vLLM.

@Ronniexie (Contributor, Author)

Sorry for the confusion earlier! I meant to say that the TTFT and TPOT for Llumnix are about 20% higher than for vLLM (direct deployment), indicating that Llumnix performs worse than direct vLLM deployment.

For the test, we used the following parameters:

--tensor-parallel-size 1 --block-size 32 --cpu-offload-gb 0 --dtype bfloat16 --enable-prefix-caching --gpu-memory-utilization 0.83 --guided-decoding-backend outlines --kv-cache-dtype auto --load-format auto --max-logprobs 20 --max-model-len 4096 --max-num-batched-tokens 8096 --max-num-seqs 1024 --max-seq-len-to-capture 8192 --num-lookahead-slots 0 --num-scheduler-steps 8 --response-role assistant --scheduler-delay-factor 0 --seed 0 --swap-space 8 --tokenizer-mode auto
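
(For reference, these CLI flags map onto vLLM engine arguments roughly as sketched below, assuming vLLM v0.6.3.post1 field names; the model path is an assumption since only “qwen2-7b” was stated, and --response-role is an API-server option rather than an engine argument. This is an illustrative sketch, not the launch script we actually used.)

# Rough programmatic equivalent of the CLI flags above (vLLM v0.6.3.post1).
# Illustrative sketch only; the model path is an assumption.
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2-7B-Instruct",   # assumption: exact model path not given above
    tensor_parallel_size=1,
    block_size=32,
    cpu_offload_gb=0,
    dtype="bfloat16",
    enable_prefix_caching=True,
    gpu_memory_utilization=0.83,
    guided_decoding_backend="outlines",
    kv_cache_dtype="auto",
    load_format="auto",
    max_logprobs=20,
    max_model_len=4096,
    max_num_batched_tokens=8096,
    max_num_seqs=1024,
    max_seq_len_to_capture=8192,
    num_lookahead_slots=0,
    num_scheduler_steps=8,
    scheduler_delay_factor=0,
    seed=0,
    swap_space=8,
    tokenizer_mode="auto",
)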

For example, in single-instance tests we observed that GPU utilization with vLLM stays around 98%, while with Llumnix it is only about 96%.

Here are the results for each system:

vLLM (GPU 98%):
• “mean_ttft”: 572.82
• “median_tpot”: 21.78
• “p90_ttft”: 675.15
• “p90_tpot”: 27.68

Llumnix (GPU 96%):
• “mean_ttft”: 611.42
• “median_tpot”: 22.37
• “p90_ttft”: 741.75
• “p90_tpot”: 27.68
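
(Computed from the numbers above, this single-instance gap is roughly +6.7% on mean TTFT, +9.9% on p90 TTFT, and +2.7% on median TPOT; a trivial snippet for reproducing the arithmetic, with the values copied verbatim from this comment:)

# Relative gap between Llumnix and direct vLLM for the single-instance run above.
vllm_res = {"mean_ttft": 572.82, "median_tpot": 21.78, "p90_ttft": 675.15, "p90_tpot": 27.68}
llumnix_res = {"mean_ttft": 611.42, "median_tpot": 22.37, "p90_ttft": 741.75, "p90_tpot": 27.68}

for metric in vllm_res:
    gap = (llumnix_res[metric] / vllm_res[metric] - 1) * 100
    print(f"{metric}: +{gap:.1f}%")
# mean_ttft: +6.7%, median_tpot: +2.7%, p90_ttft: +9.9%, p90_tpot: +0.0%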

Additionally, we tried changing the executor_class from LlumnixRayGPUExecutor to GPUExecutorAsync to reduce the impact of RPC, but the performance change was minimal.

Do you have any other ideas worth investigating? If so, we can assist with testing and verification.

@s5u13b (Contributor) commented Feb 21, 2025

We have done some benchmarks recently and found similar performance degradation compared to direct vLLM deployment. However, we recently added support for use_ray_spmd_worker and for the scheduler that only sends delta data (both features of vLLM v0.6.3.post1). With these two features enabled, we observe that Llumnix performs slightly better than vLLM under a 1-instance deployment (request and decode latency are slightly lower than vLLM, while prefill latency is slightly higher). We are working on figuring out the exact reason and will sync with you once we have conclusions.
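
(For anyone who wants to try the same configuration: in vLLM v0.6.3.post1 the SPMD Ray worker is gated behind environment variables. A hedged sketch, assuming the stock vLLM switches apply unchanged when running under Llumnix:)

# Hedged sketch: enabling vLLM's Ray SPMD worker path (vLLM v0.6.3.post1).
# Assumption: these stock vLLM environment switches behave the same under Llumnix.
import os

# Both must be set before vLLM is imported/initialized; the SPMD worker
# requires the Ray compiled-DAG execution path.
os.environ["VLLM_USE_RAY_SPMD_WORKER"] = "1"
os.environ["VLLM_USE_RAY_COMPILED_DAG"] = "1"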

@Ronniexie (Contributor, Author)

We have identified a performance issue in Llumnix via torch.profiler. It turns out that when using Llumnix, the _process_model_outputs function runs synchronously, which blocks the execution of the next step and hurts performance.

(torch.profiler trace attached as an image)

The root cause appears to be the initialization of self.engine.scheduler with SchedulerLlumnix:

self.engine.scheduler = [
    SchedulerLlumnix(self.engine.scheduler_config,
                     self.engine.cache_config,
                     self.engine.lora_config)
    for _ in range(engine_args.pipeline_parallel_size)
]

This initialization leaves the use_async_output_proc configuration set to False, which is the cause of the performance degradation.
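
(The rough shape of a possible fix is sketched below. This assumes vLLM v0.6.3.post1's Scheduler signature of (scheduler_config, cache_config, lora_config, pipeline_parallel_size, output_proc_callback), that SchedulerLlumnix forwards the extra constructor arguments to vLLM's Scheduler, and that the engine's existing per-virtual-engine async callbacks can be reused; it is not the final patch.)

# Hedged sketch of one possible fix: pass vLLM's async output-processing
# callback through so use_async_output_proc stays effective. Assumes the
# vLLM v0.6.3.post1 Scheduler constructor signature and that SchedulerLlumnix
# forwards these extra arguments to the base Scheduler.
self.engine.scheduler = [
    SchedulerLlumnix(
        self.engine.scheduler_config,
        self.engine.cache_config,
        self.engine.lora_config,
        engine_args.pipeline_parallel_size,
        # Reuse the engine's per-virtual-engine callback (a partial of
        # _process_model_outputs) only when async output processing is enabled.
        self.engine.async_callbacks[v_id]
        if self.engine.model_config.use_async_output_proc else None,
    )
    for v_id in range(engine_args.pipeline_parallel_size)
]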

Would it be possible for me to be added as a Contributor so that I can submit the modified code?

@s5u13b (Contributor) commented Feb 21, 2025

Thank you for your work! You can fork our repo and submit a PR. By the way, the root cause you found also explains why the performance degradation does not happen when use_ray_spmd_worker is enabled: vLLM v0.6.3.post1 does not support asynchronous output processing when use_ray_spmd_worker is enabled.
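
(A quick way to confirm which mode a given engine ended up in, assuming the stock vLLM v0.6.3.post1 attribute names and an illustrative model path:)

# Hedged sketch: check whether asynchronous output processing is active.
# Assumes vLLM v0.6.3.post1 attribute names; the model path is illustrative.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen2-7B-Instruct")
)
print(engine.engine.model_config.use_async_output_proc)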

Ronniexie pushed a commit to Ronniexie/llumnix that referenced this issue Feb 21, 2025
Ronniexie added a commit to Ronniexie/llumnix that referenced this issue Feb 21, 2025
Ronniexie added a commit to Ronniexie/llumnix that referenced this issue Feb 24, 2025
@s5u13b (Contributor) commented Feb 26, 2025

We have addressed this issue in a PR.

s5u13b closed this as completed on Feb 26, 2025