Add Splitwise implementation to vLLM #2809
base: main
Conversation
Initialize MSCCL++ communication group if sep_prompt_token is set in ParallelConfig. Also add documentation for MSCCL++ installation.
- Add worker_type to differentiate prompt, token, and mixed workers
- Set a driver for the prompt machines and the token machines
- Allow broadcasts to take a group
- Set up KV cache communication using MSCCL++
- Add a test for KV cache communication
- Obtain `blocks_to_nw` when creating batches in the scheduler. Coalesce blocks where possible for fast network transfers.
- Use a Sequence-to-Semaphore Mapper to allow fine-grained waiting on the KV-cache transfer per sequence.
- Separately run prompt and token workers using the `_run_stage_workers` helper.
- Populate the KVCacheCommunicator for all PagedAttention modules, which allows layer-wise sends to be implemented from within `attention.py` (a rough sketch follows below).
- Populate the destination rank for the Sampler, which will be used as the root in `gather` operations.
- Fix `tensor_model_parallel_gather`: use the global rank instead of the group-local rank.
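For orientation, here is a rough sketch of that layer-wise send flow. The `put` and `signal_and_flush` method names appear later in this thread, but the class shape and bodies below are assumptions, not the PR's actual code.

```python
# Rough sketch of a layer-wise KV-cache send. Class shape and bodies are
# assumptions for illustration, not the PR's implementation.
from typing import Dict, List, Tuple

import torch


class KVCacheCommunicatorSketch:
    """Toy stand-in for the PR's KV-cache communicator."""

    def __init__(self) -> None:
        self.pending: List[Tuple[int, int]] = []  # (layer_id, block_id)

    def put(self, layer_id: int, block_id: int, kv: torch.Tensor) -> None:
        # The real PR would issue an asynchronous MSCCL++ transfer of this
        # block's K/V data to the destination token worker here.
        self.pending.append((layer_id, block_id))

    def signal_and_flush(self) -> None:
        # The real PR would signal per-sequence semaphores and flush all
        # outstanding transfers; here we only clear the bookkeeping.
        self.pending.clear()


def send_layer_kv(comm: KVCacheCommunicatorSketch, layer_id: int,
                  kv_blocks: Dict[int, torch.Tensor]) -> None:
    """Send a layer's KV blocks as soon as the layer has computed them."""
    for block_id, kv in kv_blocks.items():
        comm.put(layer_id, block_id, kv)


if __name__ == "__main__":
    comm = KVCacheCommunicatorSketch()
    blocks = {0: torch.zeros(16, 128), 1: torch.zeros(16, 128)}
    for layer_id in range(2):   # one send per transformer layer
        send_layer_kv(comm, layer_id, blocks)
    comm.signal_and_flush()     # flush once all layers have been issued
```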
Splitting the prefilling and decoding onto different GPUs is an excellent idea, for example prefilling on H100 and decoding on A100, since the H100 has 3.43x more compute but only 1.6x more memory bandwidth, which means it would be more cost-efficient to use the H100 only for prefilling. In the paper, you demonstrate a 2.35x increase in throughput at the same cost, or 1.4x higher throughput at 20% cost savings. Are you able to reproduce these numbers in this PR?
@casper-hansen In the paper, we use the Splitwise simulator to simulate a 40-node cluster, both for the homogeneous (Splitwise-AA/HH) and the heterogeneous (Splitwise-HHcap/HA) solutions. That simulation allows us to run a production trace through these systems at various requests per second (rps) under SLOs, which is what lets us calculate the maximum throughput under a given cost/power, leading to the scaled results. The code we just pushed only allows us to build the prototype of that solution, since it does not include the optimized cluster-level scheduler. The prototype has been developed and tested on 1 prompt machine and 1 token machine. Therefore, the main point of this PR is to show the optimized KV cache transfer time, rather than the at-scale results. Hope this answers your question.
@GindaChen Hey Junda, can you help take a look at the PR and leave some comments? Thanks!
@aashaka This is a very promising PR to integrate Splitwise into vLLM! I will try to finish up the review today. From what I have understood, this PR only tries to introduce the Splitwise mode, stage parallelism, and KV cache transfer into vLLM.
Awesome changes so far! Please see the comments for requested changes or feedback on future design changes.
I may still have another round of review on comm_utils.py and after running the tests. So far I am stuck on the installation of the MSCCL++ library for some weird reason; I can post the error/environment once I figure out a path to testing.
vllm/worker/worker.py (outdated)

# Populate Sampler with dst_rank as driver worker's rank.
self.model_runner.model.sampler.set_dst_rank(self.model_runner.driver_rank)

def init_mscclpp_comm(self, mscclpp_init_method: Optional[str] = None) -> None:
Recommend separating the initialization of communication into a separate class or function (rather than doing it inside the worker). In my opinion, the vLLM codebase may benefit from having a better abstraction for communication, say:
class CommManager:
...
def init_comm(...): ...
def destroy_comm(...): ...
def get_group(self, group_type): ... # say tensor / stage / pipeline parallel group
...
class Worker:
def __init__(self, ...):
self.comm_manager = CommManager(...)
I'm open to having this as a future refactor, but I want to see if you think separating this logic now is feasible.
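For concreteness, one way the suggested abstraction could be fleshed out; everything below (method bodies, group bookkeeping) is illustrative and not part of the PR.

```python
# One possible fleshing-out of the suggested CommManager abstraction.
# All names and behavior here are assumptions for illustration.
from enum import Enum
from typing import Dict, List


class GroupType(Enum):
    TENSOR = "tensor"
    PIPELINE = "pipeline"
    STAGE = "stage"  # prompt/token stage-parallel group


class CommManager:
    def __init__(self, world_size: int, rank: int):
        self.world_size = world_size
        self.rank = rank
        self.groups: Dict[GroupType, List[int]] = {}

    def init_comm(self) -> None:
        # A real implementation would call
        # torch.distributed.init_process_group(...) here and, when stage
        # parallelism is enabled, set up the MSCCL++ group as well.
        self.groups[GroupType.STAGE] = list(range(self.world_size))

    def get_group(self, group_type: GroupType) -> List[int]:
        return self.groups[group_type]

    def destroy_comm(self) -> None:
        self.groups.clear()


class Worker:
    def __init__(self, world_size: int, rank: int):
        # The worker owns a CommManager instead of initializing
        # communication inline, as the review comment suggests.
        self.comm_manager = CommManager(world_size, rank)
        self.comm_manager.init_comm()
```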
`MSCCL++ <https://github.com/microsoft/mscclpp>`_ is a GPU-driven communication stack for scalable AI applications.
It is used to implement KV cache communication in Splitwise.

To install MSCCL++, please follow the instructions at `MSCCL++ Quickstart <https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md>`_ or follow the steps below to install it from source:
Just a heads up that the system may need `apt install libnuma-dev` (`libnuma1`) prior to running `make` (I hit this error during installation).
Thanks for the heads up. I will add this to the installation instructions both here and in the MSCCL++ repo. Also, feel free to reach out to me if you have any other problems with MSCCL++ setup.
@@ -369,14 +369,22 @@ def __init__(
        worker_use_ray: bool,
        max_parallel_loading_workers: Optional[int] = None,
        disable_custom_all_reduce: bool = False,
        sep_prompt_token: bool = False,
The modification to the config looks good to me. I guess there may be a future extension to pass in the number of prompt/token workers, and so far the abstraction looks good.
vllm/utils.py (outdated)

class SeqToSlotMapper:
    """ SeqToSlotMapper maps sequence ids to a limited set of slot ids.
    A slot is freed every time a sequence finishes. It is used to manage
It would be helpful to introduce what a "slot" is here.
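For readers following along, a minimal sketch of the slot idea, assuming a fixed pool of slot ids reused as sequences finish; this is an illustration, not the PR's exact implementation.

```python
# Minimal sketch of a sequence-id -> slot-id mapper with a fixed number of
# slots that are reused as sequences finish. Interface is assumed.
from typing import Dict, List


class SeqToSlotMapperSketch:
    def __init__(self, num_slots: int):
        self.free_slots: List[int] = list(range(num_slots))
        self.seq_to_slot: Dict[int, int] = {}

    def set_seq(self, seq_id: int) -> int:
        # Assign the next free slot to a newly scheduled sequence.
        slot = self.free_slots.pop(0)
        self.seq_to_slot[seq_id] = slot
        return slot

    def free_seq(self, seq_id: int) -> None:
        # Return the slot to the pool when the sequence finishes.
        self.free_slots.append(self.seq_to_slot.pop(seq_id))

    def get_slot(self, seq_id: int) -> int:
        return self.seq_to_slot[seq_id]


if __name__ == "__main__":
    mapper = SeqToSlotMapperSketch(num_slots=2)
    slot = mapper.set_seq(100)
    mapper.free_seq(100)
    assert mapper.set_seq(200) == slot  # the freed slot is reused
```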
all_outputs = self._run_stage_workers(
    "execute_model",
    prompt_stage=seq_group_metadata_list[0].is_prompt,
    driver_kwargs={
        "seq_group_metadata_list": seq_group_metadata_list,
        "blocks_to_swap_in": scheduler_outputs.blocks_to_swap_in,
        "blocks_to_swap_out": scheduler_outputs.blocks_to_swap_out,
        "blocks_to_copy": scheduler_outputs.blocks_to_copy,
        "blocks_to_nw": scheduler_outputs.blocks_to_nw,
    })
Just want to confirm - this means prompt workers cannot work concurrently with decode workers because each time we only schedule one set of workers to run.
I understand this PR is a prototype to verify KV cache transfer between prompt/decode workers, so the performance of running sets of workers concurrently isn't the focus. I do want to point out that making workers run concurrently (and the communication between schedulers) turns out to be one of the major design challenges, one that could potentially break vLLM's current architecture.
I would be more than happy to hear if you have a great solution! I'm also open to talking in detail about our design and the vLLM team's concerns about the architectural change.
With that said, I am personally okay with leaving this code as-is in this PR so we can demonstrate the KV cache transfer.
Recommend adding a TODO to say something like "this doesn't achieve the best performance because we only schedule one set of workers at a time" so people don't get confused.
@GindaChen thanks for all the comments. I have updated the PR accordingly. Please let me know if you have any further comments.
@aashaka Thanks for the heads up! I will take a look at the changes soon!
> this means prompt workers cannot work concurrently with decode workers

I do not quite understand this part. Since decode workers are provisioned separately, why can't they run concurrently with prompt workers here?

> because each time we only schedule one set of workers to run.

Does "one set of workers" mean a <prompt, worker> pair? Are you trying to say that it currently doesn't support <List, List> to accelerate the prefill and decoding phases (which would require model parallelism)?
Hello, I have encountered a few problems when reproducing the results. First, some difficulty arose when installing the MSCCL++ library. It requires at least CUDA 11 (I don't know exactly which version); 10.2 will not work for sure. You also need cmake >= 3.25.0. The main difficulty with the installation was installing libibverbs-dev correctly: version 17.1 of libibverbs-dev will not work because of errors during the build process, so you need at least 28; I was able to build with libibverbs-dev >= 36.0-1. However, I had to add …

Second, I don't have an "eth0" interface on my machine. There are others in …

Could you give me a hint? Maybe some profiling information can be gathered to see what the problem might be.
@valvarl, the communication setup will require InfiniBand support. Looks like …
Hi, does this mean that InfiniBand support is a must-have to enable the Splitwise feature?
Unfortunately, I don't have physical access to the server. However, I can see some information about available connections on it.
I am not familiar with the RDMA library and I don't understand how to use …
@leiwen83 Currently, yes, that is the case. While it has been a low-priority item for us, we do have a plan to support Ethernet in the future.
Hi @aashaka, what are the system requirements to run this Splitwise implementation?
Hi. I was trying to run this PR on my machine. I tested the code by running 'tests/distributed/test_kvcache_comm.py', but I am getting the error below.
It seems like the kvcache_comm_manager.put method works without any problem (inside attention.py), but kvcache_comm_manager.signal_and_flush gets this error inside worker.py. I couldn't figure out the source of the problem. Does this error message say something to you?
self.world_size = pipeline_parallel_size * tensor_parallel_size
if sep_prompt_token:
    # Half of the workers are prompt workers and the other half are token
Should it be a fixed split? If the workers use the exact same GPUs, it seems the prompt side may need more workers?
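For illustration, a minimal sketch of the even split implied by the code comment above, assuming the first half of the ranks serve prompts; a configurable ratio, as this question suggests, would replace the fixed halves.

```python
# Sketch of assigning worker roles under an even prompt/token split.
# The rank-based split is an assumption inferred from the code comment
# above, not the PR's exact logic.
from enum import Enum


class WorkerType(Enum):
    PROMPT = "prompt"
    TOKEN = "token"
    MIXED = "mixed"


def assign_worker_type(rank: int, world_size: int,
                       sep_prompt_token: bool) -> WorkerType:
    if not sep_prompt_token:
        return WorkerType.MIXED
    # First half of the ranks handle prompts, second half handles tokens.
    return WorkerType.PROMPT if rank < world_size // 2 else WorkerType.TOKEN


if __name__ == "__main__":
    world_size = 8
    roles = [assign_worker_type(r, world_size, True) for r in range(world_size)]
    assert roles[:4] == [WorkerType.PROMPT] * 4
    assert roles[4:] == [WorkerType.TOKEN] * 4
```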
@@ -116,8 +116,13 @@ def __init__(
        # Profile the memory usage and initialize the cache.
        self._init_cache()

        if self.parallel_config.sep_prompt_token:
            # Setup the MSCCL++ communication required for KV cache transfer
            self._setup_kvcache_comm()
Just curious: why do we need MSCCL++ for communication? Are there other options, like torch.distributed with the NCCL backend? Any analysis on the communication collective library used for KV cache transfer?
@@ -229,6 +250,7 @@ def _init_workers_ray(self, placement_group: "PlacementGroup",

        distributed_init_method = get_distributed_init_method(
            driver_ip, get_open_port())
        mscclpp_init_method = f"eth0:{driver_ip}:{get_open_port()}" if self.parallel_config.sep_prompt_token else None
Minor: can we leave a TODO here?
- I feel this is limited if we want to use a high-speed network interface. It would be great to extract an env var like NCCL_SOCKET_IFNAME.
- I do not know whether MSCCL++ is compatible with other high-speed interfaces.
Besides, I don't think everyone's default network interface is eth0. I agree with @Jeffwan's suggestion, or at least let the user specify which network interface they would like to use.
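As a concrete version of that suggestion, the interface could be read from an environment variable instead of being hard-coded; the variable name used below is hypothetical.

```python
# Illustration of selecting the MSCCL++ socket interface from an environment
# variable instead of hard-coding "eth0". The name VLLM_MSCCLPP_IFNAME is
# hypothetical, chosen only for this example.
import os


def build_mscclpp_init_method(driver_ip: str, port: int) -> str:
    ifname = os.environ.get("VLLM_MSCCLPP_IFNAME", "eth0")
    return f"{ifname}:{driver_ip}:{port}"


if __name__ == "__main__":
    print(build_mscclpp_init_method("10.0.0.1", 29500))
```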
What are the features provided by your cluster-level scheduler in this case? Something like prompt and decode machine collaboration? I am trying to get more details here. Seems …
            len(HEAD_TYPES)) + layer_id * len(HEAD_TYPES) + head_type
        torch.cuda.synchronize()

    def send_recv_kvcache_all(self):
Is this method only used in testing? I cannot find any other references. If so, can you add some comments?
Hi all, I was wondering when this will be merged into the main branch?
Any update?
Same question: will the cluster-level scheduler be released in the vLLM repo or in another repo? @aashaka
I have the same issue.
Did you fix this problem? Maybe you can help me, please.
This seems to be a problem with MSCCL++; there are some similar issues in the mscclpp GitHub repo.
I currently have a question: the paper states that per-layer transfer can accelerate GPU memory release during the prompt phase, but I seem to be unable to find where this is implemented in the code. Could you please clarify whether the GPU memory is immediately released after completing the KV transfer for a certain layer, or is it retained until the end of the prompt phase?
This pull request has merge conflicts that must be resolved before it can be merged.
If there isn't any InfiniBand or NVLink on my machine, how can I use this technique to separate prefill and decode?
Is this still ongoing, given that we now have #10502?
This PR follows up on #2472 to implement the prompt and token stage parallelism introduced in Splitwise.
On enabling the `--sep-prompt-token` flag, the first half of the workers are assigned to process prompts and the second half perform token sampling. The KV-cache state is communicated over the network in a layer-wise manner as soon as it is ready on the prompt side. We use the MSCCL++ communication library to perform fast asynchronous KV-cache transfers over the IB fabric.

This PR makes the following changes:
Installation dependencies:
We use the MSCCL++ collective communication library for KV-cache transfers.
Please follow the instructions at MSCCL++ Quickstart or follow the steps below to install it from source:
Make sure that `$MSCCLPP_HOME` is set to the installation directory, or run `sudo make install`.
Tests:
This PR has been tested in the following scenarios.
Validating communication of KV cache:
Command used:
python tests/distributed/test_kvcache_comm.py
Result: Runs without assertion errors.
Without MSCCL++ environment, no stage parallelism:
Command used:
python examples/llm_engine_example_single.py --tensor-parallel-size 8 --model bigscience/bloom
Result: Runs like normal.
With stage parallelism:
Command used:
python examples/llm_engine_example_single.py --tensor-parallel-size 8 --model bigscience/bloom --sep-prompt-token
Result: Same output as before.
`llm_engine_example_single.py` is `llm_engine_example.py` with n=1 and deterministic SamplingParameters.

Known issues: