[DistEnv] Strategy for ray on multiple nodes, with sharding as option #1622
There are multiple ways to achieve model parallelism for general torch models.

The above are the two most popular libraries that enable model parallelism. For example, DeepSpeed takes your model, partitions it, and manages the communication between the partitions across multiple GPUs. So if a user has a model, they can use DeepSpeed to shard it across multiple GPUs. Let's consider two scenarios:

1. The user has a single machine with multiple GPUs.
2. The user has multiple machines, each with multiple GPUs.
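A minimal sketch of scenario 1, assuming a DeepSpeed version whose `deepspeed.init_inference` accepts `mp_size` (newer releases use `tensor_parallel={"tp_size": 2}` instead). The GPT-2 model is only an example stand-in:

```python
# shard_single_node.py -- sketch of scenario 1: one machine, model sharded on 2 GPUs.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# The deepspeed launcher starts one process per GPU and sets LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# DeepSpeed partitions the model's weights across the 2 local GPUs and
# manages the intra-node communication between the partitions.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                      # shard across 2 GPUs on this machine
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=20)
if local_rank == 0:
    print(tokenizer.decode(outputs[0]))
```

This would be launched with `deepspeed --num_gpus 2 shard_single_node.py`, so each GPU gets one rank of the sharded model.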
From now on, we will be referring to the scenario list above.

If the user falls under scenario 1, using DeepSpeed directly will suffice and they can achieve model parallelism, but I'm not sure whether we will have a dashboard to view the process.

Now comes scenario 2. If the user has 2 machines, sharding the model across both machines might not be the best approach, since inter-node communication can become a bottleneck. Moreover, inter-node model parallelism with DeepSpeed requires some manual steps, such as creating a hostfile. Ray, however, can handle the inter-node communication very elegantly.

[diagram: Ray distributes the input batch across two machines; within each machine, DeepSpeed shards a copy of the model across 2 GPUs]

Let's look at the diagram above. The blue box is Ray, which takes a batch of input data and distributes it across the two machines; each partitioned batch is then processed by DeepSpeed, which holds a copy of the model sharded across the 2 GPUs in that machine. If the batch size is 16, a partition of 8 samples goes to machine 1, where the model is partitioned across that machine's 2 GPUs with DeepSpeed; the same happens on machine 2. The results are computed, gathered back by Ray, and returned to the client node.
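A minimal sketch of this flow, under stated assumptions: `ShardedWorker` and its plain torch stand-in model are hypothetical; a real setup would initialize a DeepSpeed-sharded model across each node's 2 GPUs inside the actor, which needs extra distributed setup not shown here.

```python
# Sketch of scenario 2: Ray splits the batch across two machines, each of
# which holds its own (DeepSpeed-)sharded copy of the model.
import ray
import torch

ray.init(address="auto")  # connect to an existing multi-node Ray cluster

@ray.remote(num_gpus=2)   # pin one worker per 2-GPU machine
class ShardedWorker:
    def __init__(self):
        # Stand-in for a DeepSpeed-sharded model spanning this node's 2 GPUs.
        self.model = torch.nn.Linear(128, 10).cuda().eval()

    def predict(self, batch: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.model(batch.cuda()).cpu()

workers = [ShardedWorker.remote() for _ in range(2)]

# Batch size 16 -> two partitions of 8, one per machine, gathered by ray.get.
batch = torch.randn(16, 128)
parts = torch.chunk(batch, len(workers))
results = ray.get([w.predict.remote(p) for w, p in zip(workers, parts)])
output = torch.cat(results)
print(output.shape)  # torch.Size([16, 10])
```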
@kartik4949 great explanation! Questions:
vLLM uses a method like this to support tensor model parallelism. It won't use what's here, but our support for basic transformers or torch models can use this.
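For reference, vLLM exposes tensor model parallelism through `tensor_parallel_size` and uses Ray under the hood for its multi-GPU workers; the model name below is only an example and assumes the weights are available:

```python
# Small sketch of vLLM's built-in tensor model parallelism across 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-6.7b", tensor_parallel_size=2)  # shard across 2 GPUs
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Ray and DeepSpeed can be combined to"], params)
print(outputs[0].outputs[0].text)
```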
Great. Otherwise, the model will still load onto the local machine, which would make this feature somewhat underwhelming.
@kartik4949 to add information, discussion points, diagrams, links.