[DistEnv] Strategy for ray on multiple nodes, with sharding as option #1622
There are multiple ways to achieve model parallelism for general torch models.

The above are the two most popular libraries that enable model parallelism. For example, DeepSpeed takes your model, partitions it, and manages the communication between the partitions across multiple GPUs. So if a user has a model, they can use DeepSpeed to shard it across multiple GPUs. Let's consider two scenarios:

1. The user has a single machine with multiple GPUs.
2. The user has multiple machines, each with multiple GPUs.
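A minimal sketch of scenario 1, assuming a DeepSpeed version whose `deepspeed.init_inference` accepts `mp_size` (newer releases use `tensor_parallel={"tp_size": 2}` instead). The GPT-2 model is only an example stand-in:

```python
# shard_single_node.py -- sketch of scenario 1: one machine, model sharded on 2 GPUs.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# The deepspeed launcher starts one process per GPU and sets LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# DeepSpeed partitions the model's weights across the 2 local GPUs and
# manages the intra-node communication between the partitions.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                      # shard across 2 GPUs on this machine
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=20)
if local_rank == 0:
    print(tokenizer.decode(outputs[0]))
```

This would be launched with `deepspeed --num_gpus 2 shard_single_node.py`, so each GPU gets one rank of the sharded model.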
From now on, we will be referring to the scenario list above.

If the user falls under scenario 1, using DeepSpeed directly will suffice and they can achieve model parallelism, but I'm not sure whether we will have a dashboard to view the process.

Now comes scenario 2. If the user has 2 machines, sharding the model across both machines might not be the best approach, since inter-node communication can become a bottleneck. Moreover, inter-node model parallelism with DeepSpeed requires some manual steps, such as creating a hostfile. Ray, however, can handle the inter-node communication very elegantly.

[diagram: Ray distributes the input batch across two machines; within each machine, DeepSpeed shards a copy of the model across 2 GPUs]

Let's look at the diagram above. The blue box is Ray, which takes a batch of input data and distributes it across the two machines; each partitioned batch is then processed by DeepSpeed, which holds a copy of the model sharded across the 2 GPUs in that machine. If the batch size is 16, a partition of 8 samples goes to machine 1, where the model is partitioned across that machine's 2 GPUs with DeepSpeed; the same happens on machine 2. The results are computed, gathered back by Ray, and returned to the client node.
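A minimal sketch of this flow, under stated assumptions: `ShardedWorker` and its plain torch stand-in model are hypothetical; a real setup would initialize a DeepSpeed-sharded model across each node's 2 GPUs inside the actor, which needs extra distributed setup not shown here.

```python
# Sketch of scenario 2: Ray splits the batch across two machines, each of
# which holds its own (DeepSpeed-)sharded copy of the model.
import ray
import torch

ray.init(address="auto")  # connect to an existing multi-node Ray cluster

@ray.remote(num_gpus=2)   # pin one worker per 2-GPU machine
class ShardedWorker:
    def __init__(self):
        # Stand-in for a DeepSpeed-sharded model spanning this node's 2 GPUs.
        self.model = torch.nn.Linear(128, 10).cuda().eval()

    def predict(self, batch: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.model(batch.cuda()).cpu()

workers = [ShardedWorker.remote() for _ in range(2)]

# Batch size 16 -> two partitions of 8, one per machine, gathered by ray.get.
batch = torch.randn(16, 128)
parts = torch.chunk(batch, len(workers))
results = ray.get([w.predict.remote(p) for w, p in zip(workers, parts)])
output = torch.cat(results)
print(output.shape)  # torch.Size([16, 10])
```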
@kartik4949 great explanation! Questions:
vLLM uses a method like this to support tensor model parallelism. It won't use what's here, but our support for basic transformers or torch models can use this.
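For reference, vLLM exposes tensor model parallelism through `tensor_parallel_size` and uses Ray under the hood for its multi-GPU workers; the model name below is only an example and assumes the weights are available:

```python
# Small sketch of vLLM's built-in tensor model parallelism across 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-6.7b", tensor_parallel_size=2)  # shard across 2 GPUs
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Ray and DeepSpeed can be combined to"], params)
print(outputs[0].outputs[0].text)
```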
Great. Otherwise, the model will still load onto the local machine, which would make this feature somewhat underwhelming.
@kartik4949 to add information, discussion points, diagrams, links.