
Recommendations for multi-node training of a 7B model with RL #487

Open
zhudefa opened this issue Dec 17, 2024 · 1 comment

Comments


zhudefa commented Dec 17, 2024

If I want to run multi-node RL training experiments for a 7B model, what is the recommended configuration? Should actor_num_gpus_per_node be set to multiples of 7?

Is it also necessary to launch in the same way as the 70B model, using the following command?
source configs/beaker_configs/ray_node_setup.sh && python open_instruct/ppo_vllm_thread_ray_gtrl.py

vwxyzjn (Collaborator) commented Dec 18, 2024

Hi @zhudefa,

It doesn't have to be multiples of 7.

7 is fine for a single-node setting: we use 7 GPUs for training and 1 GPU for inference. The flag is used like this:

--actor_num_gpus_per_node 7 8 8 8 means using 7 GPUs on the first node for training and 8 GPUs on each of the next 3 nodes for training.
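
For illustration, a hypothetical 4-node launch combining the two pieces above might look like the following (a minimal sketch; --actor_num_gpus_per_node is the flag discussed in this thread, and the remaining training arguments are omitted):

source configs/beaker_configs/ray_node_setup.sh && python open_instruct/ppo_vllm_thread_ray_gtrl.py \
    --actor_num_gpus_per_node 7 8 8 8 \
    <other training arguments>

With 7 training GPUs on the first node, the remaining GPU on that node is left free for the vLLM inference engine, mirroring the single-node split described above.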

Is it also necessary to launch in the same way as the 70B model, using the following command?
source configs/beaker_configs/ray_node_setup.sh && python open_instruct/ppo_vllm_thread_ray_gtrl.py

Yes. ray_node_setup.sh sets up the multi-node Ray cluster, connecting each worker node to the main Ray head node.
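
As a rough sketch, a multi-node Ray setup typically boils down to the following, using Ray's standard CLI (the actual configs/beaker_configs/ray_node_setup.sh may differ, and HEAD_NODE_IP is a hypothetical variable used only for illustration):

# On the head node: start the Ray head process
ray start --head --port=6379

# On each worker node: connect to the head node's address
ray start --address=$HEAD_NODE_IP:6379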
