Update train.py #109

Open · wants to merge 1 commit into main
Conversation

Li-Guanda

A command like "python train.py task=Ant headless=True sim_device=cpu rl_device=cpu" does not work correctly. The reason is that "rlg_config_dict" does not include the "rl_device" setting.

In "a2c_common.py" of "rl_games" there is the line "self.ppo_device = config.get('device', 'cuda:0')", so the RL algorithm always falls back to cuda:0.
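Roughly, the change amounts to copying rl_device into the dict before it is handed to rl_games. A minimal sketch (variable and helper names follow what train.py already uses; the actual commit may differ):

# Sketch of the kind of change: make sure the device reaches rl_games
# instead of its hard-coded 'cuda:0' default.
rlg_config_dict = omegaconf_to_dict(cfg.train)                        # existing line in train.py
rlg_config_dict['params']['config']['device'] = cfg.rl_device        # read by a2c_common.py
rlg_config_dict['params']['config']['device_name'] = cfg.rl_device   # read by player.py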
@tylerlum

tylerlum commented Apr 11, 2023

I encountered the same issue! This fix works, but I think a cleaner solution would be to avoid changing train.py (it feels more like a hack) and instead modify all the *PPO.yaml files (e.g. AntPPO.yaml).

We should add the following under params.config:

params:
....
  config:
....
    device: ${resolve_default:cuda:0,${....rl_device}}  # Used in rl_games/common/a2c_common.py
    device_name: ${resolve_default:cuda:0,${....rl_device}}  # Used in rl_games/common/player.py

This is similar to other config values like

    name: ${resolve_default:Ant,${....experiment}}
    multi_gpu: ${....multi_gpu}
    num_actors: ${....task.env.numEnvs}
    max_epochs: ${resolve_default:500,${....max_iterations}}

which pull values from the top-level config (which has rl_device) while still providing a default.
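For illustration, this is how the resolve_default resolver behaves. It is registered in isaacgymenvs/train.py; the standalone sketch below re-registers an equivalent resolver (the exact registration in the repo may differ in minor details):

from omegaconf import OmegaConf

# Equivalent of the project's resolve_default resolver: use the override if it
# is set, otherwise fall back to the default given in the yaml.
OmegaConf.register_new_resolver(
    'resolve_default', lambda default, arg: default if arg == '' else arg
)

cfg = OmegaConf.create({
    'max_iterations': '',                  # empty string means "not overridden"
    'train': {'params': {'config': {
        'max_epochs': '${resolve_default:500,${....max_iterations}}',
    }}},
})

print(cfg.train.params.config.max_epochs)  # -> 500 (falls back to the default)
cfg.max_iterations = 1000
print(cfg.train.params.config.max_epochs)  # -> 1000 (the override wins)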

@utomm

utomm commented Apr 25, 2023

Hi, thanks for the fix and discussion. Your solution works well with a device like 'cuda:2' (#129).

However, when using rl_device=cpu the process still crashes. In the training case, python train.py task=Ant headless=True sim_device=cpu rl_device=cpu crashes before the first update of the policy:

Error executing job with overrides: ['task=Ant', 'headless=True', 'sim_device=cpu', 'rl_device=cpu']
Traceback (most recent call last):
  File "train.py", line 161, in launch_rlg_hydra
    'sigma' : None
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 120, in run
    self.run_train(args)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 101, in run_train
    agent.train()
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/common/a2c_common.py", line 1173, in train
    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/common/a2c_common.py", line 1059, in train_epoch
    a_loss, c_loss, entropy, kl, last_lr, lr_mul, cmu, csigma, b_loss = self.train_actor_critic(self.dataset[i])
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/a2c_continuous.py", line 159, in train_actor_critic
    self.calc_gradients(input_dict)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/a2c_continuous.py", line 135, in calc_gradients
    self.scaler.scale(loss).backward()
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 162, in scale
    assert outputs.is_cuda or outputs.device.type == 'xla'
AssertionError

In the testing case, python train.py task=Ant headless=True sim_device=cpu rl_device=cpu test=True, the tensors-on-different-devices error appears again. The output is:

Error executing job with overrides: ['task=Ant', 'headless=True', 'sim_device=cpu', 'rl_device=cpu', 'test=True']
Traceback (most recent call last):
  File "train.py", line 161, in launch_rlg_hydra
    'sigma' : None
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 123, in run
    self.run_play(args)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 108, in run_play
    player.run()
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/common/player.py", line 208, in run
    action = self.get_action(obses, is_determenistic)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/players.py", line 55, in get_action
    res_dict = self.model(input_dict)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/models.py", line 246, in forward
    input_dict['obs'] = self.norm_obs(input_dict['obs'])
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/models.py", line 49, in norm_obs
    return self.running_mean_std(observation) if self.normalize_input else observation
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hu/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/running_mean_std.py", line 79, in forward
    y = (input - current_mean.float()) / torch.sqrt(current_var.float() + self.epsilon)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@tylerlum

NOTE: I edited the solution above to include device_name. This fixes the problem for python train.py task=Ant headless=True sim_device=cpu rl_device=cpu test=True.

This doesn't fix the other issue, though. I believe it comes from

        self.scaler = torch.cuda.amp.GradScaler(enabled=self.mixed_precision)

in rl_games/common/a2c_common.py, which would need more work to fix (a possible workaround is sketched below).
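For anyone hitting the CPU training crash: a possible workaround (not part of this PR) is to keep the GradScaler disabled when the device is not CUDA, which should also be what setting mixed_precision: False in the *PPO.yaml effectively does. A minimal standalone sketch of the behaviour:

import torch

# The AssertionError above comes from GradScaler: when enabled, scale() asserts
# the loss lives on CUDA. With enabled=False it is a pass-through, so CPU
# training can proceed. Standalone sketch, not the rl_games code itself:
device = torch.device('cpu')
mixed_precision = True                       # e.g. params.config.mixed_precision

scaler = torch.cuda.amp.GradScaler(enabled=mixed_precision and device.type == 'cuda')

loss = torch.nn.functional.mse_loss(
    torch.randn(4, requires_grad=True), torch.zeros(4)
)
scaler.scale(loss).backward()                # no assertion: scaler is disabled on CPU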
