
[BUG] CUDA autocast bug #1703

Open

@FrankTianTT

Description

Describe the bug

When using `autocast` in the dreamer example, there is a RuntimeError:

RuntimeError: masked_scatter_: expected self and source to have same dtypes but gotHalf and Float

Unfortunately, this seems to be a bug in PyTorch itself (pytorch/pytorch#81876).

To Reproduce

Run the dreamer example.

Whole output:

collector: MultiaSyncDataCollector()                                                                                                                               
init seed: 42, final seed: 971637020                                                                                                                               
  7%|████████▌                                                                                                            | 36800/500000 [20:23<4:48:42, 26.74it/s]
Error executing job with overrides: []                                                                                                                             
Traceback (most recent call last):                                                                                                                                 
  File "/home/frank/Projects/rl_dev/examples/dreamer/dreamer.py", line 359, in main                                                                                
    scaler2.scale(actor_loss_td["loss_actor"]).backward()                                                                                                          
  File "/home/frank/anaconda3/envs/rl_dev/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward                                                     
    torch.autograd.backward(                                                                                                                                       
  File "/home/frank/anaconda3/envs/rl_dev/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward                                           
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                                 
RuntimeError: masked_scatter_: expected self and source to have same dtypes but gotHalf and Float 
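The failure mode can be shown outside the dreamer example. The sketch below is an assumption based on pytorch/pytorch#81876, not the actual dreamer code: under autocast some tensors end up in Half while others stay Float, and `masked_scatter_` rejects the mismatch instead of promoting.

```python
import torch

# Under autocast, intermediates may be Half while fp32 buffers stay Float.
target = torch.zeros(4, dtype=torch.half)   # e.g. produced in an autocast region
source = torch.ones(4, dtype=torch.float)   # e.g. a buffer kept in fp32
mask = torch.tensor([True, False, True, False])

try:
    # masked_scatter_ requires self and source to share a dtype.
    target.masked_scatter_(mask, source)
except RuntimeError as e:
    print(e)  # "masked_scatter_: expected self and source to have same dtypes ..."
```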

System info

Describe the characteristics of your environment:

  • Describe how the library was installed (pip, source, ...)
  • Python version
  • Versions of any other relevant libraries
import torchrl, numpy, sys
print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
0.2.1 1.26.1 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0] linux

Reason and Possible fixes

Maybe we should disable autocast in the dreamer example until this bug is fixed upstream in PyTorch?
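A local, less drastic variant of that workaround could look like the sketch below. The helper name `masked_update` is hypothetical, and `device_type="cpu"` is used only so the sketch runs without a GPU (the dreamer example would use `"cuda"`): the offending region is wrapped in an autocast-disabled context, and the source is cast to the target's dtype so `masked_scatter` sees matching dtypes.

```python
import torch

def masked_update(target, mask, source):
    # Hypothetical helper: locally disable autocast around the masked update.
    with torch.autocast(device_type="cpu", enabled=False):
        # Cast source to target's dtype as a belt-and-braces fix for the
        # upstream masked_scatter dtype check (out-of-place variant here).
        return target.masked_scatter(mask, source.to(target.dtype))

t = torch.zeros(4, dtype=torch.half)
s = torch.ones(4, dtype=torch.float)
m = torch.tensor([True, False, True, False])
out = masked_update(t, m, s)
print(out.dtype)  # torch.float16
```

Disabling autocast only around the masked update keeps the mixed-precision savings for the rest of the loss computation, rather than turning it off globally.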

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
