[BUG] CUDA autocast bug #1703

Open

FrankTianTT opened this issue Nov 19, 2023 · 0 comments
FrankTianTT commented Nov 19, 2023

Describe the bug

When using `autocast` in the dreamer example, there is a RuntimeError:

RuntimeError: masked_scatter_: expected self and source to have same dtypes but gotHalf and Float

Unfortunately, this seems to be a bug in PyTorch itself (pytorch/pytorch#81876).
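For illustration, the underlying dtype check can be triggered directly. A minimal sketch (assuming a CUDA device; this is not taken from the dreamer example) that produces the same message:

```python
import torch

# Minimal sketch (assumed, not from dreamer.py): masked_scatter_ requires the
# destination and source to share a dtype, which is what the autocast-ed
# backward pass ends up violating (Half destination vs. Float source).
if torch.cuda.is_available():
    dest = torch.zeros(4, device="cuda", dtype=torch.float16)
    mask = torch.tensor([True, False, True, False], device="cuda")
    src = torch.ones(4, device="cuda", dtype=torch.float32)
    # RuntimeError: masked_scatter_: expected self and source to have same
    # dtypes but got Half and Float
    dest.masked_scatter_(mask, src)
```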

To Reproduce

Run the dreamer example.
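For context, the failing line in the traceback below sits in the standard CUDA autocast + GradScaler training step. A simplified, self-contained sketch of that pattern (not the literal dreamer.py code; the model and optimizer are stand-ins):

```python
import torch

# Simplified sketch of the AMP pattern the traceback points at (assumes a CUDA
# device; Linear/Adam are placeholders, not the dreamer world-model or actor).
model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler2 = torch.cuda.amp.GradScaler()

with torch.autocast("cuda", dtype=torch.float16):
    loss = model(torch.randn(4, 8, device="cuda")).mean()

scaler2.scale(loss).backward()  # in dreamer.py this backward hits masked_scatter_
scaler2.step(optimizer)
scaler2.update()
```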

Full output:

collector: MultiaSyncDataCollector()                                                                                                                               
init seed: 42, final seed: 971637020                                                                                                                               
  7%|████████▌                                                                                                            | 36800/500000 [20:23<4:48:42, 26.74it/s]
Error executing job with overrides: []                                                                                                                             
Traceback (most recent call last):                                                                                                                                 
  File "/home/frank/Projects/rl_dev/examples/dreamer/dreamer.py", line 359, in main                                                                                
    scaler2.scale(actor_loss_td["loss_actor"]).backward()                                                                                                          
  File "/home/frank/anaconda3/envs/rl_dev/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward                                                     
    torch.autograd.backward(                                                                                                                                       
  File "/home/frank/anaconda3/envs/rl_dev/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward                                           
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                                 
RuntimeError: masked_scatter_: expected self and source to have same dtypes but gotHalf and Float 

System info

Describe the characteristic of your environment:

  • Describe how the library was installed (pip, source, ...)
  • Python version
  • Versions of any other relevant libraries
import torchrl, numpy, sys
print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)

Output:

0.2.1 1.26.1 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0] linux

Reason and Possible fixes

Maybe we should disable autocast in the dreamer example until this bug is fixed upstream in PyTorch? A sketch of such a workaround is below.
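A minimal sketch of what gating autocast could look like, assuming a hypothetical `use_autocast` flag rather than any existing torchrl config key:

```python
import contextlib
import torch

# Hedged sketch of the proposed workaround: gate autocast behind a flag so the
# example can fall back to full precision until the upstream bug is resolved.
use_autocast = False  # hypothetical flag, not an existing config option

autocast_ctx = (
    torch.autocast("cuda", dtype=torch.float16)
    if use_autocast
    else contextlib.nullcontext()
)

with autocast_ctx:
    # loss computation would go here; with the flag off everything stays float32
    pass
```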

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
FrankTianTT added the bug (Something isn't working) label on Nov 19, 2023