This repository has been archived by the owner on Apr 9, 2024. It is now read-only.

Out of Memory (OOM) error when running the code #21

Open
rgiraud opened this issue Dec 1, 2023 · 1 comment

rgiraud commented Dec 1, 2023

Hello,
Thank you for this work and for sharing the code.
Unfortunately, I am having trouble using it.

First, I installed the required packages with:
pip3 install -r requirements.txt
But when running the code, I encountered the following error from clu:

  File "/home/student/slot-attention-video-main/savi/lib/input_pipeline.py", line 105, in create_datasets
    train_split = deterministic_data.get_read_instruction_for_host(
TypeError: get_read_instruction_for_host() got an unexpected keyword argument 'dataset_info'

After upgrading clu to version 0.0.5, the error disappears and the code runs.

But after a short wait, I receive an Out of Memory error on my A5000 (24 GB) GPU (which is detected and used, as confirmed by nvidia-smi):

...
Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9680453632 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/student/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/student/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/student/slot-attention-video-main/savi/main.py", line 63, in <module>
    app.run(main)
  File "/home/student/env/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/student/env/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/student/slot-attention-video-main/savi/main.py", line 59, in main
    trainer.train_and_evaluate(FLAGS.config, FLAGS.workdir)
  File "/home/student/slot-attention-video-main/savi/lib/trainer.py", line 253, in train_and_evaluate
    opt, state_vars, rng, metrics_update, p_step = p_train_step(
RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.7 = (f32[9216,64,64,64]{2,1,3,0}, u8[0]{0}) custom-call(f32[9216,64,64,64]{2,1,3,0} %bitcast.516, f32[5,5,64,64]{1,0,2,3} %copy.64), window={size=5x5 pad=2_2x2_2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_name="pmap(train_step)/jit(main)/conv_general_dilated[window_strides=(1, 1) padding=((2, 2), (2, 2)) lhs_dilation=(1, 1) rhs_dilation=(1, 1) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(9216, 64, 64, 64) rhs_shape=(5, 5, 64, 64) precision=None preferred_element_type=None]" source_file="/home/student/env/lib/python3.9/site-packages/flax/linen/linear.py" source_line=282}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9680453632 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.

Some interesting lines among the warning messages:
Removing feature ('frames',) because ragged tensors are not support in JAX.
...

jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.7 = ...
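As a back-of-the-envelope check (just my own arithmetic from the shapes printed in the error above, not from the code), the input to the failing convolution alone is already close to the failed allocation:

# Rough check, based only on the shapes in the error message: the input to the
# failing convolution is f32[9216, 64, 64, 64].
batch, h, w, c = 9216, 64, 64, 64
bytes_per_f32 = 4
activation_bytes = batch * h * w * c * bytes_per_f32
print(activation_bytes)  # 9663676416 bytes, i.e. exactly 9.0 GiB
# Very close to the 9680453632 bytes XLA tried to allocate, so a single
# activation of this size already takes a large chunk of the 24 GB card.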

Everything seems to be installed correctly, and I had no issues with libcudnn.
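
In case it is relevant, this is the kind of workaround I could try to reduce memory pressure before importing jax (a minimal sketch; XLA_PYTHON_CLIENT_PREALLOCATE and XLA_PYTHON_CLIENT_MEM_FRACTION are JAX's standard GPU-memory environment variables, and the XLA_FLAGS value is the one suggested in the error message, so none of this is specific to this repository):

# Minimal sketch (not from the SAVi codebase): environment variables that could
# be set before importing jax to reduce GPU memory pressure.
import os

# Allocate GPU memory on demand instead of preallocating most of it up front...
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
# ...or, alternatively, cap the preallocated fraction (only applies when
# preallocation stays enabled).
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.8"

# Flag suggested in the error message to allow a fallback conv algorithm.
os.environ["XLA_FLAGS"] = "--xla_gpu_strict_conv_algorithm_picker=false"

import jax
print(jax.devices())  # confirm the A5000 is still visible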

Would you have any information that might help us debug this?

Thanks in advance!


tkipf commented Dec 5, 2023

That's curious; thanks for sharing the detailed error message and background!

I haven't had a chance to look into this in detail yet, but it is indeed strange that the ragged tensor makes its way into JAX (it should only be part of the TF data pipeline). Could you print the shapes of the batch tensor (i.e. all tensors in the batch)?
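
Something along these lines should be enough (a quick sketch; "batch" is just a placeholder for whatever the input pipeline yields in the training loop, and the actual variable name in trainer.py may differ):

import jax

# Quick sketch: print the shape and dtype of every array in a batch.
# "batch" is a placeholder for whatever the input pipeline yields in the
# training loop; the actual variable name in trainer.py may differ.
def print_batch_shapes(batch):
  shapes = jax.tree_util.tree_map(lambda x: (x.shape, x.dtype), batch)
  print(shapes)

Seeing those shapes would also tell us whether the large leading dimension (9216) in the failing convolution comes directly from the batch.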
