This repository has been archived by the owner on Apr 9, 2024. It is now read-only.

Out of Memory (OOM) error when running the code #21

Open
rgiraud opened this issue Dec 1, 2023 · 1 comment

rgiraud commented Dec 1, 2023

Hello,
Thank you for this work and for sharing the code.
Unfortunately, I am having trouble using it.

First, I installed the required packages with:
pip3 install -r requirements.txt
But when running the code, I encountered the following error from clu:

  File "/home/student/slot-attention-video-main/savi/lib/input_pipeline.py", line 105, in create_datasets
    train_split = deterministic_data.get_read_instruction_for_host(
TypeError: get_read_instruction_for_host() got an unexpected keyword argument 'dataset_info'

After upgrading clu to version 0.0.5, the error disappears and the code runs.

But after a short wait, I receive an Out of Memory error on my A5000 (24 GB) GPU (which is detected and used, as confirmed by nvidia-smi):

...
Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9680453632 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/student/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/student/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/student/slot-attention-video-main/savi/main.py", line 63, in <module>
    app.run(main)
  File "/home/student/env/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/student/env/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/student/slot-attention-video-main/savi/main.py", line 59, in main
    trainer.train_and_evaluate(FLAGS.config, FLAGS.workdir)
  File "/home/student/slot-attention-video-main/savi/lib/trainer.py", line 253, in train_and_evaluate
    opt, state_vars, rng, metrics_update, p_step = p_train_step(
RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.7 = (f32[9216,64,64,64]{2,1,3,0}, u8[0]{0}) custom-call(f32[9216,64,64,64]{2,1,3,0} %bitcast.516, f32[5,5,64,64]{1,0,2,3} %copy.64), window={size=5x5 pad=2_2x2_2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_name="pmap(train_step)/jit(main)/conv_general_dilated[window_strides=(1, 1) padding=((2, 2), (2, 2)) lhs_dilation=(1, 1) rhs_dilation=(1, 1) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(9216, 64, 64, 64) rhs_shape=(5, 5, 64, 64) precision=None preferred_element_type=None]" source_file="/home/student/env/lib/python3.9/site-packages/flax/linen/linear.py" source_line=282}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"

Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9680453632 bytes.

To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false.  Please also file a bug for the root cause of failing autotuning.

Some interesting lines among the warning messages:
Removing feature ('frames',) because ragged tensors are not support in JAX.
...

jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.7 = ...
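As a back-of-the-envelope check (just my own arithmetic from the shapes printed in the error above, not from the code), the input to the failing convolution alone is already close to the failed allocation:

# Rough check, based only on the shapes in the error message: the input to the
# failing convolution is f32[9216, 64, 64, 64].
batch, h, w, c = 9216, 64, 64, 64
bytes_per_f32 = 4
activation_bytes = batch * h * w * c * bytes_per_f32
print(activation_bytes)  # 9663676416 bytes, i.e. exactly 9.0 GiB
# Very close to the 9680453632 bytes XLA tried to allocate, so a single
# activation of this size already takes a large chunk of the 24 GB card.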

Everything seems to be installed correctly, and I had no issues with libcudnn.
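
In case it is relevant, this is the kind of workaround I could try to reduce memory pressure before importing jax (a minimal sketch; XLA_PYTHON_CLIENT_PREALLOCATE and XLA_PYTHON_CLIENT_MEM_FRACTION are JAX's standard GPU-memory environment variables, and the XLA_FLAGS value is the one suggested in the error message, so none of this is specific to this repository):

# Minimal sketch (not from the SAVi codebase): environment variables that could
# be set before importing jax to reduce GPU memory pressure.
import os

# Allocate GPU memory on demand instead of preallocating most of it up front...
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
# ...or, alternatively, cap the preallocated fraction (only applies when
# preallocation stays enabled).
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.8"

# Flag suggested in the error message to allow a fallback conv algorithm.
os.environ["XLA_FLAGS"] = "--xla_gpu_strict_conv_algorithm_picker=false"

import jax
print(jax.devices())  # confirm the A5000 is still visible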

Would you have any information that might help us debug this?

Thanks in advance!


tkipf commented Dec 5, 2023

That's curious; thanks for sharing the detailed error message and background!

I haven't had a chance to look into this in detail yet, but it is indeed strange that the ragged tensor makes its way into JAX (it should only be part of the TF data pipeline). Could you print the shapes of the batch tensor (i.e. all tensors in the batch)?
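
Something along these lines should be enough (a quick sketch; "batch" is just a placeholder for whatever the input pipeline yields in the training loop, and the actual variable name in trainer.py may differ):

import jax

# Quick sketch: print the shape and dtype of every array in a batch.
# "batch" is a placeholder for whatever the input pipeline yields in the
# training loop; the actual variable name in trainer.py may differ.
def print_batch_shapes(batch):
  shapes = jax.tree_util.tree_map(lambda x: (x.shape, x.dtype), batch)
  print(shapes)

Seeing those shapes would also tell us whether the large leading dimension (9216) in the failing convolution comes directly from the batch.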
