This repository has been archived by the owner on Apr 9, 2024. It is now read-only.
Hello,
Thank you for this work and sharing the code.
Unfortunately, I'm having trouble using it.
First, I installed the required packages with: pip3 install -r requirements.txt
But when running the code, I encountered the following error from clu:
File "/home/student/slot-attention-video-main/savi/lib/input_pipeline.py", line 105, in create_datasets
train_split = deterministic_data.get_read_instruction_for_host(
TypeError: get_read_instruction_for_host() got an unexpected keyword argument 'dataset_info'
When upgrading clu to version 0.0.5, the error disappears, and the code runs.
But after a short wait I get an out-of-memory error on my A5000 (24 GB) GPU (which is detected and used, as confirmed by nvidia-smi):
...
Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9680453632 bytes.
To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false. Please also file a bug for the root cause of failing autotuning.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/student/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/student/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/student/slot-attention-video-main/savi/main.py", line 63, in <module>
app.run(main)
File "/home/student/env/lib/python3.9/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/student/env/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/student/slot-attention-video-main/savi/main.py", line 59, in main
trainer.train_and_evaluate(FLAGS.config, FLAGS.workdir)
File "/home/student/slot-attention-video-main/savi/lib/trainer.py", line 253, in train_and_evaluate
opt, state_vars, rng, metrics_update, p_step = p_train_step(
RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.7 = (f32[9216,64,64,64]{2,1,3,0}, u8[0]{0}) custom-call(f32[9216,64,64,64]{2,1,3,0} %bitcast.516, f32[5,5,64,64]{1,0,2,3} %copy.64), window={size=5x5 pad=2_2x2_2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_name="pmap(train_step)/jit(main)/conv_general_dilated[window_strides=(1, 1) padding=((2, 2), (2, 2)) lhs_dilation=(1, 1) rhs_dilation=(1, 1) dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2)) feature_group_count=1 batch_group_count=1 lhs_shape=(9216, 64, 64, 64) rhs_shape=(5, 5, 64, 64) precision=None preferred_element_type=None]" source_file="/home/student/env/lib/python3.9/site-packages/flax/linen/linear.py" source_line=282}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"
Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 9680453632 bytes.
To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false. Please also file a bug for the root cause of failing autotuning.
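For reference, a minimal sketch of how one might apply the flag suggested in the error message before relaunching training (the second variable is an additional assumption on my part about JAX's GPU memory preallocation, not something the error message asks for):

```shell
# Let XLA fall back to a (possibly slower) conv algorithm instead of
# failing autotuning, as the error message suggests:
export XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false

# Optional: stop JAX from preallocating most of the GPU memory up front,
# so allocation failures reflect actual memory use more closely:
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# ...then relaunch training as before.
```

Note this only works around the autotuning failure; if the underlying problem is a genuinely oversized allocation, reducing the batch size in the config may still be necessary.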
Some interesting lines among the warning messages:
Removing feature ('frames',) because ragged tensors are not support in JAX.
...
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.7 = ...
Although everything seems to be installed correctly and I had no issues with libcudnn...
Do you have any information that might help us debug this?
Thanks in advance!
That's curious, thanks for sharing the detailed error message & background!
I haven't had the chance to look into this in detail yet, but it's indeed strange that the ragged tensor makes its way into JAX (it should only be part of the TF data pipeline). Could you print the shapes of the batch tensors (i.e. all tensors in the batch)?
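For that shape check, here is a minimal sketch (the key names below are made up for illustration; the real ones come from SAVi's input pipeline, and inside the training step one could equivalently use jax.tree_util.tree_map(jnp.shape, batch)):

```python
import numpy as np

def print_batch_shapes(batch, prefix=""):
    """Recursively print and collect the shape of every array in a nested batch dict."""
    shapes = {}
    if isinstance(batch, dict):
        for key, value in batch.items():
            shapes.update(print_batch_shapes(value, prefix + key + "/"))
    else:
        name = prefix.rstrip("/") or "<root>"
        shapes[name] = np.shape(batch)
        print(name, np.shape(batch))
    return shapes

# Hypothetical batch layout, for illustration only:
batch = {
    "video": np.zeros((8, 6, 64, 64, 3), dtype=np.float32),
    "padding_mask": np.ones((8, 6), dtype=np.int32),
}
print_batch_shapes(batch)
```

Printing this right before the p_train_step call should show whether an unexpectedly large (or raggedly padded) tensor is reaching the model.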