
Add dynamo cache size limit option #1619

Merged
merged 1 commit into huggingface:main on Feb 5, 2025

Conversation

chaojun-zhang
Contributor

What does this PR do?

The default value of the torch._dynamo.config.cache_size_limit option is 8.

This default can be too small and often leads to out-of-memory (OOM) errors when dynamo compiles certain models. We need an option in optimum-habana to increase this value and prevent OOM (a sketch of the idea follows).
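
For context, a minimal sketch of how such an option can be wired up. This is not the PR's actual diff; the helper name `apply_dynamo_cache_size_limit` and the way the value is plumbed through are assumptions for illustration.

```python
# Minimal sketch, assuming an optional `cache_size_limit` setting exposed by
# optimum-habana (hypothetical helper; the PR's actual wiring may differ).
import torch._dynamo


def apply_dynamo_cache_size_limit(cache_size_limit=None):
    # torch.compile keeps a per-function cache of compiled graphs; once it
    # grows past cache_size_limit, dynamo stops recompiling that function
    # and falls back to eager. The default was 8 when this PR was filed.
    if cache_size_limit is not None:
        torch._dynamo.config.cache_size_limit = cache_size_limit
```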

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@chaojun-zhang changed the title from "Add dynamo cache size limit optio" to "Add dynamo cache size limit option" on Dec 17, 2024
Collaborator

@ssarkar2 left a comment


Could you add a note in the PR about an example model that benefits from setting a larger cache size (and perhaps a sample command line showing what cache size was set for this model)?

Thanks

@chaojun-zhang
Contributor Author

> Could you add a note in the PR about an example model that benefits from setting a larger cache size (and perhaps a sample command line showing what cache size was set for this model)?
>
> Thanks

PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 PT_HPU_MAX_COMPOUND_OP_SIZE=512 \
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 \
numactl --cpunodebind=1 --membind=1 \
python3 /root/repos/optimum-habana-fork/examples/gaudi_spawn.py \
    --world_size 8 --use_deepspeed \
    /root/repos/optimum-habana-fork/examples/summarization/run_summarization.py \
    --deepspeed /root/repos/optimum-habana-fork/examples/summarization/ds_flan_t5_z3_config_bf16.json \
    --model_name_or_path /software/data/pytorch/huggingface/flan-t5/modelsgoogleflan-t5-xxl \
    --do_train \
    --source_prefix '"summarize:"' \
    --dataset_name cnn_dailymail --dataset_config '"3.0.0"' \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size 22 --per_device_eval_batch_size 22 \
    --learning_rate 0.0001 \
    --overwrite_output_dir \
    --predict_with_generate \
    --use_habana --use_lazy_mode False \
    --gaudi_config_name Habana/t5 \
    --ignore_pad_token_for_loss False --pad_to_max_length \
    --generation_max_length 129 \
    --save_strategy epoch \
    --throughput_warmup_steps 10 \
    --gradient_checkpointing \
    --adam_epsilon 1e-08 \
    --max_eval_samples 880 \
    --dataloader_num_workers 4 \
    --num_train_epochs 1 --max_steps 400 \
    --torch_compile_backend hpu_backend --torch_compile use_regional_compilation \
    --compile_dynamic False \
    --cache_size_limit 128

We set cache_size_limit to 128 here; without --cache_size_limit, this run hit OOM.
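
For anyone reproducing this, a small standalone sketch (not part of the PR) of how to watch dynamo recompiles and the effect of the limit; `torch._logging.set_logs(recompiles=True)` is assumed available (PyTorch >= 2.1):

```python
# Hedged sketch: make dynamo recompilations visible so that hitting
# torch._dynamo.config.cache_size_limit shows up in the logs.
import torch
import torch._dynamo
import torch._logging

torch._logging.set_logs(recompiles=True)
torch._dynamo.config.cache_size_limit = 128  # same effect as --cache_size_limit 128


@torch.compile(dynamic=False)  # specialize on shapes, like --compile_dynamic False
def double(x):
    return x * 2


# Each new input shape forces a recompile; once the per-function count passes
# cache_size_limit, dynamo stops compiling this frame and runs it eagerly.
for n in range(1, 16):
    double(torch.randn(n))
```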

@libinta added the run-test label (Run CI for PRs from external contributors) on Jan 22, 2025
@regisss
Collaborator

regisss commented Jan 24, 2025

@chaojun-zhang There is a merge conflict to solve

@chaojun-zhang
Contributor Author

> @chaojun-zhang There is a merge conflict to solve

Updated

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@regisss left a comment


LGTM!

@regisss merged commit c49fbc3 into huggingface:main on Feb 5, 2025
4 checks passed