Add dynamo cache size limit option #1619
Conversation
Force-pushed from c5e6203 to 25cb002
Could you add a note in the PR about an example model that benefits from setting a larger cache size (and perhaps a sample command line showing what cache size was used for this model)?
Thanks
PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 PT_HPU_MAX_COMPOUND_OP_SIZE=512 PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 numactl --cpunodebind=1 --membind=1 python3 /root/repos/optimum-habana-fork/examples/gaudi_spawn.py --world_size 8 --use_deepspeed /root/repos/optimum-habana-fork/examples/summarization/run_summarization.py --deepspeed /root/repos/optimum-habana-fork/examples/summarization/ds_flan_t5_z3_config_bf16.json --model_name_or_path /software/data/pytorch/huggingface/flan-t5/modelsgoogleflan-t5-xxl --do_train --source_prefix '"summarize:"' --dataset_name cnn_dailymail --dataset_config '"3.0.0"' --output_dir /tmp/tst-summarization --per_device_train_batch_size 22 --per_device_eval_batch_size 22 --learning_rate 0.0001 --overwrite_output_dir --predict_with_generate --use_habana --use_lazy_mode False --gaudi_config_name Habana/t5 --ignore_pad_token_for_loss False --pad_to_max_length --generation_max_length 129 --save_strategy epoch --throughput_warmup_steps 10 --gradient_checkpointing --adam_epsilon 1e-08 --max_eval_samples 880 --dataloader_num_workers 4 --num_train_epochs 1 --max_steps 400 --torch_compile_backend hpu_backend --torch_compile use_regional_compilation --compile_dynamic False --cache_size_limit 128

We set cache_size_limit to 128 here; without this option, the run hits an OOM error.
@chaojun-zhang There is a merge conflict to solve
Force-pushed from 25cb002 to 37b0262
Updated
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM!
What does this PR do?
The default value of torch._dynamo.config.cache_size_limit is 8.
This default can be too small and often leads to out-of-memory (OOM) errors when Dynamo compiles certain models. This PR adds an option to optimum-habana to increase this value and avoid those OOM errors.
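Below is a minimal sketch of what such an option boils down to. `torch._dynamo.config.cache_size_limit` is the standard PyTorch config attribute; the helper name `apply_dynamo_cache_size_limit` and the way it is called before compilation are illustrative assumptions, not the exact optimum-habana wiring.

```python
# Minimal sketch, assuming a cache_size_limit option is plumbed through to Dynamo.
# torch._dynamo.config.cache_size_limit is a real PyTorch config knob (default 8);
# the helper below and its name are hypothetical.
import torch
import torch._dynamo


def apply_dynamo_cache_size_limit(cache_size_limit=None):
    """Raise Dynamo's recompilation cache limit if a value was requested."""
    if cache_size_limit is not None:
        torch._dynamo.config.cache_size_limit = cache_size_limit


# Mirrors the `--cache_size_limit 128` flag from the command line above.
apply_dynamo_cache_size_limit(128)

model = torch.nn.Linear(16, 16)
# On Gaudi the backend would be "hpu_backend"; "inductor" keeps this sketch runnable elsewhere.
compiled = torch.compile(model, backend="inductor")
```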
Fixes # (issue)
Before submitting