
Add dynamo cache size limit option #1619

Merged
merged 1 commit into huggingface:main on Feb 5, 2025

Conversation

chaojun-zhang
Contributor

What does this PR do?

The default value of the torch._dynamo.config.cache_size_limit option is 8.

This default can be too small and often leads to out-of-memory (OOM) errors when dynamo compiles certain models. We need an option in optimum-habana to increase this value and prevent OOM (a sketch of the idea follows).
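
For context, a minimal sketch of how such an option can be wired up. This is not the PR's actual diff; the helper name `apply_dynamo_cache_size_limit` and the way the value is plumbed through are assumptions for illustration.

```python
# Minimal sketch, assuming an optional `cache_size_limit` setting exposed by
# optimum-habana (hypothetical helper; the PR's actual wiring may differ).
import torch._dynamo


def apply_dynamo_cache_size_limit(cache_size_limit=None):
    # torch.compile keeps a per-function cache of compiled graphs; once it
    # grows past cache_size_limit, dynamo stops recompiling that function
    # and falls back to eager. The default was 8 when this PR was filed.
    if cache_size_limit is not None:
        torch._dynamo.config.cache_size_limit = cache_size_limit
```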

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@chaojun-zhang changed the title from "Add dynamo cache size limit optio" to "Add dynamo cache size limit option" on Dec 17, 2024
Collaborator

@ssarkar2 left a comment


Could you add a note in the PR about an example model that benefits from setting a larger cache size (and perhaps a sample command line showing what cache size was set for this model)?

Thanks

@chaojun-zhang
Contributor Author

> Could you add a note in the PR about an example model that benefits from setting a larger cache size (and perhaps a sample command line showing what cache size was set for this model)?
>
> Thanks

PT_HPU_LOG_MOD_MASK=0 PT_HPU_LOG_TYPE_MASK=0 PT_HPU_MAX_COMPOUND_OP_SIZE=512 \
PT_HPU_LAZY_MODE=0 PT_ENABLE_INT64_SUPPORT=1 \
numactl --cpunodebind=1 --membind=1 \
python3 /root/repos/optimum-habana-fork/examples/gaudi_spawn.py \
    --world_size 8 --use_deepspeed \
    /root/repos/optimum-habana-fork/examples/summarization/run_summarization.py \
    --deepspeed /root/repos/optimum-habana-fork/examples/summarization/ds_flan_t5_z3_config_bf16.json \
    --model_name_or_path /software/data/pytorch/huggingface/flan-t5/modelsgoogleflan-t5-xxl \
    --do_train \
    --source_prefix '"summarize:"' \
    --dataset_name cnn_dailymail --dataset_config '"3.0.0"' \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size 22 --per_device_eval_batch_size 22 \
    --learning_rate 0.0001 \
    --overwrite_output_dir \
    --predict_with_generate \
    --use_habana --use_lazy_mode False \
    --gaudi_config_name Habana/t5 \
    --ignore_pad_token_for_loss False --pad_to_max_length \
    --generation_max_length 129 \
    --save_strategy epoch \
    --throughput_warmup_steps 10 \
    --gradient_checkpointing \
    --adam_epsilon 1e-08 \
    --max_eval_samples 880 \
    --dataloader_num_workers 4 \
    --num_train_epochs 1 --max_steps 400 \
    --torch_compile_backend hpu_backend --torch_compile use_regional_compilation \
    --compile_dynamic False \
    --cache_size_limit 128

We set cache_size_limit to 128 here; without --cache_size_limit, this run hit OOM.
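
For anyone reproducing this, a small standalone sketch (not part of the PR) of how to watch dynamo recompiles and the effect of the limit; `torch._logging.set_logs(recompiles=True)` is assumed available (PyTorch >= 2.1):

```python
# Hedged sketch: make dynamo recompilations visible so that hitting
# torch._dynamo.config.cache_size_limit shows up in the logs.
import torch
import torch._dynamo
import torch._logging

torch._logging.set_logs(recompiles=True)
torch._dynamo.config.cache_size_limit = 128  # same effect as --cache_size_limit 128


@torch.compile(dynamic=False)  # specialize on shapes, like --compile_dynamic False
def double(x):
    return x * 2


# Each new input shape forces a recompile; once the per-function count passes
# cache_size_limit, dynamo stops compiling this frame and runs it eagerly.
for n in range(1, 16):
    double(torch.randn(n))
```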

@libinta added the run-test label (Run CI for PRs from external contributors) on Jan 22, 2025
@regisss
Collaborator

regisss commented Jan 24, 2025

@chaojun-zhang There is a merge conflict to solve

@chaojun-zhang
Contributor Author

> @chaojun-zhang There is a merge conflict to solve

Updated

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@regisss left a comment


LGTM!

@regisss merged commit c49fbc3 into huggingface:main on Feb 5, 2025
4 checks passed