Reminder
System Info
I am doing long-context full SFT. I can run the fine-tuning itself with the settings listed under Reproduction below, but I hit OOM during the validation stage even though the evaluation batch size is already set to 1.
I have to run validation during SFT because of the specific task I am fine-tuning for.
Are there any ways or suggestions to solve this validation OOM problem?
Thanks in advance!
Reproduction
model
model_name_or_path:
method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
dataset
dataset:
template: qwen
cutoff_len: 120000
overwrite_cache: true
preprocessing_num_workers: 90
output
output_dir:
report_to: tensorboard
logging_dir:
logging_steps: 1
save_steps: 190
plot_loss: true
overwrite_output_dir: true
train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-6
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
max_grad_norm: 1.0
bf16: true
gradient_checkpointing: true
disable_gradient_checkpointing: false
enable_liger_kernel: true
use_unsloth_gc: true
flash_attn: fa2
torch_empty_cache_steps: 10
ddp_timeout: 180000000
save_only_model: true
eval
val_size: 0.02
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 24
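At this context length, one plausible (but unverified) contributor to eval OOM is the logits tensor: with the qwen template's roughly 150k-entry vocabulary, a single 120,000-token sequence produces on the order of 120,000 x 150,000 x 2 bytes ≈ 36 GB of bf16 logits, so per_device_eval_batch_size: 1 alone may not be enough. Below is a minimal sketch of extra eval settings that sometimes relieve this, assuming the plain HuggingFace TrainingArguments fields eval_accumulation_steps and prediction_loss_only are forwarded unchanged by LLaMA-Factory; whether they actually help depends on how the trainer gathers logits in this setup.

# eval — hypothetical adjustments, not verified on this exact setup
val_size: 0.02
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 24
eval_accumulation_steps: 1    # move accumulated eval outputs to CPU after every step instead of holding them all on GPU
prediction_loss_only: true    # compute only the eval loss; skip gathering full logits (assumes no token-accuracy metrics are needed)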
Others
No response
I encountered the same issue, and I was only able to resolve it by disabling evaluation during training; cancelling the eval step was the only way to avoid the OOM error.
For context, I am running full SFT on a 1.5B model. With a 24k context length and ZeRO-3 on 8x H100 80GB GPUs, training works fine, but on 4 nodes of 8x A100 40GB I run into OOM issues. I am not sure whether this is a bug in LLaMA-Factory or something else, but it might be worth looking into.
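For reference, the workaround described above corresponds to removing the validation split and turning off in-training evaluation in the YAML config, roughly as sketched here (the keys mirror the ones already used in the Reproduction section; evaluation can then be run separately after training):

# eval — disabled; run evaluation afterwards on a setup that fits the context length
val_size: 0.0          # do not carve a validation split out of the training data
eval_strategy: "no"    # no evaluation during training (quoted so YAML does not parse it as a boolean)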