
The beta parameter has no effect in GRPO #4112


Open
tomato996 opened this issue May 7, 2025 · 1 comment
Labels: needs more info

Comments

@tomato996

With the beta parameter set to 0.0 in the GRPO training arguments, the loss should theoretically be 0, but in practice it is not.
```shell
MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=2 \
swift rlhf \
    --rlhf_type grpo \
    --model /data/szy/huggingface_cache/models/Qwen2.5-VL-7b \
    --external_plugins /data/szy/llib/ms-swift-main/examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_math_format external_qa_reward \
    --use_vllm true \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset /data/szy/llib/benchmark/ssg-qa-medium/train_data_anatomy_sgg_qa_medium_2.json \
    --max_completion_length 512 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-6 \
    --eval_steps 1000 \
    --save_steps 10 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --output_dir output \
    --warmup_ratio 0 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 1.0 \
    --top_p 1.0 \
    --top_k 50 \
    --async_generate true \
    --dynamic_sample true \
    --epsilon_high 0.28 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero3 \
    --log_completions true \
    --num_iterations 1 \
    --beta 0 \
    --num_infer_workers 2 \
    --report_to wandb
```

[screenshot attached]
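For reference, the standard per-token GRPO loss (a sketch following the DeepSeekMath formulation; ms-swift's exact implementation may differ in detail) is

$$
\mathcal{L}_t(\theta) = -\min\!\big(r_t(\theta)\,\hat{A},\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon_{\mathrm{high}})\,\hat{A}\big) + \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big],
\qquad
r_t(\theta) = \frac{\pi_\theta(o_t \mid q,\, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q,\, o_{<t})}.
$$

Setting beta = 0 removes only the KL penalty; the clipped policy term remains, and whether it averages to zero depends on how the per-token losses are aggregated.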

@hjh0119 (Collaborator)

hjh0119 commented May 7, 2025

What is the version of ms-swift?
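If you are unsure, one quick way to check (using the standard library's package metadata, assuming ms-swift was installed via pip):

```python
# Print the installed ms-swift version from the pip package metadata.
from importlib.metadata import version

print(version("ms-swift"))
```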

  • In swift 3.3, the default loss normalization is at the token level, which means longer completions receive greater weight.
  • In swift 3.4, the default loss normalization is at the sequence level, which means the loss is expected to approach zero when beta equals 0 (see the sketch after this list).
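A minimal sketch of why the normalization matters (illustrative only, not ms-swift code; it assumes the on-policy case with num_iterations=1, where the importance ratio is 1 on the first gradient step, so with beta=0 the per-token loss reduces to the negative advantage):

```python
# Compare token-level vs. sequence-level loss aggregation in GRPO with beta=0.
import torch

torch.manual_seed(0)

num_generations = 8
rewards = torch.randn(num_generations)                 # one reward per completion in the group
lengths = torch.randint(10, 512, (num_generations,))   # completion lengths in tokens

# GRPO group-normalized advantages: zero mean within the group by construction.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# With ratio == 1 and beta == 0, the per-token loss is just -advantage,
# broadcast over every token of the corresponding completion.
token_losses = [(-a).expand(int(l)) for a, l in zip(advantages, lengths)]

# Token-level normalization (swift 3.3 default): pool all tokens, so
# longer completions contribute more terms and dominate the mean.
token_level = torch.cat(token_losses).mean()

# Sequence-level normalization (swift 3.4 default): average per completion
# first, so the loss equals -mean(advantages) == ~0 up to floating point.
seq_level = torch.stack([t.mean() for t in token_losses]).mean()

print(f"token-level loss:    {token_level.item():+.6f}")  # generally nonzero
print(f"sequence-level loss: {seq_level.item():+.6f}")    # ~0
```

Because the group-normalized advantages have zero mean within each group, averaging per-completion means yields ~0, while pooling all tokens weights each advantage by its completion length and generally does not.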

@hjh0119 added the needs more info label May 8, 2025