
The beta parameter has no effect in GRPO #4112


Open
tomato996 opened this issue May 7, 2025 · 1 comment
Labels: needs more info

Comments

@tomato996

With the beta parameter set to 0.0 in the GRPO training arguments, the loss should theoretically be 0, but in practice it is not.
```shell
MAX_PIXELS=602112 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=2 \
swift rlhf \
    --rlhf_type grpo \
    --model /data/szy/huggingface_cache/models/Qwen2.5-VL-7b \
    --external_plugins /data/szy/llib/ms-swift-main/examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_math_format external_qa_reward \
    --use_vllm true \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset /data/szy/llib/benchmark/ssg-qa-medium/train_data_anatomy_sgg_qa_medium_2.json \
    --max_completion_length 512 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-6 \
    --eval_steps 1000 \
    --save_steps 10 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --output_dir output \
    --warmup_ratio 0 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 1.0 \
    --top_p 1.0 \
    --top_k 50 \
    --async_generate true \
    --dynamic_sample true \
    --epsilon_high 0.28 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero3 \
    --log_completions true \
    --num_iterations 1 \
    --beta 0 \
    --num_infer_workers 2 \
    --report_to wandb
```

[screenshot attached]
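For reference, the standard per-token GRPO loss (a sketch following the DeepSeekMath formulation; ms-swift's exact implementation may differ in detail) is

$$
\mathcal{L}_t(\theta) = -\min\!\big(r_t(\theta)\,\hat{A},\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon_{\mathrm{high}})\,\hat{A}\big) + \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big],
\qquad
r_t(\theta) = \frac{\pi_\theta(o_t \mid q,\, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q,\, o_{<t})}.
$$

Setting beta = 0 removes only the KL penalty; the clipped policy term remains, and whether it averages to zero depends on how the per-token losses are aggregated.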

@hjh0119 (Collaborator)

hjh0119 commented May 7, 2025

What is the version of ms-swift?
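If you are unsure, one quick way to check (using the standard library's package metadata, assuming ms-swift was installed via pip):

```python
# Print the installed ms-swift version from the pip package metadata.
from importlib.metadata import version

print(version("ms-swift"))
```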

  • In swift 3.3, the default loss normalization is at the token level, which means longer completions receive greater weight.
  • In swift 3.4, the default loss normalization is at the sequence level, which means the loss is expected to approach zero when beta equals 0 (see the sketch after this list).
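A minimal sketch of why the normalization matters (illustrative only, not ms-swift code; it assumes the on-policy case with num_iterations=1, where the importance ratio is 1 on the first gradient step, so with beta=0 the per-token loss reduces to the negative advantage):

```python
# Compare token-level vs. sequence-level loss aggregation in GRPO with beta=0.
import torch

torch.manual_seed(0)

num_generations = 8
rewards = torch.randn(num_generations)                 # one reward per completion in the group
lengths = torch.randint(10, 512, (num_generations,))   # completion lengths in tokens

# GRPO group-normalized advantages: zero mean within the group by construction.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# With ratio == 1 and beta == 0, the per-token loss is just -advantage,
# broadcast over every token of the corresponding completion.
token_losses = [(-a).expand(int(l)) for a, l in zip(advantages, lengths)]

# Token-level normalization (swift 3.3 default): pool all tokens, so
# longer completions contribute more terms and dominate the mean.
token_level = torch.cat(token_losses).mean()

# Sequence-level normalization (swift 3.4 default): average per completion
# first, so the loss equals -mean(advantages) == ~0 up to floating point.
seq_level = torch.stack([t.mean() for t in token_losses]).mean()

print(f"token-level loss:    {token_level.item():+.6f}")  # generally nonzero
print(f"sequence-level loss: {seq_level.item():+.6f}")    # ~0
```

Because the group-normalized advantages have zero mean within each group, averaging per-completion means yields ~0, while pooling all tokens weights each advantage by its completion length and generally does not.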

@hjh0119 added the needs more info label May 8, 2025