- [2025/04/22] 🔥🔥 Released our paper on arXiv: https://arxiv.org/abs/2504.15275 🔥🔥
- [2025/03/24] We re-implement our algorithm based on verl. ✨✨ Key features: (1) ~50 additional metrics to comprehensively monitor the training process and stability, (2) a custom wandb workspace to monitor ~20 important metrics, (3) curriculum learning. ✨✨
- [2025/02/22] We release the Notion blog, which details our algorithm, the difference between gamma-decay and min-form credit assignment, examples of reward hacking, and so on.
- [2025/02/09] We release the training and evaluation code, wandb logs, and checkpoints. Paper is on its way!
Recently, verifiable reward (VR)-based reinforcement fine-tuning (ReFT) has brought a huge boost in LLM reasoning. Prior work, in contrast, has struggled to make process reward models (PRMs) work for RL fine-tuning, so we wonder: how far can a PRM actually take us, and how does it stack up against VR-based methods in reasoning performance and training cost?
To answer these questions, we present PURE (Process-sUpervised Reinforcement lEarning). Using Qwen2.5-Math-7B as the base model, we train a PRM on the PRM800K dataset, then fine-tune another Qwen2.5-Math-7B model using only 8K MATH prompts, process rewards from the PRM, and optional verifiable rewards. For the RL algorithm, we use the PPO loss with an RLOO advantage estimator. We improve credit assignment by using a weighted sum of the process rewards that implements min-form, rather than gamma-decayed summation-form, credit assignment (see the blog and paper for details).
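As a rough schematic (the notation here is ours, chosen for illustration rather than taken verbatim from the paper): given process rewards $r_t, r_{t+1}, \dots, r_T$ for the remaining steps of a trajectory and a discount factor $\gamma$, the two schemes assign the return of step $t$ as

$$
R_t^{\text{sum}} = \sum_{j=t}^{T} \gamma^{\,j-t}\, r_j
\qquad \text{vs.} \qquad
R_t^{\min} = \min_{t \le j \le T} r_j .
$$

Under the min form, a single low-reward future step bounds the credit of every earlier step instead of being diluted by many high-reward steps.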
📊 The final model achieves pass@1 accuracies of 82.6% on MATH500, 82.5% on AMC, and 53.3% on average across 5 benchmarks, beating Qwen2.5-Math-7B-Instruct, PRIME, and SimpleRL while using either <1/50 of the RL data or 1/5 of the compute resources.
All results are reported as pass@1 accuracy (%).

| Model | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Base | 13.3 | 71.8 | 47.5 | 29.8 | 35.1 | 39.5 |
| Qwen2.5-Math-7B-Instruct | 16.7 | 83.2 | 52.5 | 37.5 | 41.3 | 46.2 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
| Qwen2.5-7B-PURE-PRM+VR* | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
| Qwen2.5-7B-PURE-PRM | 16.7 | 81.8 | 60.0 | 38.2 | 44.7 | 49.3 |
| Qwen2.5-7B-PURE-VR | 23.3 | 79.4 | 60.0 | 36.8 | 41.8 | 48.3 |
*The SOTA model was trained on 8K MATH problems, of which only ~800 provide ground-truth final answers that can be used to compute VRs.
Note: Eurus-2-7B-PRIME and Qwen2.5-7B-SimpleRL-Zero are also based on Qwen2.5-Math-7B.
We implement our algorithm on two frameworks, OpenRLHF and verl, in two separate branches. If you are new to our project, we recommend the verl version.
Please follow OpenRLHF's guidance to configure the required environment, then run `pip install -r requirements.txt`.
Please refer to verl's official installation guide.
We train the PRM in two stages using TRL and a preprocessed PRM800K dataset. In the first stage, we freeze the LLM and train only the last score layer (an MLP) with a 1e-4 learning rate for 3 epochs. In the second stage, we unfreeze the LLM and fine-tune all parameters with a 1e-6 learning rate for 1 epoch. The resulting PRM is released on HuggingFace.
```bash
cd PRM
# stage 1: freeze the LLM, train only the score layer
bash train_stage_1.sh
# stage 2: unfreeze the LLM, fine-tune all parameters
bash train_stage_2.sh
```
Switch to the openrlhf branch and run the following command. The parameter `reward_mode` in the script controls the reward type and can be set to `PRM`, `VR`, or `PRMVR`:

```bash
bash examples/scripts/train_pure.sh
```
The script uses Ray + vLLM for rollout acceleration, with the first 4 GPUs allocated to the actor, the initial actor (reference model), and the PRM; the remaining GPUs are used for the vLLM engines. This setup works with 5 to 8 GPUs; just adjust the number of vLLM engines in the script accordingly.
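For reference, on an 8-GPU node the split described above could look like the sketch below. This is only an illustration that assumes `train_pure.sh` forwards OpenRLHF's standard Ray/vLLM resource flags; the exact flags and the 2/1/1 split among actor, reference model, and PRM inside the script may differ.

```bash
# Illustrative 8-GPU layout, not the verbatim contents of train_pure.sh:
# 4 GPUs for actor + reference model + PRM, 4 GPUs for vLLM rollout engines.
python -m openrlhf.cli.train_ppo_ray \
  --actor_num_nodes 1 --actor_num_gpus_per_node 2 \
  --ref_num_nodes 1 --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 1
  # (model, data, and PURE-specific arguments omitted)
```

With only 5 GPUs, the same layout would keep the first 4 GPUs for the models and set `--vllm_num_engines 1`.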
Switch to the verl branch and set the reward type in the config file:

- PURE-VR uses `reward_model.enable=False reward_model.reward_manager=prime`
- PURE-PRM uses `reward_model.enable=True reward_model.reward_manager=blank`
- PURE-PRM+VR uses `reward_model.enable=True reward_model.reward_manager=prime`

Then start training:

```bash
python -m verl.trainer.main_ppo
```
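Since verl's trainer is Hydra-based, these settings can typically also be supplied as command-line overrides instead of editing the config file. For example, a PURE-PRM+VR run might be launched like this (data, model, and other trainer overrides omitted for brevity):

```bash
# Example: select the PURE-PRM+VR reward setup via command-line overrides
python -m verl.trainer.main_ppo \
  reward_model.enable=True \
  reward_model.reward_manager=prime
```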
verl's hybrid engine allows higher GPU utilization than the OpenRLHF version.
We use the Qwen Math codebase for evaluation (i.e., pass@1 accuracy). For fairness, we completely prohibit solving problems by calling code, following SimpleRL. Please follow the instructions in `/eval` to run the evaluation.
- re-implementation on verl
- paper with more discussions and evaluations
- attempts to mitigate reward hacking for PRM (Online PURE)
If you find our code useful, we would appreciate it if you could cite our work:
```bibtex
@article{cheng2025stop,
  title={Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning},
  author={Cheng, Jie and Qiao, Ruixi and Li, Lijun and Guo, Chao and Wang, Junle and Xiong, Gang and Lv, Yisheng and Wang, Fei-Yue},
  journal={arXiv preprint arXiv:2504.15275},
  year={2025}
}
```
We implement our RL algorithm based on OpenRLHF and verl. We thank the developers of OpenRLHF and the author of SimpleRL for helpful discussions! In addition, we refer to TRL's and PRIME's code and hyperparameter values to varying degrees. Thanks to all of them for their wonderful work!