- [2025/04/22] 🔥🔥 Released our paper on arXiv: https://arxiv.org/abs/2504.15275 🔥🔥
- [2025/03/24] We re-implement our algorithm based on verl. ✨✨ Key features: (1) ~50 additional metrics to comprehensively monitor the training process and stability, (2) a custom wandb workspace to monitor ~20 important metrics, (3) curriculum learning. ✨✨
- [2025/02/22] We release the Notion blog, which details our algorithm, the difference between gamma-decay and min-form credit assignment, examples of reward hacking, and so on.
- [2025/02/09] We release the training and evaluation code, wandb logs, and checkpoints. Paper is on its way!
Recently, verifiable reward (VR)-based reinforcement fine-tuning (ReFT) has brought a huge boost in LLM reasoning. Prior work, in contrast, has struggled to make process reward models (PRMs) work for RL fine-tuning, so we wonder: how far can a PRM actually take us, and how does it stack up against VR-based methods in reasoning performance and training cost?
To answer these questions, we present PURE (Process-sUpervised Reinforcement lEarning). Using Qwen2.5-Math-7B as the base model, we train a PRM on the PRM800K dataset, then fine-tune another Qwen2.5-Math-7B model using only 8K MATH prompts, process rewards from the PRM, and optional verifiable rewards. For the RL algorithm, we use the PPO loss with an RLOO advantage estimator. We improve credit assignment by using a weighted sum of the process rewards that implements min-form, rather than gamma-decayed summation-form, credit assignment (see the blog and paper for details).
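As a rough schematic (the notation here is ours, chosen for illustration rather than taken verbatim from the paper): given process rewards $r_t, r_{t+1}, \dots, r_T$ for the remaining steps of a trajectory and a discount factor $\gamma$, the two schemes assign the return of step $t$ as

$$
R_t^{\text{sum}} = \sum_{j=t}^{T} \gamma^{\,j-t}\, r_j
\qquad \text{vs.} \qquad
R_t^{\min} = \min_{t \le j \le T} r_j .
$$

Under the min form, a single low-reward future step bounds the credit of every earlier step instead of being diluted by many high-reward steps.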
📊 The final model achieves pass@1 accuracies of 82.6% on MATH500, 82.5% on AMC, and 53.3% on average across 5 benchmarks, beating Qwen2.5-Math-7B-Instruct, PRIME, and SimpleRL while using either <1/50 of the RL data or 1/5 of the compute resources.
All results are reported as pass@1 accuracy (%).

| Model | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Base | 13.3 | 71.8 | 47.5 | 29.8 | 35.1 | 39.5 |
| Qwen2.5-Math-7B-Instruct | 16.7 | 83.2 | 52.5 | 37.5 | 41.3 | 46.2 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
| Qwen2.5-7B-PURE-PRM+VR* | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
| Qwen2.5-7B-PURE-PRM | 16.7 | 81.8 | 60.0 | 38.2 | 44.7 | 49.3 |
| Qwen2.5-7B-PURE-VR | 23.3 | 79.4 | 60.0 | 36.8 | 41.8 | 48.3 |
*The SOTA model was trained on 8K MATH problems, of which only ~800 provide ground-truth final answers that can be used to compute VRs.
Note: Eurus-2-7B-PRIME and Qwen2.5-7B-SimpleRL-Zero are also based on Qwen2.5-Math-7B.
We implement our algorithm on two frameworks, OpenRLHF and verl, in two separate branches. If you are new to our project, we recommend the verl version.
Please follow OpenRLHF's guidance to configure the required environment, then run `pip install -r requirements.txt`.
Please refer to verl's official installation guide.
We train the PRM in two stages using TRL and a preprocessed PRM800K dataset. In the first stage, we freeze the LLM and train only the last score layer (an MLP) with a 1e-4 learning rate for 3 epochs. In the second stage, we unfreeze the LLM and fine-tune all parameters with a 1e-6 learning rate for 1 epoch. The resulting PRM is released on HuggingFace.
```bash
cd PRM
# stage 1: freeze the LLM, train only the score layer
bash train_stage_1.sh
# stage 2: unfreeze the LLM, fine-tune all parameters
bash train_stage_2.sh
```
Switch to the openrlhf branch and run the following command. The parameter `reward_mode` in the script controls the reward type and can be set to `PRM`, `VR`, or `PRMVR`:

```bash
bash examples/scripts/train_pure.sh
```
The script uses Ray + vLLM for rollout acceleration, with the first 4 GPUs allocated to the actor, the initial actor (reference model), and the PRM; the remaining GPUs are used for the vLLM engines. This setup works with 5 to 8 GPUs; just adjust the number of vLLM engines in the script accordingly.
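For reference, on an 8-GPU node the split described above could look like the sketch below. This is only an illustration that assumes `train_pure.sh` forwards OpenRLHF's standard Ray/vLLM resource flags; the exact flags and the 2/1/1 split among actor, reference model, and PRM inside the script may differ.

```bash
# Illustrative 8-GPU layout, not the verbatim contents of train_pure.sh:
# 4 GPUs for actor + reference model + PRM, 4 GPUs for vLLM rollout engines.
python -m openrlhf.cli.train_ppo_ray \
  --actor_num_nodes 1 --actor_num_gpus_per_node 2 \
  --ref_num_nodes 1 --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 1
  # (model, data, and PURE-specific arguments omitted)
```

With only 5 GPUs, the same layout would keep the first 4 GPUs for the models and set `--vllm_num_engines 1`.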
Switch to the verl branch and set the reward type in the config file:

- PURE-VR uses `reward_model.enable=False reward_model.reward_manager=prime`
- PURE-PRM uses `reward_model.enable=True reward_model.reward_manager=blank`
- PURE-PRM+VR uses `reward_model.enable=True reward_model.reward_manager=prime`

Then start training:

```bash
python -m verl.trainer.main_ppo
```
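Since verl's trainer is Hydra-based, these settings can typically also be supplied as command-line overrides instead of editing the config file. For example, a PURE-PRM+VR run might be launched like this (data, model, and other trainer overrides omitted for brevity):

```bash
# Example: select the PURE-PRM+VR reward setup via command-line overrides
python -m verl.trainer.main_ppo \
  reward_model.enable=True \
  reward_model.reward_manager=prime
```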
verl's hybrid engine allows higher GPU utilization than the OpenRLHF version.
We use the Qwen Math codebase for evaluation (i.e., pass@1 accuracy). For fairness, we completely prohibit solving problems by calling code, following SimpleRL. Please follow the instructions in `/eval` to run the evaluation.
- re-implementation on verl
- paper with more discussions and evaluations
- attempts to mitigate reward hacking for PRM (Online PURE)
If you find our code useful, we would appreciate it if you could cite our work:
```bibtex
@article{cheng2025stop,
  title={Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning},
  author={Cheng, Jie and Qiao, Ruixi and Li, Lijun and Guo, Chao and Wang, Junle and Xiong, Gang and Lv, Yisheng and Wang, Fei-Yue},
  journal={arXiv preprint arXiv:2504.15275},
  year={2025}
}
```
We implement our RL algorithm based on OpenRLHF and verl. We thank the developers of OpenRLHF and the author of SimpleRL for helpful discussions! In addition, we refer to TRL's and PRIME's code and hyperparameter values to varying degrees. Thanks to all of them for their wonderful work!