TL;DR: We enhance the mathematical reasoning ability of LLMs solely through Verifiable Reward filtering and the self-improvement training paradigm of DPO. The final model, Qwen2.5-7B-DPO-VP, demonstrates mathematical reasoning capabilities comparable to current RL-based approaches. The entire framework does not require model parallelism, enabling replication on a single A800 GPU.
- [2025/03/17] We further incorporated a PRM into additional self-improvement experiments. For more analysis, see the paper Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation.
- [2025/02/18] We released the code and the Hugging Face model.
Since 2025, reinforcement learning fine-tuning (ReFT) based on Verifiable Reward (VR) has been shown to significantly enhance the mathematical reasoning abilities of sufficiently capable base models. Examples include works like DeepSeek R1, SimpleRL-Zero, and PURE. These methods, based on RL algorithms like PPO or GRPO, still require substantial computational resources, ranging from 8 to dozens of A100 GPUs. Our goal is to enhance the mathematical reasoning abilities of LLMs using the same base models, but with more limited computational resources and without relying on external distillation.
Essentially, Direct Preference Optimization (DPO) shares the optimization objective of RL: it shifts the generation distribution of the LLM toward accepted answers and away from rejected ones within a given dataset. In DeepSeek R1, the Verifiable Reward is a discrete 0/1 signal indicating whether an output meets predefined correctness criteria. This maps naturally onto the chosen/rejected structure of DPO: correct samples can be treated as positive (chosen) and incorrect ones as negative (rejected), yielding Verifiable Pairs with which to optimize the model.
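For reference, the standard DPO objective applied to such Verifiable Pairs, with $y_w$ the accepted (correct) response and $y_l$ the rejected (incorrect) one, is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_{\mathrm{ref}}$ is the reference model (in iterative DPO, typically the model from the previous round) and $\beta$ controls the strength of the implicit KL constraint.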
However, vanilla DPO is a purely offline optimization algorithm and by itself does not support self-improvement. Drawing on the ideas of Iterative DPO, we propose a self-improvement process built on the Qwen2.5-Math-7B base model. In each iteration, we run a sampling-and-filtering step on a challenging set of 8K MATH prompts (the same prompts used by SimpleRL-Zero and PURE) to construct the preference dataset, as outlined below:
- In each round of sampling, we draw 8 samples per prompt and filter them into at most one preference pair per prompt to construct the current preference dataset (a code sketch follows this list):
  - Score each answer: Following the VR scoring method, assign 1 if the answer is correct and formatted, 0 if incorrect but formatted, and -1 if the answer is not in the correct format regardless of correctness.
  - Positive sample: If there is a sample with a score of 1, select the one with the longest token length as the positive sample. If there is no score of 1 but there are samples with a score of 0, select the longest 0-score sample as the positive sample. Otherwise, skip the prompt.
  - Negative sample: If there is a sample with a score of -1, randomly select one as the negative sample. If there are no -1 samples but there are 0-score samples, randomly select one as the negative sample. Otherwise, skip the prompt.
- After constructing the dataset, train the model with one epoch of DPO, and then repeat the process starting from the newly trained model.
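A minimal sketch of this filtering step is shown below. The function names, the sample interface, and the format/correctness checkers are simplified stand-ins for illustration, not the actual implementation in `sh/make_vr_pairs.sh`:

```python
import random
import re

def is_formatted(response: str) -> bool:
    """Simplified format check (stand-in for the real rule): the final answer must appear in \\boxed{...}."""
    return re.search(r"\\boxed\{.+?\}", response) is not None

def is_correct(response: str, reference: str) -> bool:
    """Simplified correctness check: compare the last boxed expression against the reference answer string."""
    boxed = re.findall(r"\\boxed\{(.+?)\}", response)
    return bool(boxed) and boxed[-1].strip() == reference.strip()

def vr_score(response: str, reference: str) -> int:
    """Verifiable Reward score: 1 = correct and formatted, 0 = wrong but formatted, -1 = badly formatted."""
    if not is_formatted(response):
        return -1
    return 1 if is_correct(response, reference) else 0

def make_vr_pair(prompt: str, responses: list[str], num_tokens: list[int], reference: str):
    """Build one (chosen, rejected) pair from the 8 responses to a prompt, or return None to skip it."""
    scored = list(zip(responses, num_tokens, [vr_score(r, reference) for r in responses]))
    ones  = [(r, n) for r, n, s in scored if s == 1]
    zeros = [(r, n) for r, n, s in scored if s == 0]
    negs  = [r for r, _, s in scored if s == -1]

    # Positive sample: longest score-1 response, else longest score-0 response, else skip the prompt.
    if ones:
        chosen = max(ones, key=lambda x: x[1])[0]
    elif zeros:
        chosen = max(zeros, key=lambda x: x[1])[0]
    else:
        return None

    # Negative sample: a random score -1 response, else a random remaining score-0 response, else skip.
    if negs:
        rejected = random.choice(negs)
    else:
        pool = [r for r, _ in zeros if r is not chosen]
        if not pool:
            return None
        rejected = random.choice(pool)

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```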
Following this process, we ran 6 rounds of DPO iteration with a temperature-based sampling schedule to increase data diversity: the sampling temperature was set to 0.7 in the first 3 rounds, 1.0 in rounds 4-5, and 1.2 in the final round.
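The overall collect-train loop and the temperature schedule can be summarized as in the sketch below. Passing the temperature to `sh/collect_data.sh` as a command-line argument is an assumption made for illustration; in the released scripts the temperature is edited in place, as described in the usage section.

```python
import subprocess

# Temperature schedule from the write-up: rounds 1-3 -> 0.7, rounds 4-5 -> 1.0, round 6 -> 1.2.
TEMPERATURES = [0.7, 0.7, 0.7, 1.0, 1.0, 1.2]

for round_idx, temperature in enumerate(TEMPERATURES, start=1):
    print(f"=== Self-improvement round {round_idx}, sampling temperature {temperature} ===")
    # 1. Sample 8 responses per prompt from the current model (temperature argument is hypothetical).
    subprocess.run(["bash", "sh/collect_data.sh", str(temperature)], check=True)
    # 2. Score the responses and build the Verifiable Pair dataset.
    subprocess.run(["bash", "sh/make_vr_pairs.sh"], check=True)
    # 3. Run one epoch of DPO on the new pairs; the resulting model seeds the next round.
    subprocess.run(["bash", "sh/train_dpo.sh"], check=True)
```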
Sampling and training were performed on 4 A800 GPUs, with DPO using only data parallelism, so in principle training can be done on a single 80GB GPU or even lower-end configurations. In our 4-GPU experiment, each round of sampling took approximately 2-2.5 hours and each round of training about 1 hour, so the entire process consumed roughly 80 A800 GPU-hours; on a single GPU, replication would take approximately 3 days.
The final model achieved an average score of 48.2 on five mathematical reasoning benchmarks, which is comparable to the performance of Qwen2.5-Math-7B-Instruct and other RL-based methods with similar RL data conditions.
All results are reported as pass@1 accuracy.
Pass@1 Acc. | MATH500 | Minerva Math | OlympiadBench | AMC23 | AIME24 | Avg. |
---|---|---|---|---|---|---|
Qwen2.5-Math-7B * | 64.8 | 15.4 | 25.6 | 37.5 | 16.7 | 32.0 |
Qwen2.5-Math-7B-Instruct * | 83.2 | 33.5 | 38.4 | 62.5 | 20.0 | 47.5 |
rStar-Math-7B ^ | 78.4 | - | 47.1 | 47.5 | 26.7 | - |
Eurus-2-7B-PRIME * | 74.0 | 39.7 | 35.6 | 57.5 | 23.3 | 46.0 |
**Qwen2.5-7B-Simple-RL-Zero** ^ | 77.2 | 33.5 | 37.6 | 62.5 | 33.3 | 48.8 |
**Qwen2.5-7B-PURE-VR** * | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
**Qwen2.5-7B-DPO-VP** | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |
In the table, all models are fine-tuned from the Qwen2.5-Math-7B base model. Bolded models are those improved with a self-improvement method on exactly the same prompts. Results marked with * are from our own evaluation, and results marked with ^ are taken from the corresponding model's technical report. Additionally, because Qwen's official evaluation code partitions the evaluation across GPUs, results may differ slightly when evaluating with different numbers of GPUs; our model and all reproduced results were evaluated on 4 A800 GPUs.
Data and GPU comparison of different approaches
| | Qwen2.5-Math-7B-Instruct | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL-Zero | Qwen2.5-7B-PURE-VR | Qwen2.5-7B-DPO-VP |
---|---|---|---|---|---|---|
Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
SFT Data | 2.5M (open-source and in-house) | ~7.3M (MATH, NuminaMath, etc.) | 230K | 0 | 0 | 0 |
RM Data | 618K (in-house) | ~7k (in-house) | 0 | 0 | 0 | 0 |
RM | Qwen2.5-Math-RM (72B) | None | Eurus-2-7B-SFT | None | None | None |
Self-improve Method | RL + ORM | MCTS + PPM | RL + PRM | RL + VR | RL + VR | DPO + VR |
Self-improve Data | 66K | ~3.647M | 150K | 8K | 8K | 8K |
GPUs | - | 80 H100 at most | 8 A100 | 40 H100 | 8 A100 | 1 A800 or even less |
Accuracy and Token Length
We plotted, over the iterative rounds, the accuracy on each benchmark together with the average response token length. Accuracy increased consistently, while the inference token length did not show an initial rise followed by a decline; instead, it remained within a relatively stable range throughout.
Across the different models we compared, output lengths were roughly similar.
Avg. Token Length | MATH500 | Minerva Math | OlympiadBench | AMC23 | AIME24 | Avg. |
---|---|---|---|---|---|---|
Qwen2.5-Math-7B* | 651 | 577 | 695 | 662 | 748 | 667 |
Qwen2.5-Math-7B-Instruct* | 641 | 650 | 886 | 911 | 1164 | 850 |
Eurus-2-7B-PRIME* | 655 | 822 | 897 | 1020 | 1164 | 912 |
Qwen2.5-7B-Simple-RL* | 588 | 775 | 801 | 767 | 952 | 777 |
Qwen2.5-7B-PURE-VR* | 626 | 646 | 863 | 850 | 1050 | 807 |
Qwen2.5-7B-DPO-VP | 654 | 691 | 924 | 886 | 1251 | 881 |
Aha Moment
We did not observe a significant "Aha Moment," which aligns with the findings of PURE. Even when using CoT prompts, Qwen2.5-Math still attempts to solve many problems with Python code, although it does not actually execute the code. We also noticed a re-evaluation phenomenon in the model's responses, such as "Let's re-evaluate ..."; this behavior is already evident in the Qwen2.5-Math-7B base model, which leads us to conclude that DPO does not make the base model spontaneously develop self-evaluation. Rather, the reward signal steers the model toward more frequent self-evaluation. To measure this reflective behavior, we plotted the average number of occurrences of the word "Let's" per response over the iterative process; compared to the base model, the iterated models show an increase.
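A small sketch of how this statistic can be computed from evaluation outputs is shown below; the JSONL layout, the `response` field name, and the file path are assumptions for illustration, not the format of the actual evaluation code.

```python
import json

def avg_lets_per_response(path: str) -> float:
    """Average number of occurrences of "Let's" per model response in a JSONL results file."""
    counts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            counts.append(record["response"].count("Let's"))  # assumed field name
    return sum(counts) / max(len(counts), 1)

# Hypothetical usage on one benchmark's output file:
print(avg_lets_per_response("outputs/math500.jsonl"))
```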
Comparing several different models, we found that Qwen2.5-Math-7B-Instruct produces significantly fewer occurrences of "Let's." This indicates a difference in reasoning paradigms between models fine-tuned directly from the base model and those trained with the high-quality data-driven SFT + RL-with-RM approach.
Avg. Occurrences of "Let's" | MATH500 | Minerva Math | OlympiadBench | AMC23 | AIME24 | Avg. |
---|---|---|---|---|---|---|
Qwen2.5-Math-7B* | 1.56 | 0.45 | 1.90 | 1.68 | 2.03 | 1.52 |
Qwen2.5-Math-7B-Instruct* | 0.40 | 0.13 | 0.67 | 0.70 | 0.87 | 0.55 |
Eurus-2-7B-PRIME* | 1.50 | 0.96 | 2.27 | 2.40 | 3.30 | 2.09 |
Qwen2.5-7B-Simple-RL* | 1.49 | 0.57 | 1.99 | 1.93 | 2.20 | 1.64 |
Qwen2.5-7B-PURE-VR* | 0.86 | 0.24 | 1.52 | 1.33 | 1.73 | 1.14 |
Qwen2.5-7B-DPO-VP | 1.75 | 1.07 | 2.44 | 2.60 | 3.13 | 2.20 |
Our code is implemented based on OpenRLHF. Please follow OpenRLHF's guidance to set up the required environment, then run `pip install -r requirements.txt`.
For one training cycle, run the commands below; then adjust the temperature in step 1 and start a new collect-train cycle.
```bash
# 1. collect 8K MATH data
bash sh/collect_data.sh
# 2. make the VR pairs dataset for DPO
bash sh/make_vr_pairs.sh
# 3. train the DPO model
bash sh/train_dpo.sh
# adjust the temperature in step 1, then start a new collect-train cycle
```
We used the Qwen2.5-Math evaluation codebase to report pass@1 accuracy.
```bash
bash sh/evaluate_all_bench.sh
```
- Explore more possibilities of combining long-chain data distillation with RL and non-RL methods...
If you find our code useful, we would appreciate it if you could cite our work:
@article{tu2025enhancing,
title={Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation},
author={Tu, Songjun and Lin, Jiahao and Tian, Xiangyu and Zhang, Qichao and Li, Linjing and Fu, Yuqian and Xu, Nan and He, Wei and Lan, Xiangyuan and Jiang, Dongmei and Zhao, Dongbin},
journal={arXiv preprint arXiv:2503.12854},
year={2025}
}
We implement our training algorithm based on OpenRLHF, and we thank the OpenRLHF developers and authors for helpful discussion. Thanks to all the teachers and students in the DRL Group at the Institute of Automation, Chinese Academy of Sciences for their help, and thanks to Jie Cheng from CASIA and Wei He from Fudan University for their advice.
Surprisingly, on February 18, 2025, we discovered that the RLHFlow team's work, Online-DPO-R1, had undertaken nearly identical efforts almost simultaneously (February 16, 2025). Our project originated from discussions during a group meeting on February 13, and prior to that, we had not reviewed the code or technical report of Online-DPO-R1. We are delighted by this coincidence, as it indirectly validates that we have caught up with the research frontier in the field of LLM reasoning. We also extend our gratitude to the RLHFlow team for their selfless contributions to the open-source community!