The full dataset is available at: https://huggingface.co/datasets/RewardMATH/RewardMATH
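The benchmark can also be loaded directly from the Hugging Face Hub. Below is a minimal sketch using the `datasets` library; the split and column names are not documented here, so inspect them after loading (they may differ from the JSON files under `dataset/benchmark/`).

```python
# Minimal sketch: load RewardMATH from the Hugging Face Hub.
# Split and column names are assumptions to verify by inspection.
from datasets import load_dataset

dataset = load_dataset("RewardMATH/RewardMATH")
print(dataset)                   # show available splits and columns
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])   # peek at one example
```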
```bash
pip install -r requirements.txt
```
To evaluate results, MARIO EVAL needs to be installed.
```bash
git clone https://github.com/MARIO-Math-Reasoning/MARIO_EVAL.git
cd MARIO_EVAL
cd latex2sympy && pip install . && cd ..
pip install -e .
```
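As a quick sanity check that MARIO EVAL installed correctly, you can call its answer-equivalence checker. The import below assumes the `math_evaluation.is_equiv` interface from the MARIO EVAL README; adjust it if the package layout differs.

```python
# Sanity check for the MARIO EVAL installation (assumed interface).
from math_evaluation import is_equiv

# Equivalent LaTeX and numeric answers should compare as equal.
print(is_equiv("\\frac{1}{2}", "0.5"))  # expected: True
```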
To run API models (e.g., claude-3-5-sonnet-20240620) using direct assessment, run:
```bash
python src/inference_reward.py \
    --input_path=dataset/benchmark/RewardMATH_direct.json \
    --save_path=YOUR_SAVE_PATH \
    --model_name=claude-3-5-sonnet-20240620 \
    --api_key=YOUR_API_KEY \
    --prompt_dir=prompt/experiments_prompts.yaml \
    --prompt_key=llm_judgement \
    --model_type=generative \
    # --num_sample=10
```
To run models with vLLM (e.g., meta-llama/Meta-Llama-3-70B-Instruct), run:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python src/inference_reward.py \
    --input_path=dataset/benchmark/RewardMATH_direct.json \
    --save_path=YOUR_SAVE_PATH \
    --model_name=meta-llama/Meta-Llama-3-70B-Instruct \
    --prompt_dir=prompt/experiments_prompts.yaml \
    --prompt_key=llm_judgement \
    --model_type=generative \
    --num_gpus 4 \
    # --num_sample=10
```
To run API models (e.g., claude-3-5-sonnet-20240620) using pairwise comparison, run:
```bash
python src/inference_reward.py \
    --input_path=dataset/benchmark/RewardMATH_pairwise.json \
    --save_path=YOUR_SAVE_PATH \
    --model_name=claude-3-5-sonnet-20240620 \
    --api_key=YOUR_API_KEY \
    --pairwise_exp \
    --prompt_dir=prompt/experiments_prompts.yaml \
    --prompt_key=llm_judgement_pair \
    --model_type=generative \
    # --num_sample=10
```
To run models with vLLM (e.g., meta-llama/Meta-Llama-3-70B-Instruct), run:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python src/inference_reward.py \
    --input_path=dataset/benchmark/RewardMATH_pairwise.json \
    --save_path=YOUR_SAVE_PATH \
    --model_name=meta-llama/Meta-Llama-3-70B-Instruct \
    --pairwise_exp \
    --prompt_dir=prompt/experiments_prompts.yaml \
    --prompt_key=llm_judgement_pair \
    --model_type=generative \
    --num_gpus 4 \
    # --num_sample=10
```
To run classifier-based reward models, run:
```bash
CUDA_VISIBLE_DEVICES=0 python src/inference_reward.py \
    --input_path=dataset/benchmark/RewardMATH_direct.json \
    --save_path=YOUR_SAVE_PATH \
    --model_name=RLHFlow/ArmoRM-Llama3-8B-v0.1 \
    --model_type=classifier \
    --trust_remote_code \
    --batch_size=8 \
    # --num_sample=10
```
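For reference, a classifier-based reward model of this kind can also be queried on its own, outside `inference_reward.py`. The sketch below follows the usage pattern from the ArmoRM model card; the chat-template call and the `output.score` attribute are assumptions from that card, not from this repository's code.

```python
# Minimal sketch: score one (problem, solution) pair with ArmoRM.
# Follows the ArmoRM model card; `output.score` is assumed from that card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4, so the answer is 4."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    output = model(input_ids)
print(output.score.float().item())  # scalar reward for this solution
```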
To run process reward models (PRMs), run:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python src/inference_reward.py \
    --input_path=dataset/benchmark/RewardMATH_direct.json \
    --save_path=YOUR_SAVE_PATH \
    --model_name=peiyi9979/math-shepherd-mistral-7b-prm \
    --model_type=prm \
    # --num_sample=10
```
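Unlike the classifier above, a PRM scores every reasoning step rather than the whole solution. As an illustration, the sketch below follows the example on the Math-Shepherd model card, where steps are marked with the tag `ки` and the probability of the `+` token at each tag is read off as the step score; these tags and tokens are assumptions taken from that card, not from this repository's code.

```python
# Minimal sketch: step-level scoring with Math-Shepherd (per its model card).
# The "ки" step tag and the "+" / "-" scoring tokens are assumed from that card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "peiyi9979/math-shepherd-mistral-7b-prm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

good, bad, step_tag = "+", "-", "ки"
candidate_tokens = tokenizer.encode(f"{good} {bad}")[1:]  # ids of "+" and "-"
step_tag_id = tokenizer.encode(f"{step_tag}")[-1]

question = "What is 2 + 2?"
solution = "Step 1: 2 + 2 = 4. ки\nStep 2: The answer is 4. ки"
input_ids = torch.tensor([tokenizer.encode(f"{question} {solution}")])

with torch.no_grad():
    logits = model(input_ids).logits[:, :, candidate_tokens]
    scores = logits.softmax(dim=-1)[:, :, 0]        # P("+") at every position
    step_scores = scores[input_ids == step_tag_id]  # keep positions of "ки"
print(step_scores)  # one score per reasoning step
```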
### Results of direct assessment (default)
```bash
python src/evaluate_results.py \
    --result_dir=YOUR_RESULTS_FILES \
    --eval_mode=our_reward
```
### Results of direct assessment (for PRM)
```bash
python src/evaluate_results.py \
    --result_dir=YOUR_RESULTS_FILES \
    --eval_mode=our_reward \
    --prm_mode
```
### Results of pairwise comparison
```bash
python src/evaluate_results.py \
    --result_dir=YOUR_RESULTS_FILES \
    --eval_mode=our_reward \
    --pairwise
```
```
├── dataset/                 <- Stores the benchmarks (RewardMATH, RewardMATH-pairwise, RewardBench) and math test sets.
├── prompt/                  <- Contains files related to input prompts.
├── results/                 <- Holds output files from model evaluations.
├── scripts/                 <- Includes various scripts used for experiments.
├── src/                     <- Source code directory for the project.
│   ├── evaluation/          <- Contains scripts and modules for model evaluation.
│   ├── models/              <- Module for existing reward models.
│   ├── utils/               <- Utility scripts and helper functions used across the project.
│   ├── evaluate_results.py  <- Computes results for evaluating reward models on the benchmark and other experiments.
│   ├── exp_best_of_n.py     <- Script for Best-of-N sampling.
│   ├── inference_reward.py  <- Runs inference with reward models.
│   └── prompts.py           <- Manages and generates prompts.
└── tests.py                 <- Unit tests for the project's modules, ensuring the correctness of the code.
```
The underlying codebase for evaluating reward models is adapted from RewardBench.
```bibtex
@misc{kim2024evaluatingrobustnessrewardmodels,
      title={Evaluating Robustness of Reward Models for Mathematical Reasoning},
      author={Sunghwan Kim and Dongjin Kang and Taeyoon Kwon and Hyungjoo Chae and Jungsoo Won and Dongha Lee and Jinyoung Yeo},
      year={2024},
      eprint={2410.01729},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01729},
}
```