Athene-70B-AWQ | score: 95.9 | 95% CI: (-0.7, 0.6) | average #tokens: 997
gemma-2-9b-it-SimPO | score: 90.9 | 95% CI: (-1.2, 1.1) | average #tokens: 1065
Qwen2.5-72B-Instruct-AWQ | score: 90.3 | 95% CI: (-1.0, 1.2) | average #tokens: 1043
mistral-large-instruct-2407-awq | score: 83.8 | 95% CI: (-1.8, 1.4) | average #tokens: 853
Llama-3-Instruct-8B-SimPO | score: 83.0 | 95% CI: (-1.2, 1.5) | average #tokens: 545
gemma-2-27b-it-FP8 | score: 82.0 | 95% CI: (-1.6, 1.5) | average #tokens: 799
T-lite-instruct-0.1 | score: 79.4 | 95% CI: (-1.6, 1.2) | average #tokens: 1056
Mistral-Small-Instruct-2409-awq | score: 78.2 | 95% CI: (-1.5, 1.3) | average #tokens: 826
SELM-Llama-3-8B-Instruct-iter-3 | score: 77.9 | 95% CI: (-1.6, 1.4) | average #tokens: 606
Qwen2.5-32B-Instruct-AWQ | score: 76.6 | 95% CI: (-2.3, 1.6) | average #tokens: 740
Vikhr-Nemo-12B-Instruct-R-21-09-24 | score: 76.6 | 95% CI: (-1.5, 1.5) | average #tokens: 925
Meta-Llama-3-70B-Instruct-GPTQ | score: 74.9 | 95% CI: (-1.5, 1.5) | average #tokens: 568
saiga_llama3_70b_abl_kto_m1_d2_awq | score: 74.8 | 95% CI: (-2.1, 1.8) | average #tokens: 673
Llama-3-70B-Instruct-AH-AWQ | score: 74.3 | 95% CI: (-1.8, 1.7) | average #tokens: 791
saiga_llama3_70b_abliterated-AWQ | score: 74.2 | 95% CI: (-1.6, 2.0) | average #tokens: 672
suzume-llama-3-8B-multilingual-orpo-borda-half | score: 71.7 | 95% CI: (-1.8, 1.9) | average #tokens: 978
Llama-3.1-70B-Instruct-AWQ | score: 71.7 | 95% CI: (-1.5, 1.8) | average #tokens: 761
llama-3.1-70b-instruct | score: 70.2 | 95% CI: (-1.5, 1.5) | average #tokens: 728
Meta-Llama-3-8B-Instruct-f16 | score: 69.3 | 95% CI: (-1.9, 1.8) | average #tokens: 560
gemma-2-9b-it | score: 67.0 | 95% CI: (-2.0, 1.8) | average #tokens: 760
saiga_llama3_8b__kto_m5_d3 | score: 58.1 | 95% CI: (-1.6, 2.0) | average #tokens: 867
suzume-llama-3-8B-multilingual | score: 54.0 | 95% CI: (-1.8, 1.8) | average #tokens: 767
aya-23-8B | score: 51.4 | 95% CI: (-1.9, 2.2) | average #tokens: 834
Phi-3-medium-128k-instruct | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 801
Mistral-Nemo-Instruct-2407 | score: 49.5 | 95% CI: (-1.7, 1.9) | average #tokens: 660
gemma-2-2b-it-abl | score: 48.8 | 95% CI: (-2.0, 1.5) | average #tokens: 783
Llama-3.1-8B-Instruct | score: 43.8 | 95% CI: (-1.9, 1.8) | average #tokens: 847
Qwen2-7B-Instruct | score: 35.0 | 95% CI: (-2.0, 2.1) | average #tokens: 554
Vikhr-7B-instruct_0.5 | score: 15.9 | 95% CI: (-1.6, 1.3) | average #tokens: 794
alpindale_gemma-2b-it | score: 8.8 | 95% CI: (-1.0, 1.2) | average #tokens: 425
*With the original arena-hard questions translated into Russian; Llama-3-70B-Instruct-AH-AWQ as judge.
Arena-Hard-Local is an adaptation of lm-sys/arena-hard-auto that uses a local model as the judge and supports multilingual evaluation (Russian by default). The repository uses a simplified evaluation prompt in which the judge does not have to answer the question itself before comparing the evaluated models' answers. This was done intentionally: the project is primarily intended for quick evaluation of checkpoints after fine-tuning, and we also noticed that a local 70B judge sometimes cannot handle the full prompt reliably. For a full evaluation, please use the original repository.
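Even with the simplified prompt, judging stays pairwise: the judge sees the question and two answers and emits a verdict label. A minimal sketch of extracting such a verdict, assuming the upstream arena-hard-auto [[A>>B]]-style labels (the exact label set used in this repository may differ):

import re

# Arena-hard-style pairwise verdict labels (assumed from upstream arena-hard-auto)
VERDICT_PATTERN = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")

def parse_verdict(judgment):
    """Return the first [[...]] verdict label found in the judge's output, or None."""
    match = VERDICT_PATTERN.search(judgment)
    return match.group(1) if match else None

print(parse_verdict("Assistant A is clearly better. [[A>>B]]"))  # -> A>>B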
Check out the lmsys blog post for more details about how Arena-Hard-Auto v0.1 works.
To evaluate models, we propose using Qwen2-72B-Instruct-AWQ or Llama-3-70B-Instruct-AH-AWQ, a judge modified for this purpose by training on the judgments of a stronger model, gpt-4-1106-preview. You can also configure the repository to work with any other judge model.
We use the original arena-hard questions translated into Russian: Vikhrmodels/arena_hard_ru. To use another language, replace the file data/arena-hard-v0.1/question.jsonl and make sure the resulting JSON is saved with the required encoding (see line 103 in gen_answer.py).
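For example, a minimal sketch of writing a replacement question.jsonl so that non-Latin characters stay readable (the field names follow the arena-hard-auto question format and are an assumption; check the shipped question.jsonl for the exact schema):

import json

# Hypothetical example question; verify field names against the shipped file
questions = [
    {
        "question_id": "q1",
        "category": "arena-hard-v0.1",
        "turns": [{"content": "Ваш вопрос здесь"}],
    },
]

with open("data/arena-hard-v0.1/question.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        # ensure_ascii=False keeps Cyrillic as-is instead of \uXXXX escapes
        f.write(json.dumps(q, ensure_ascii=False) + "\n")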
This code also uses a different default baseline model: Phi-3-medium-128k-instruct instead of gpt-4-0314. As the baseline, it always appears in the leaderboard with a score of 50.0 and a (0.0, 0.0) CI.
Llama-3-70B-Instruct-AH-AWQ as judge:
Llama-3-Instruct-8B-SimPO | score: 83.0 | 95% CI: (-1.5, 1.4) | average #tokens: 545
SELM-Llama-3-8B-Instruct-iter-3 | score: 77.9 | 95% CI: (-1.5, 1.7) | average #tokens: 606
suzume-llama-3-8B-multilingual-orpo-borda-half | score: 71.7 | 95% CI: (-1.7, 1.6) | average #tokens: 978
Meta-Llama-3-8B-Instruct-f16 | score: 69.3 | 95% CI: (-1.9, 2.1) | average #tokens: 560
suzume-llama-3-8B-multilingual | score: 54.0 | 95% CI: (-2.0, 1.7) | average #tokens: 767
aya-23-8B | score: 51.4 | 95% CI: (-2.1, 1.8) | average #tokens: 834
Phi-3-medium-128k-instruct | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 801
Qwen2-7B-Instruct | score: 35.0 | 95% CI: (-1.5, 2.2) | average #tokens: 554
Vikhr-7B-instruct_0.5 | score: 15.9 | 95% CI: (-1.1, 1.0) | average #tokens: 794
alpindale_gemma-2b-it | score: 8.8 | 95% CI: (-0.9, 0.9) | average #tokens: 425
Qwen2-72B-Instruct-AWQ as judge:
Llama-3-Instruct-8B-SimPO | score: 66.3 | 95% CI: (-2.0, 1.7) | average #tokens: 545
suzume-llama-3-8B-multilingual-orpo-borda-half | score: 65.1 | 95% CI: (-1.6, 1.7) | average #tokens: 978
SELM-Llama-3-8B-Instruct-iter-3 | score: 62.6 | 95% CI: (-1.9, 1.7) | average #tokens: 606
Meta-Llama-3-8B-Instruct-f16 | score: 57.5 | 95% CI: (-1.8, 1.7) | average #tokens: 560
suzume-llama-3-8B-multilingual | score: 54.7 | 95% CI: (-1.5, 1.7) | average #tokens: 767
Phi-3-medium-128k-instruct | score: 50.0 | 95% CI: (0.0, 0.0) | average #tokens: 801
aya-23-8B | score: 48.1 | 95% CI: (-2.0, 1.7) | average #tokens: 834
Qwen2-7B-Instruct | score: 40.8 | 95% CI: (-1.7, 1.8) | average #tokens: 554
Vikhr-7B-instruct_0.5 | score: 22.3 | 95% CI: (-1.3, 1.5) | average #tokens: 794
alpindale_gemma-2b-it | score: 10.3 | 95% CI: (-1.1, 0.9) | average #tokens: 425
git clone https://github.com/r4dm/arena-hard-local.git
cd arena-hard-local
pip install -r requirements.txt
pip install -r requirements-optional.txt # Optional dependencies (e.g., anthropic sdk)
Then install vLLM using the instructions for your platform.
Just run:
python3 show_result.py --judge-name Llama-3-70B-Instruct-AH-AWQ --baseline Phi-3-medium-128k-instruct
Running show_result.py will save generated battles into data/arena_hard_battles.jsonl and bootstrapping statistics into data/bootstrapping_results.jsonl. If you don't want to regenerate battles or bootstrapping statistics, simply toggle the arguments --load-battles or --load-bootstrap, respectively.
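For example, to recompute the leaderboard from previously saved battles and bootstrap statistics:
python3 show_result.py --judge-name Llama-3-70B-Instruct-AH-AWQ --baseline Phi-3-medium-128k-instruct --load-battles --load-bootstrap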
Fill in your API endpoint in config/api_config.yaml. We support OpenAI-compatible API servers. You can specify parallel to indicate the number of concurrent API requests (default: 1).
# example
aya-23-8B:
    model_name: /home/jupyter/models/aya-23-8B
    endpoints:
        - api_base: http://localhost:8010/v1
          api_key: token-abc123
    api_type: openai
    parallel: 16
You may use an inference engine such as vLLM or SGLang to host your model with an OpenAI-compatible API server. Change the path and model name to your own:
python3 -m vllm.entrypoints.openai.api_server --model /home/jupyter/models/aya-23-8B --api-key token-abc123 --port 8010
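Once the server is up, a quick sanity check with the OpenAI Python SDK (the base URL, API key, and model path must match the vLLM command above; the Russian prompt is just an illustration):

from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8010/v1", api_key="token-abc123")

response = client.chat.completions.create(
    model="/home/jupyter/models/aya-23-8B",
    messages=[{"role": "user", "content": "Привет! Как дела?"}],
    max_tokens=32,
)
print(response.choices[0].message.content)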
In config/gen_answer_config.yaml, add your model name to model_list:
bench_name: arena-hard-v0.1
temperature: 0.0
max_tokens: 4096
num_choices: 1
model_list:
    - [YOUR-MODEL-NAME]
Run the command to generate answers:
python3 gen_answer.py
A caching feature is implemented: the code will skip generating an answer when an answer/judgment for the same prompt already exists.
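Conceptually, the cache works roughly like this (an illustrative sketch, not the repository's exact code; the answer-file path and question_id field are assumptions based on the arena-hard-auto layout):

import json
import os

def load_answered_ids(answer_file):
    """Collect question_ids that already have an answer in a JSONL file."""
    answered = set()
    if os.path.exists(answer_file):
        with open(answer_file, encoding="utf-8") as f:
            for line in f:
                answered.add(json.loads(line)["question_id"])
    return answered

# Only generate answers for questions that are not cached yet
answered = load_answered_ids("data/arena-hard-v0.1/model_answer/aya-23-8B.jsonl")
questions = [{"question_id": "q1"}, {"question_id": "q2"}]  # stand-in data
to_generate = [q for q in questions if q["question_id"] not in answered]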
Again, change the path and model name to your own:
python3 -m vllm.entrypoints.openai.api_server --model /home/jupyter/models/Llama-3-70B-Instruct-AH-AWQ --api-key token-abc123 --port 8010
In config/judge_config.yaml, add your model name to model_list:
...
# Add your model below for evaluation
model_list:
    - aya-23-8B
    - [YOUR-MODEL-NAME]
Also set the judge model name in the same file.
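A minimal sketch of the relevant fields, assuming judge_config.yaml follows the upstream arena-hard-auto layout (verify the exact field names in your checkout):

# assumed fields, following upstream arena-hard-auto
judge_model: Llama-3-70B-Instruct-AH-AWQ
baseline: True
baseline_model: Phi-3-medium-128k-instruct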
Run the command to generate judgments:
python3 gen_judgment.py
Judgment caching is also implemented. It will skip generating judgments that have already been generated or that lack one of the model answers.
Output model win rates. Optionally, use --full-stats for detailed results.
python3 show_result.py --judge-name Llama-3-70B-Instruct-AH-AWQ --baseline Phi-3-medium-128k-instruct
You can review individual judgment results using our UI code:
python3 qa_browser.py --share