update: leaderboard to front
lockon-n committed Oct 18, 2023
1 parent 6cd5960 commit 7218f25
Showing 1 changed file with 16 additions and 16 deletions.
README.md: 32 changes (16 additions & 16 deletions)
@@ -49,22 +49,22 @@ We release the benchmarking results on the pairwise response comparison and crit

For the **pairwise comparison task**, the metrics are the agreement rate with human preference and the consistency rate when the order of the two responses is swapped (not applicable to independent rating methods). For reward models, we manually search for the best "tie" threshold from 0 to 2.0 in steps of 0.01. (We slightly modified the code that extracts verdicts from the generated text, so the values differ slightly from those in our paper.) A sketch of these computations follows the diff below.

-| Model | Type | Generative | Agreement | Consistency |
-|-------|------|------------|-----------|-------------|
-| [GPT-4](https://openai.com/research/gpt-4) | Pairwise | <span style="color: green;">✔</span> | 62.28 | 86.28 |
-| [Auto-J](https://huggingface.co/GAIR/autoj-13b) | Pairwise | <span style="color: green;">✔</span> | 54.96 | 83.41 |
-| [Moss-RM](https://huggingface.co/fnlp/moss-rlhf-reward-model-7B-en) | Single | <span style="color: red;">×</span> | 54.31 | - |
-| [Ziya-RM](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward) | Single | <span style="color: red;">×</span> | 53.23 | - |
-| [Beaver-RM](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward) | Single | <span style="color: red;">×</span> | 52.37 | - |
-| [OASST-RM](https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2) | Single | <span style="color: red;">×</span> | 51.08 | - |
-| [LLaMA-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | Pairwise | <span style="color: green;">✔</span> | 46.12 | 69.90 |
-| [ChatGPT](https://openai.com/blog/chatgpt) | Pairwise | <span style="color: green;">✔</span> | 42.74 | 62.43 |
-| [Claude-2](https://www.anthropic.com/index/claude-2) | Pairwise | <span style="color: green;">✔</span> | 42.6 | 63.43 |
-| [SteamSHP](https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl) | Pairwise | <span style="color: green;">✔</span> | 40.59 | 65.59 |
-| [PandaLM](https://huggingface.co/WeOpenML/PandaLM-7B-v1) | Pairwise | <span style="color: green;">✔</span> | 39.44 | 66.88 |
-| [Vicuna-13B-v1.5](https://huggingface.co/lmsys/vicuna-13b-v1.5) | Pairwise | <span style="color: green;">✔</span> | 39.22 | 62.07 |
-| [WizardLM-13B-v1.5](https://huggingface.co/WizardLM/WizardLM-13B-V1.2) | Pairwise | <span style="color: green;">✔</span> | 36.35 | 57.69 |
-| [LLaMA-2-13B-Chat](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | Pairwise | <span style="color: green;">✔</span> | 29.81 | 48.56 |
+| Model | Type | Generative | Agreement | Consistency |
+|-------|------|------------|-----------|-------------|
+| [GPT-4](https://openai.com/research/gpt-4) | Pairwise | ✔️ | 62.28 | 86.28 |
+| [Auto-J](https://huggingface.co/GAIR/autoj-13b) | Pairwise | ✔️ | 54.96 | 83.41 |
+| [Moss-RM](https://huggingface.co/fnlp/moss-rlhf-reward-model-7B-en) | Single | | 54.31 | - |
+| [Ziya-RM](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward) | Single | | 53.23 | - |
+| [Beaver-RM](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward) | Single | | 52.37 | - |
+| [OASST-RM](https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2) | Single | | 51.08 | - |
+| [LLaMA-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | Pairwise | ✔️ | 46.12 | 69.90 |
+| [ChatGPT](https://openai.com/blog/chatgpt) | Pairwise | ✔️ | 42.74 | 62.43 |
+| [Claude-2](https://www.anthropic.com/index/claude-2) | Pairwise | ✔️ | 42.6 | 63.43 |
+| [SteamSHP](https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl) | Pairwise | ✔️ | 40.59 | 65.59 |
+| [PandaLM](https://huggingface.co/WeOpenML/PandaLM-7B-v1) | Pairwise | ✔️ | 39.44 | 66.88 |
+| [Vicuna-13B-v1.5](https://huggingface.co/lmsys/vicuna-13b-v1.5) | Pairwise | ✔️ | 39.22 | 62.07 |
+| [WizardLM-13B-v1.5](https://huggingface.co/WizardLM/WizardLM-13B-V1.2) | Pairwise | ✔️ | 36.35 | 57.69 |
+| [LLaMA-2-13B-Chat](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | Pairwise | ✔️ | 29.81 | 48.56 |



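To make the protocol described above the diff concrete, here is a minimal sketch of how the agreement rate, the consistency rate, and the reward-model "tie"-threshold search could be computed. This is not the repository's evaluation code: the `"A"`/`"B"`/`"tie"` label scheme, all function names, and the choice to select the threshold by maximizing agreement are assumptions for illustration.

```python
# A minimal sketch of the metrics described in the diff above.
# Labels, function names, and the threshold-selection criterion are assumptions.
from typing import List, Tuple

def agreement_rate(verdicts: List[str], human: List[str]) -> float:
    """Fraction of model verdicts ("A", "B", or "tie") matching human preference."""
    return sum(v == h for v, h in zip(verdicts, human)) / len(human)

def consistency_rate(original: List[str], swapped: List[str]) -> float:
    """Fraction of pairs whose verdict survives swapping the response order.

    A verdict of "A" on the original order should become "B" once the two
    responses are presented in the opposite order, and vice versa.
    """
    flip = {"A": "B", "B": "A", "tie": "tie"}
    return sum(flip[s] == o for o, s in zip(original, swapped)) / len(original)

def rm_verdicts(scores_a: List[float], scores_b: List[float],
                threshold: float) -> List[str]:
    """Turn independent reward-model scores into pairwise verdicts with a tie band."""
    verdicts = []
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) <= threshold:
            verdicts.append("tie")
        else:
            verdicts.append("A" if a > b else "B")
    return verdicts

def search_tie_threshold(scores_a: List[float], scores_b: List[float],
                         human: List[str]) -> Tuple[float, float]:
    """Grid-search the tie threshold over 0.00, 0.01, ..., 2.00 (as in the README)."""
    best_t, best_acc = 0.0, -1.0
    for i in range(201):
        t = i * 0.01
        acc = agreement_rate(rm_verdicts(scores_a, scores_b, t), human)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

This framing also shows why the consistency rate is not applicable to Single-type (independent rating) methods: scoring each response separately is order-invariant by construction, which is presumably why those rows carry "-" in the Consistency column.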
