readme
karina-openai committed Oct 30, 2024
1 parent 235cad0 commit 1ebafa5
Showing 1 changed file with 24 additions and 24 deletions.
48 changes: 24 additions & 24 deletions README.md
@@ -4,31 +4,31 @@ We are open sourcing it so we can be transparent about the accuracy numbers we'r

## Benchmark Results

| Model | Prompt | MMLU | GPQA | MATH | HumanEval | MGSM[^5] | DROP[^5]<br>(F1, 3-shot) |
|:----------------------------:|:-------------:|:------:|:------:|:------:|:---------:|:------:|:------------------------:|
| **o1** | | | | MATH-500[^6] | | | |
| o1-preview | n/a[^7] | 90.8 | 73.3 | 85.5 | **`92.4`** | 90.8 | 74.8 |
| o1-mini | n/a | 85.2 | 60.0 | 90.0 | **`92.4`** | 89.9 | 83.9 |
| o1 (work in progress) | n/a | **`92.3`** | **`77.3`** | **`94.8`** | n/a | n/a | n/a |
| Model | Prompt | MMLU | GPQA | MATH | HumanEval | MGSM[^5] | DROP[^5]<br>(F1, 3-shot) | SimpleQA
|:----------------------------:|:-------------:|:------:|:------:|:------:|:---------:|:------:|:--------------------------:|:---------:|
| **o1** | | | | MATH-500[^6] | | | |
| o1-preview | n/a[^7] | 90.8 | 73.3 | 85.5 | **`92.4`** | 90.8 | 74.8 | **`42.4`** |
| o1-mini | n/a | 85.2 | 60.0 | 90.0 | **`92.4`** | 89.9 | 83.9 | 7.6 |
| o1 (work in progress) | n/a | **`92.3`** | **`77.3`** | **`94.8`** | n/a | n/a | n/a | n/a
| **GPT-4o** | | | | | | | |
| gpt-4o-2024-08-06 | assistant[^2] | 88.7 | 53.1 | 75.9 | 90.2 | 90.0 | 79.8 |
| gpt-4o-2024-05-13 | assistant | 87.2 | 49.9 | 76.6 | 91.0 | 89.9 | 83.7 |
| gpt-4o-mini-2024-07-18 | assistant | 82.0 | 40.2 | 70.2 | 87.2 | 87.0 | 79.7 |
| **GPT-4 Turbo and GPT-4** | | | | | | | |
| gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 | 86.0 |
| gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 |
| gpt-4-1106-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 |
| **Other Models (Reported)** | | | | | | | |
| [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) | unknown | 88.3 | 59.4 | 71.1 | 92.0 | **`91.6`** | **`87.1`** |
| [Claude 3 Opus](https://www.anthropic.com/news/claude-3-family) | unknown | 86.8 | 50.4 | 60.1 | 84.9 | 90.7 | 83.1 |
| [Llama 3.1 405b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 88.6 | 50.7 | 73.8 | 89.0 | **`91.6`** | 84.8 |
| [Llama 3.1 70b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 82.0 | 41.7 | 68.0 | 80.5 | 86.9 | 79.6 |
| [Llama 3.1 8b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 68.4 | 30.4 | 51.9 | 72.6 | 68.9 | 59.5 |
| [Grok 2](https://x.ai/blog/grok-2) | unknown | 87.5 | 56.0 | 76.1 | 88.4 | n/a | n/a |
| [Grok 2 mini](https://x.ai/blog/grok-2) | unknown | 86.2 | 51.0 | 73.0 | 85.7 | n/a | n/a |
| [Gemini 1.0 Ultra](https://goo.gle/GeminiV1-5) | unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 |
| [Gemini 1.5 Pro](https://goo.gle/GeminiV1-5) | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 |
| [Gemini 1.5 Flash](https://goo.gle/GeminiV1-5) | unknown | 77.9 | 38.6 | 40.9 | 71.5 | 75.5 | 78.4 |
| gpt-4o-2024-08-06 | assistant[^2] | 88.7 | 53.1 | 75.9 | 90.2 | 90.0 | 79.8 | 40.1 |
| gpt-4o-2024-05-13 | assistant | 87.2 | 49.9 | 76.6 | 91.0 | 89.9 | 83.7 | 39.0 |
| gpt-4o-mini-2024-07-18 | assistant | 82.0 | 40.2 | 70.2 | 87.2 | 87.0 | 79.7 | 9.5 |
| **GPT-4 Turbo and GPT-4** | | | | | | | |
| gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 | 86.0 | 24.2 |
| gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 | n/a
| gpt-4-1106-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 | n/a
| **Other Models (Reported)** | | | | | | | |
| [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) | unknown | 88.3 | 59.4 | 71.1 | 92.0 | **`91.6`** | **`87.1`** | 28.9 |
| [Claude 3 Opus](https://www.anthropic.com/news/claude-3-family) | unknown | 86.8 | 50.4 | 60.1 | 84.9 | 90.7 | 83.1 | 23.5 |
| [Llama 3.1 405b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 88.6 | 50.7 | 73.8 | 89.0 | **`91.6`** | 84.8 | n/a
| [Llama 3.1 70b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 82.0 | 41.7 | 68.0 | 80.5 | 86.9 | 79.6 | n/a
| [Llama 3.1 8b](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) | unknown | 68.4 | 30.4 | 51.9 | 72.6 | 68.9 | 59.5 | n/a
| [Grok 2](https://x.ai/blog/grok-2) | unknown | 87.5 | 56.0 | 76.1 | 88.4 | n/a | n/a | n/a
| [Grok 2 mini](https://x.ai/blog/grok-2) | unknown | 86.2 | 51.0 | 73.0 | 85.7 | n/a | n/a | n/a
| [Gemini 1.0 Ultra](https://goo.gle/GeminiV1-5) | unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 | n/a
| [Gemini 1.5 Pro](https://goo.gle/GeminiV1-5) | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 | n/a
| [Gemini 1.5 Flash](https://goo.gle/GeminiV1-5) | unknown | 77.9 | 38.6 | 40.9 | 71.5 | 75.5 | 78.4 | n/a

## Background
