
OpenAI o1-preview and o1-mini #21

Open
EwoutH opened this issue Sep 17, 2024 · 9 comments

Comments

@EwoutH

EwoutH commented Sep 17, 2024

Very curious about OpenAI o1-preview and o1-mini MMLU-Pro scores. Opening this as a tracking issue that people can follow and where updates can be shared.

@billbradley

I headed to the issues page to post the same request, so instead I will cheer on this one :)

In case it may be helpful: compared with other LLMs, o1-preview is (1) remarkably slow, (2) remarkably expensive, and (3) remarkably different from (and sometimes better than) any other LLM I've used. I'm extremely interested in finding out how it performs on MMLU-Pro.

@billbradley

Hmm... This tweet suggests that the researchers at TIGER already evaluated o1-preview last month:
https://x.com/WenhuChen/status/1834605218018754581

We would love to see this data reflected in the GitHub repo!

@wenhuchen
Contributor

We were capped by the per-day request limit before, but we can definitely rerun it now with a higher quota.

@EwoutH
Author

EwoutH commented Dec 5, 2024

Curious if this can still be done, even if only for historical comparison once the full o1 is out.

These are the first "reasoning" models in history, so it would be nice to have their scores on record for future reference.

@EwoutH
Author

EwoutH commented Dec 17, 2024

Thanks for adding all the models in the last few days!

Regular o1 is now also available in the API. Might be a nice moment to test all three.

@wenhuchen
Contributor

It's very expensive. We are not sure whether we have the funds to run it.

@billbradley

You know, if the accuracy for o1 is in the neighborhood of 80%, then with 1600 samples the standard error of the estimate should be about:

sqrt(0.8 * 0.2 / 1600) = 0.01, i.e. about ±1 percentage point

So, what do you think about evaluating on a random subsample of 1600? The approximate answer might still tell us a lot.
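For what it's worth, here is a minimal sketch of that back-of-the-envelope check in Python (standard library only; the full benchmark size below is approximate):

```python
# Standard error of an estimated accuracy p measured on n samples.
import math
import random

p, n = 0.8, 1600  # assumed accuracy and proposed subsample size
stderr = math.sqrt(p * (1 - p) / n)
print(f"standard error ~= {stderr:.3f}")  # 0.010, i.e. about +/- 1 point

# Drawing the subsample: 1600 question indices chosen uniformly at random
# from the full benchmark (MMLU-Pro has roughly 12,000 test questions).
full_size = 12000  # approximate; use the actual dataset length in practice
subsample = random.sample(range(full_size), n)
```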

@EwoutH
Copy link
Author

EwoutH commented Dec 18, 2024

Would using the Batch API also help?
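The Batch API runs requests asynchronously within 24 hours at roughly half the price of synchronous calls, so it could cut the cost substantially. A minimal sketch of a submission, assuming the official `openai` Python client and that o1 is exposed through the batch endpoint (the file name, prompt, and `custom_id` scheme are illustrative):

```python
# Sketch of submitting MMLU-Pro requests through the OpenAI Batch API.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One JSONL line per request; "custom_id" lets responses be matched back
# to their MMLU-Pro questions afterwards.
with open("mmlu_pro_requests.jsonl", "w") as f:
    request = {
        "custom_id": "question-0",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "o1",
            "messages": [{"role": "user", "content": "<MMLU-Pro prompt>"}],
        },
    }
    f.write(json.dumps(request) + "\n")

# Upload the file and create the batch; results arrive within 24 hours.
batch_file = client.files.create(
    file=open("mmlu_pro_requests.jsonl", "rb"), purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

Once the batch finishes, `client.batches.retrieve(batch.id)` exposes an `output_file_id` from which the responses can be downloaded and matched back to their questions via `custom_id`.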

@billbradley

Of course, the landscape has changed since this question was first asked (with o1 fairly accessible, o3 in the wings, and other reasoning models in play), but it would still be very interesting to learn a little more about how these models perform on MMLU-Pro.
