
OpenAI o1-preview and o1-mini #21

Open
EwoutH opened this issue Sep 17, 2024 · 9 comments

Comments

@EwoutH

EwoutH commented Sep 17, 2024

Very curious about OpenAI o1-preview and o1-mini MMLU-Pro scores. Opening this as a tracking issue that people can follow and where updates can be shared.

@billbradley

I headed to the issues page to post the same request, so instead I will cheer on this one :)

In case it may be helpful: compared with other LLMs, o1-preview is (1) remarkably slow, (2) remarkably expensive, and (3) remarkably different from (and sometimes better than) any other LLM I've used. I'm extremely interested in finding out how it performs on MMLU-Pro.

@billbradley

Hmm... This tweet suggests that the researchers at TIGER already evaluated o1-preview last month:
https://x.com/WenhuChen/status/1834605218018754581

We would love to see this data reflected in the GitHub repo!

@wenhuchen
Contributor

We were capped by the per-day request limit before, but we can definitely rerun it now with a higher quota.

@EwoutH
Author

EwoutH commented Dec 5, 2024

Curious if this can still be done, even if only for historical comparison once the full o1 is out.

These are the first "reasoning" models in history, so it would be nice to have their scores on record for future reference.

@EwoutH
Author

EwoutH commented Dec 17, 2024

Thanks for adding all the models in the last few days!

Regular o1 is now also available in the API. Might be a nice moment to test all three.

@wenhuchen
Contributor

It's very expensive. We are not sure whether we have the funds to run it.

@billbradley

You know, if the accuracy for o1 is in the neighborhood of 80%, then with 1600 samples the standard error of the estimate should be about:

sqrt(0.8 * 0.2 / 1600) = 0.01, i.e. about ±1 percentage point

So, what do you think about evaluating on a random subsample of 1600? The approximate answer might still tell us a lot.
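For what it's worth, here is a minimal sketch of that back-of-the-envelope check in Python (standard library only; the full benchmark size below is approximate):

```python
# Standard error of an estimated accuracy p measured on n samples.
import math
import random

p, n = 0.8, 1600  # assumed accuracy and proposed subsample size
stderr = math.sqrt(p * (1 - p) / n)
print(f"standard error ~= {stderr:.3f}")  # 0.010, i.e. about +/- 1 point

# Drawing the subsample: 1600 question indices chosen uniformly at random
# from the full benchmark (MMLU-Pro has roughly 12,000 test questions).
full_size = 12000  # approximate; use the actual dataset length in practice
subsample = random.sample(range(full_size), n)
```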

@EwoutH
Copy link
Author

EwoutH commented Dec 18, 2024

Would using the Batch API also help?
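The Batch API runs requests asynchronously within 24 hours at roughly half the price of synchronous calls, so it could cut the cost substantially. A minimal sketch of a submission, assuming the official `openai` Python client and that o1 is exposed through the batch endpoint (the file name, prompt, and `custom_id` scheme are illustrative):

```python
# Sketch of submitting MMLU-Pro requests through the OpenAI Batch API.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One JSONL line per request; "custom_id" lets responses be matched back
# to their MMLU-Pro questions afterwards.
with open("mmlu_pro_requests.jsonl", "w") as f:
    request = {
        "custom_id": "question-0",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "o1",
            "messages": [{"role": "user", "content": "<MMLU-Pro prompt>"}],
        },
    }
    f.write(json.dumps(request) + "\n")

# Upload the file and create the batch; results arrive within 24 hours.
batch_file = client.files.create(
    file=open("mmlu_pro_requests.jsonl", "rb"), purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

Once the batch finishes, `client.batches.retrieve(batch.id)` exposes an `output_file_id` from which the responses can be downloaded and matched back to their questions via `custom_id`.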

@billbradley

Of course, the landscape has changed since this question was first asked (with o1 fairly accessible, o3 in the wings, and other reasoning models in play), but it would still be very interesting to learn a little more about how these models perform on MMLU-Pro.
