-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenAI o1-preview and o1-mini #21
Comments
I headed to the issues page to post the same request, so instead I will cheer on this one :) In case it may be helpful: as compared with other LLMs, o1-preview is (1) remarkably slow, (2) remarkably expensive and (3) remarkably different (and sometimes better) than any other LLM I've used. I'm extremely interested in finding out how it performs on MMLU-Pro. |
Hmm... This tweet would seem to suggest that the researchers at TIGER have already evaluated We would love to see this data reflected in the GitHub repo! |
We were capped by the request/day before. But we can definitely rerun it now with higher quota. |
Curious if this still can be done, even if only for a historical comparison once the full o1 is out. These are the first "reasoning" models in history, so it would be nice to have them for future reference. |
Thanks for adding all the models in the last few days! Regular o1 is now also available in the API. Might be a nice moment to test all three. |
It's very expensive. We are not sure whether we have the fund to run it. |
You know, if the accuracy for o1 is in the neighborhood of 80%, then with 1600 samples I think the standard deviation should be about:
So, what do you think about evaluating on a random subsample of 1600? The approximate answer might still tell us a lot. |
Would using the Batch API also help? |
Of course, the landscape has changed since this question was first asked (with |
Very curious about OpenAI o1-preview and o1-mini MMLU-Pro scores. Opening this issue as a tracking issue that people can follow and updates can be shared in.
The text was updated successfully, but these errors were encountered: