TULU3 MATH eval #465

ypwang61 · 2024-11-30T05:08:50Z

Hi, thanks for your great work. When I evaluate 'models--allenai--Llama-3.1-Tulu-3-8B' on the MATH dataset using this codebase, the accuracy is exactly zero. Is there a format mismatch in the evaluation? (With the same model, I obtain a ~88% exact-match rate on GSM8K.)

I guess the problem is caused by the stop_strings:

if not args.use_chat_format:
    stop_strings += ["\n"]

After commenting out these lines, I get ~45% accuracy.
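
To illustrate the suspected failure mode, here is a minimal sketch of how a "\n" stop string could cut a multi-line MATH solution off before the final answer appears. This is not the codebase's actual evaluation code: `truncate_at_stop`, `extract_boxed`, and the sample completion are hypothetical, and it assumes the answer is read from a `\boxed{...}` expression.

```python
import re

def truncate_at_stop(text, stop_strings):
    """Cut text at the earliest occurrence of any stop string (mimics stop sequences)."""
    cut = len(text)
    for s in stop_strings:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

def extract_boxed(text):
    """Return the contents of the last \\boxed{...}, or None (non-nested braces only)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

# Hypothetical multi-line completion, as a model might produce for a MATH problem.
completion = (
    "We have 2x + 3 = 11, so 2x = 8.\n"
    "Dividing by 2 gives x = 4.\n"
    "The answer is \\boxed{4}."
)

# With "\n" as a stop string, only the first line survives, so no answer is found.
with_newline_stop = truncate_at_stop(completion, ["\n"])
# Without it, the full solution is kept and the answer can be extracted.
without_newline_stop = truncate_at_stop(completion, [])

print(extract_boxed(with_newline_stop))     # None
print(extract_boxed(without_newline_stop))  # 4
```

If this is what happens, every prediction would be truncated to its first line and scored as wrong, which would be consistent with the all-zero accuracy.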


vwxyzjn · 2024-12-03T18:38:29Z

Hi, thanks for the issue. We use https://github.com/allenai/olmes to do evals. Would you like to check it out?
