NextBench

NextBench is a collection of a wide variety of benchmarks for assessing the performance of LLMs, VLMs, and more.

This project aims to make it easy to run important benchmarks across multiple clients/SDKs/providers. It is powered by W&B Weave.

If you would like to see a benchmark/client/SDK/provider added, please open an issue.

This is a work in progress but we will soon be opening it up for public contributions.

Benchmarks

We use the following benchmarks in NextBench. The list is not exhaustive, and we will be adding more benchmarks in the future.

Each benchmark must have a question and answer column. If the original benchmark doesn't use these column names, we rename the columns as a post-processing step and upload the benchmark as a W&B Weave Dataset. This lets us consume the benchmarks in a consistent way and gives us better control over benchmark versions, especially for benchmarks that are updated frequently (e.g. MixEval).
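
As a rough illustration, the post-processing step could look like the sketch below. The project name, source dataset, and original column names here are assumptions for the example, not the actual NextBench code.

import weave
from datasets import load_dataset

weave.init("nextbench")  # assumed W&B project name

# Load the original benchmark and map its columns to the expected schema.
raw = load_dataset("HuggingFaceH4/MATH-500", split="test")  # assumed source dataset
rows = [{"question": ex["problem"], "answer": ex["answer"]} for ex in raw]

# Publish as a versioned W&B Weave Dataset.
weave.publish(weave.Dataset(name="math500", rows=rows))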

Run Evaluation

An example evaluation command is shown below:

python eval.py --model-name gpt-4o-mini --scenario math500 --num-samples 100 --no-enable-cache

This loads the dataset for the selected scenario (MATH500 here; MMLU-Pro is also available) and runs the evaluation using the exact match metric.
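
Under the hood, an evaluation like this maps naturally onto W&B Weave's Evaluation API. The sketch below is illustrative only; the model wrapper, scorer, and dataset reference are assumptions rather than the actual NextBench implementation.

import asyncio

import weave
from openai import AsyncOpenAI

weave.init("nextbench")  # assumed W&B project name


class GPT4oMini(weave.Model):  # hypothetical model wrapper
    model_name: str = "gpt-4o-mini"

    @weave.op()
    async def predict(self, question: str) -> str:
        client = AsyncOpenAI()
        resp = await client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content


@weave.op()
def exact_match(answer: str, output: str) -> dict:
    # `answer` comes from the dataset row, `output` is the model prediction.
    return {"correct": output.strip() == answer.strip()}


dataset = weave.ref("math500").get()  # the published Weave Dataset
evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(GPT4oMini()))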

TODOs

  • Scenarios are defined as classes in the src/nextbench/scenarios directory.
  • Caching of results.
  • Caching of datasets (because Weave caching is not working for datasets atm).
  • System prompts are defined as weave.StringPrompt and published to W&B for better tracking (see the sketch after this list).
  • Configurable number of samples to evaluate from the dataset.
  • OpenAI client
  • Package everything as a CLI tool from the eval.py file.
  • DSPy prompt optimization.
  • Add more clients (e.g. Groq, Anthropic, Gemini, etc.)
  • Add more scenarios (e.g. GPQA-Diamond, LiveCodeBench, etc.)
  • Make it more configurable and user friendly.
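
For reference, defining and publishing a system prompt as a weave.StringPrompt could look roughly like this. The prompt text and object name are made up for illustration.

import weave

weave.init("nextbench")  # assumed W&B project name

# A hypothetical system prompt for the math scenarios.
system_prompt = weave.StringPrompt(
    "Solve the problem and return only the final answer inside \\boxed{}."
)

# Publishing versions the prompt in W&B so evaluation runs can reference it.
weave.publish(system_prompt, name="math500-system-prompt")
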
Citations
@misc{wang2024mmluprorobustchallengingmultitask,
      title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark}, 
      author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
      year={2024},
      eprint={2406.01574},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.01574}, 
}
@misc{rein2023gpqagraduatelevelgoogleproofqa,
      title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark}, 
      author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
      year={2023},
      eprint={2311.12022},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2311.12022}, 
}
@misc{jain2024livecodebenchholisticcontaminationfree,
      title={LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code}, 
      author={Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
      year={2024},
      eprint={2403.07974},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2403.07974}, 
}
