NextBench is a collection of a wide variety of benchmarks for assessing the performance of LLMs, VLMs, and more.
This project aims to make it easy to run important benchmarks across multiple clients/SDKs/providers. It is powered by W&B Weave.
If you would like to see a benchmark/client/sdk/provider added, please open an issue.
This is a work in progress, but we will soon open it up for public contributions.
We have used the following benchmarks for NextBench. The list is not exhaustive, and we will be adding more benchmarks in the future.

Each benchmark must have a `question` and an `answer` column. If the original benchmark doesn't use these column names, we rename them in a post-processing step and upload the benchmark as a W&B Weave Dataset (a rough sketch of this step follows below). This allows us to consume the benchmarks in a consistent way and gives us better control over benchmark versions, especially for benchmarks that are updated frequently (e.g., MixEval).
An example evaluation command is shown below:
```bash
python eval.py --model-name gpt-4o-mini --scenario math500 --num-samples 100 --no-enable-cache
```
Depending on the `--scenario` flag, this loads the MATH500 or MMLU-Pro dataset and runs the evaluation using the exact-match metric.
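The evaluation itself is powered by W&B Weave. Below is a rough sketch of how such a run could be wired up with Weave's evaluation API; the model wrapper, scorer, and dataset name are illustrative assumptions, not the contents of `eval.py`.

```python
# Hypothetical sketch of a Weave-driven evaluation; the model wrapper, scorer,
# and dataset name are illustrative assumptions, not the contents of eval.py.
import asyncio

import weave
from openai import AsyncOpenAI

weave.init("nextbench")  # assumed W&B project name


class OpenAIChatModel(weave.Model):
    model_name: str = "gpt-4o-mini"

    @weave.op()
    async def predict(self, question: str) -> str:
        client = AsyncOpenAI()
        resp = await client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content


@weave.op()
def exact_match(answer: str, output: str) -> dict:
    # Exact-match metric: the model output must equal the reference answer.
    return {"correct": output.strip() == answer.strip()}


dataset = weave.ref("MATH500").get()  # assumed name of the published Weave Dataset
evaluation = weave.Evaluation(dataset=dataset, scorers=[exact_match])
asyncio.run(evaluation.evaluate(OpenAIChatModel()))
```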
- Scenarios are defined as classes in the `src/nextbench/scenarios` directory (a rough sketch of a scenario and its tracked prompt follows this list).
- Caching of results.
- Caching of datasets (because Weave caching is not working for datasets at the moment).
- System prompts are defined as `weave.StringPrompt` objects and published to W&B for better tracking.
- Configurable number of samples to evaluate from the dataset.
- OpenAI client.
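For illustration, a scenario and its tracked system prompt might look roughly like the sketch below; the class name, attributes, and prompt text are assumptions rather than the actual contents of `src/nextbench/scenarios`.

```python
# Hypothetical sketch of a scenario definition and its tracked system prompt;
# the class name, attributes, and prompt text are assumptions, not the
# repository's actual API.
import weave

weave.init("nextbench")  # assumed W&B project name

# System prompt tracked as a weave.StringPrompt and published to W&B.
math500_prompt = weave.StringPrompt("Solve the problem and return only the final answer.")
weave.publish(math500_prompt, name="math500_system_prompt")


class Math500Scenario:
    """Assumed shape of a scenario: dataset name, metric, and system prompt."""

    dataset_name = "MATH500"        # assumed name of the published Weave Dataset
    metric = "exact_match"          # assumed metric identifier
    system_prompt = math500_prompt  # the tracked prompt published above
```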
- Package everything as a CLI tool from the `eval.py` file.
- DSPy prompt optimization.
- Add more clients (e.g. Groq, Anthropic, Gemini, etc.)
- Add more scenarios (e.g. GPQA-Diamond, LiveCodeBench, etc.)
- Make it more configurable and user-friendly.
Citations
```bibtex
@misc{wang2024mmluprorobustchallengingmultitask,
title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
year={2024},
eprint={2406.01574},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.01574},
}
@misc{rein2023gpqagraduatelevelgoogleproofqa,
title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
year={2023},
eprint={2311.12022},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2311.12022},
}
@misc{jain2024livecodebenchholisticcontaminationfree,
title={LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},
author={Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
year={2024},
eprint={2403.07974},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2403.07974},
}
```