    Repositories list

    • GPassK

      Public
      Official Repository of `Are Your LLMs Capable of Stable Reasoning?`
      01300 Updated Dec 26, 2024
    • OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
      Python
      Apache License 2.0
      4634.4k22825 Updated Dec 25, 2024
    • Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks
      Python
      Apache License 2.0
      2201.5k618 Updated Dec 25, 2024
    • Jupyter Notebook
      69840 Updated Dec 16, 2024
    • ANAH

      Public
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2
      Python
      Apache License 2.0
      22700 Updated Dec 11, 2024
    • [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      Apache License 2.0
      23710 Updated Nov 29, 2024
    • 47900 Updated Nov 26, 2024
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      Apache License 2.0
      64910 Updated Nov 6, 2024
    • ProSA

      Public
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      Apache License 2.0
      22000 Updated Oct 22, 2024
    • Python
      Apache License 2.0
      1200 Updated Sep 23, 2024
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      Apache License 2.0
      1017040 Updated Sep 1, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      MIT License
      55000 Updated Sep 1, 2024
    • storage

      Public
      Apache License 2.0
      0000 Updated Aug 18, 2024
    • Demo data of CompassBench
      3420 Updated Aug 7, 2024
    • CIBench

      Public
      Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
      Python
      Apache License 2.0
      21000 Updated Jul 19, 2024
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      Apache License 2.0
      18960 Updated Jul 12, 2024
    • .github

      Public
      1000 Updated May 31, 2024
    • DevEval

      Public
      A Comprehensive Benchmark for Software Development.
      Python
      Apache License 2.0
      58600 Updated May 30, 2024
    • CodeBench

      Public
      0200 Updated May 21, 2024
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      25100 Updated Apr 22, 2024
    • T-Eval

      Public
      [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
      Python
      Apache License 2.0
      15250342 Updated Apr 3, 2024
    • Code for the paper "Evaluating Large Language Models Trained on Code"
      Python
      MIT License
      355200 Updated Mar 14, 2024
    • Apache License 2.0
      24230 Updated Mar 8, 2024
    • A multi-language code evaluation tool.
      Python
      Apache License 2.0
      81901 Updated Jan 26, 2024
    • evalplus

      Public
      EvalPlus for rigorous evaluation of LLM-synthesized code
      Python
      Apache License 2.0
      114100 Updated Dec 20, 2023
    • A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
      Python
      Apache License 2.0
      81765120 Updated Dec 15, 2023
    • LawBench

      Public
      Benchmarking Legal Knowledge of Large Language Models
      Python
      Apache License 2.0
      4427430 Updated Nov 13, 2023
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      Apache License 2.0
      614310 Updated Nov 2, 2023
    • Sphinx Theme for OpenCompass - Modified from PyTorch
      CSS
      MIT License
      138000 Updated Aug 30, 2023