HAL: The Holistic Agent Leaderboard for Reproducible Agent Evaluation


This repository provides a standardized harness for evaluating AI agents across benchmarks. It supports several benchmarks out of the box and lets users add new agents and benchmarks. A unified CLI runs evaluations across all benchmarks and agents, and the harness integrates with Weave for logging and cost tracking and with the official Holistic Agent Leaderboard (HAL) for sharing evaluation results.

Features

  • Unified hal-eval CLI across all benchmarks and agent types

    • HAL supports SWE-bench Verified, USACO, AppWorld, CORE-bench, AgentHarm, GAIA, and Cybench, with support for more coming soon
    • Run your own or existing agents on HAL with the same CLI across benchmarks (see How Do I Run Evaluations?)
  • Run evaluations locally or in the cloud, fully parallelized

    • Local execution with conda environment isolation
    • Azure VM support for running evaluations in the cloud
    • Configurable concurrency for parallel evaluation
  • Automatic logging and monitoring

    • Integration with Weave for detailed cost tracking and usage metrics
    • Automatic logging of agent traces
  • No constraints on agent implementation or agent framework

    • Write agents in whatever way and with whatever framework you prefer (see How Do I Develop My Own Agents?)
    • Support for both custom agents and Inspect AI solvers
    • Flexible agent configuration through command-line arguments in hal-eval
  • Share and access agent traces

  • HAL leaderboard integration

Table of Contents

  1. Setup
  2. Which Benchmarks Are Supported?
  3. How Do I Run Evaluations? (With Examples)
  4. How Do I Develop My Own Agents?
  5. How to Reproduce Existing Agents on HAL?
  6. How Do I Add a Benchmark?
  7. How Can I Submit My Results to the HAL Leaderboards?
  8. About
  9. Repository Structure

Setup

  1. Clone the repository:

    git clone --recursive https://github.com/benediktstroebl/hal-harness.git
    cd hal-harness
  2. Create conda environment:

    conda create -n hal python=3.11
    conda activate hal
  3. Install the hal package:

    pip install -e .
  4. Create a .env file:

    cp .env.template .env

    Add your API keys (HuggingFace, Weave, OpenAI, or other model providers as needed) to the .env file. See .env.template for details; an illustrative example appears after this list.

  5. Install Model Provider Dependencies:

    For Inspect AI benchmarks, you'll need to install the appropriate Python SDK for your chosen model provider:

    # For OpenAI models
    pip install openai
    
    # For Anthropic models
    pip install anthropic
  6. Optional: Azure VM setup. If you plan to use Azure VMs for evaluation, add the following to your .env:

    AZURE_SUBSCRIPTION_ID=your_subscription_id
    AZURE_RESOURCE_GROUP_NAME=your_resource_group
    AZURE_LOCATION=your_location
    SSH_PUBLIC_KEY_PATH=/path/to/your/ssh/key.pub
    SSH_PRIVATE_KEY_PATH=/path/to/your/ssh/key
    NETWORK_SECURITY_GROUP_NAME=your_nsg_name
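
For step 4, here is a minimal sketch of what a populated .env might contain. The variable names below are illustrative assumptions, not the authoritative list; consult .env.template for the exact names your setup requires.

# Illustrative .env contents -- variable names are assumptions, check .env.template
OPENAI_API_KEY=sk-...        # or the key for whichever model provider you use
HF_TOKEN=hf_...              # HuggingFace token, used when uploading results
WANDB_API_KEY=...            # Weights & Biases key used by Weave logging (assumed)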
    

Which Benchmarks Are Supported?

SWE-bench Verified (Mini)

  • Evaluates code generation and bug-fixing capabilities
  • Full dataset (swebench_verified) or mini version (swebench_verified_mini)
  • The mini version is a subset of 50 randomly selected problems from the full dataset
  • Supports both local and VM execution
  • The task IDs included in SWE-bench Verified (Mini) can be found here

USACO

  • Programming competition problems
  • Supports both local and VM execution

For USACO, you will need to download and extract the USACO dataset. This can be done with the following steps:

  1. Download the USACO dataset from here
  2. Unzip the dataset and move the data directory to hal/benchmarks/USACO/, so that a data/ directory exists at hal/benchmarks/USACO/data/ (see the sketch below)
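The same layout can be produced programmatically. A minimal sketch, assuming the downloaded archive is saved as usaco_data.zip in the repository root and contains a top-level data/ directory (the actual filename and layout may differ):

# Illustrative helper -- the archive name and layout below are assumptions
import zipfile
from pathlib import Path

archive = Path("usaco_data.zip")          # assumed filename of the downloaded dataset
target = Path("hal/benchmarks/USACO")     # directory the harness reads from
target.mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)                 # assumes the archive contains a top-level data/

# The harness expects hal/benchmarks/USACO/data/ to exist after extraction
assert (target / "data").is_dir(), "expected data/ inside hal/benchmarks/USACO/"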

AppWorld

  • A controllable world of apps and people for benchmarking interactive coding agents
  • Requires VM execution (the --vm flag is mandatory)

CORE-bench

  • Computational reproducibility benchmark for agents on real scientific papers
  • Supports fully parallelized evaluation on Azure VMs
  • For detailed instructions on running CORE-bench evaluations, see the CORE-bench repository

Inspect AI Benchmarks

  • Supports a number of Inspect AI agent tasks (inspect_evals/<task_name>)
  • Two agent types supported:
    1. Inspect Solver agents (using the @solver decorator)
    2. Custom external agents
  • Inspect solvers are run locally by default, with orchestration handled by inspect_ai. Custom agents are run by the harness and can execute either locally or on Azure VMs via the --vm flag.

GAIA

  • General AI assistants benchmark
  • More details on the Inspect AI implementation here

Cybench

  • Cybersecurity agent tasks
  • Does not support arm64 machines
  • More details on the Inspect AI implementation here
  • Additional Docker configuration is required for Cybench (see below)

For Cybench, you'll need to configure Docker's default address pools to avoid IP address conflicts when running the harness. Follow these steps:

  1. Edit or create the daemon.json file:

    sudo nano /etc/docker/daemon.json
  2. Add or modify the default-address-pools configuration. For example:

    {
      "default-address-pools": [
        {
          "base": "172.17.0.0/16",
          "size": 24
        },
        {
          "base": "172.18.0.0/16",
          "size": 24
        },
        {
          "base": "172.19.0.0/16",
          "size": 24
        }
      ]
    }
  3. Save the file and restart the Docker daemon:

    sudo systemctl restart docker

Now the harness should be able to run Cybench.

AgentHarm

  • Benchmark for evaluating agent behavior on both benign and potentially harmful tasks
  • Two variants available:
    • inspect_evals/agentharm: evaluates agent behavior on potentially harmful tasks
    • inspect_evals/agentharm_benign: evaluates agent behavior on benign tasks
  • When using the default Inspect agent with benign tasks, set -A task_name=benign
  • Example usage:
# For benign tasks
hal-eval --benchmark inspect_evals/agentharm_benign \
  --agent_dir agents/inspect/agentharm \
  --agent_function agentharm.default_agent \
  --agent_name "Agent (gpt-4o-mini-2024-07-18)" \
  -A model_name=openai/gpt-4o-mini-2024-07-18 \
  -A task_name=benign

# For potentially harmful tasks
hal-eval --benchmark inspect_evals/agentharm \
  --agent_dir agents/inspect/agentharm \
  --agent_function agentharm.default_agent \
  --agent_name "Agent (gpt-4o-mini-2024-07-18)" \
  -A model_name=openai/gpt-4o-mini-2024-07-18

How Do I Run Evaluations?

The harness uses a command-line interface (CLI) to run evaluations. The basic command structure is:

hal-eval --benchmark <benchmark_name> --agent_dir <agent_directory> --agent_function <agent_function> --agent_name <agent_name> [OPTIONS]

Core Options

  • --benchmark <benchmark_name>: The name of the benchmark to run. Supported benchmarks:
    • swebench_verified: Full SWE-bench Verified dataset
    • swebench_verified_mini: Mini version with 50 randomly selected problems
    • usaco: USACO programming competition problems
    • appworld_test_normal: AppWorld normal test suite
    • appworld_test_challenge: AppWorld challenge test suite
    • inspect_evals/gaia: Gaia general AI assistants benchmark
    • inspect_evals/cybench: Cybersecurity agent tasks
    • inspect_evals/agentharm: AgentHarm
    • inspect_evals/agentharm_benign: AgentHarm benign evaluation
  • --agent_dir <agent_directory>: Path to the directory containing your agent's code
  • --agent_function <agent_function>: The name of the agent's main function (e.g., agent.run if agent.py in the agent directory contains def run(): ...); a sketch follows this list
  • --agent_name <agent_name>: A descriptive name for your agent (used in logging/leaderboard) (e.g., My Agent (gpt-4o))
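
To make the agent interface concrete, here is a minimal sketch of an agent entry point. It is illustrative only: it assumes the function receives a dictionary of tasks keyed by task ID, that keyword arguments such as model_name arrive from -A key=value flags, and that it returns a mapping from task ID to submission string. See agents/README.md for the authoritative interface.

# agents/my_agent/agent.py -- illustrative sketch, not the canonical interface
from typing import Any


def run(tasks: dict[str, Any], **kwargs: Any) -> dict[str, str]:
    """Toy agent entry point (would be referenced as --agent_function agent.run).

    Assumptions: `tasks` maps task IDs to task inputs, keyword arguments come
    from -A flags, and the return value maps each task ID to the agent's answer.
    """
    model_name = kwargs.get("model_name", "gpt-4o-mini")  # hypothetical -A argument
    results: dict[str, str] = {}
    for task_id, task in tasks.items():
        # A real agent would call a model and/or tools here using `task`.
        results[task_id] = f"[{model_name}] placeholder answer for task {task_id}"
    return results

Such an agent would then be run with hal-eval --agent_dir <your_agent_dir> --agent_function agent.run plus any -A arguments it expects.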

Additional Options

  • -A <key>=<value>: Agent arguments passed to your agent function
  • -B <key>=<value>: Benchmark arguments passed to the benchmark
  • -I <key>=<value>: Inspect-specific arguments (for Inspect AI benchmarks)
  • --upload: Upload results to HuggingFace Hub
  • --max_concurrent <number>: Number of parallel tasks (default: 1)
  • --conda_env_name <env_name>: Conda environment for agent execution
  • --vm: Run evaluation on Azure VMs
  • --run_id <run_id>: Specify a run ID (useful for continuing runs)
  • --continue_run: Continue from a previous run (requires run_id)

Example Evaluations

  1. Running SWE-bench locally:
hal-eval --benchmark swebench_verified_mini \
  --agent_dir agents/swebench_example_agent/ \
  --agent_function main.run \
  --agent_name "My Agent (gpt-4o-mini)" \
  -A model_name=gpt-4o-mini \
  --max_concurrent 5
  2. Running USACO on Azure VM:
hal-eval --benchmark usaco \
  --agent_dir agents/usaco_example_agent/ \
  --agent_function main.run \
  --agent_name "USACO Solver (gpt-4o)" \
  --vm \
  --max_concurrent 5 \
  -A model_name=gpt-4o
  3. Running USACO with Amazon Bedrock models:
hal-eval --benchmark usaco \
  --agent_dir agents/usaco_bedrock_models/ \
  --agent_function main.run \
  --agent_name "USACO Solver (Claude 3.5 Sonnet)" \
  -A model_name=bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0 \
  -A prompt_template_path=agents/usaco_bedrock_models/prompt_templates/claude.txt \
  --max_concurrent 10

More details on how to run the Amazon Bedrock models can be found here.

Available Bedrock models and their corresponding prompt templates:

Model Name        | Model ID                                              | Prompt Template
Claude 3.5 Haiku  | bedrock/us.anthropic.claude-3-5-haiku-20241022-v1:0   | claude.txt
Claude 3.5 Sonnet | bedrock/us.anthropic.claude-3-5-sonnet-20241022-v2:0  | claude.txt
Claude 3 Sonnet   | bedrock/us.anthropic.claude-3-sonnet-20240229-v1:0    | claude.txt
Amazon Nova Pro   | bedrock/amazon.nova-pro-v1:0                          | nova.txt
Amazon Nova Lite  | bedrock/amazon.nova-lite-v1:0                         | nova.txt
Amazon Nova Micro | bedrock/amazon.nova-micro-v1:0                        | nova.txt
Llama-3.3 70B     | bedrock/us.meta.llama3-3-70b-instruct-v1:0            | llama.txt
  4. Running Inspect AI benchmark:
hal-eval --benchmark inspect_evals/gaia \
  --agent_dir agents/inspect/ \
  --agent_function gaia.default_agent \
  --agent_name "Gaia Agent (gpt-4o-mini-2024-07-18)" \
  -A model_name=openai/gpt-4o-mini-2024-07-18 \
  -I token_limit=4000 \
  -I temperature=0.4

Agent Naming Guidelines

Agent names should follow this format: Name (model1, model2). For example:

  • My Agent (gpt-4-0125-preview)
  • SWE-agent (claude-3.5-sonnet-20241022-v2)
  • Multi-Model Agent (gpt-4o-mini-2024-07-18, claude-3.5-sonnet-20241022-v2)

Guidelines:

  • Include exact model versions
  • Put models in parentheses
  • Separate multiple models with commas
  • Keep names concise
  • Don't include benchmark names

How to Reproduce Existing Agents on HAL?

See agents/RUN_AGENTS.md for detailed instructions on how to run existing agents across different benchmarks.

Note: We are actively working on adding support for more agents to enable easy reproduction of benchmark results. Currently, we support agents outlined in agents/RUN_AGENTS.md.

How Do I Develop My Own Agents?

See agents/README.md for details.

How Do I Add a Benchmark?

See hal/benchmarks/README.md for details.

How Can I Submit My Results to the HAL Leaderboards?

Results can be uploaded to the Holistic Agent Leaderboard (HAL) in several ways. To avoid benchmark contamination, we automatically encrypt the results before uploading.

  1. During Evaluation:

    hal-eval --benchmark <benchmark> ... --upload
  2. After Evaluation:

    # Upload all results for a benchmark
    hal-upload -B <benchmark_name>
    
    # Upload a single file
    hal-upload -F path/to/file.json
    
    # Upload all files in a directory
    hal-upload -D path/to/directory

    Note: When using -F to upload a single file, the file must be a JSON file.

About HAL

The current landscape of AI agent evaluation faces several critical challenges. Benchmark evaluations tend to focus on accuracy while ignoring costs, leading to uninformative evaluations for downstream developers. What does it mean if an agent has 1% higher accuracy on a benchmark but is 10x more expensive? The lack of standardized evaluation practices makes it difficult to assess real-world capabilities and prevents meaningful comparisons between different approaches. As shown in "AI Agents That Matter" (arXiv:2407.01502), these issues have led to confusion about which advances actually improve performance.

HAL addresses these challenges through two key components: 1) a central leaderboard platform that incorporates cost-controlled evaluations by default, providing clear insight into the cost-performance tradeoffs of different agents, and 2) a standardized evaluation harness that enables reproducible agent evaluations across benchmarks while tracking token usage and agent traces, without requiring changes to agent code or tying developers to a particular agent framework. Evaluations can be run locally or in the cloud and are fully parallelizable.

TLDR: We aim to standardize AI agent evaluations by providing a third-party platform for comparing agents across various benchmarks. Our goal with HAL is to serve as a one-stop shop for agent evaluations, taking into account both accuracy and cost by default. The accompanying HAL harness offers a simple and scalable way to run agent evals - locally or in the cloud.

Repository Structure

  • hal/: Core harness code
    • benchmarks/: Benchmark implementations
      • swebench.py: SWE-bench implementation
      • usaco.py: USACO implementation
      • mlagentbench.py: MLAgentBench implementation
      • appworld.py: AppWorld implementation
      • inspect_benchmark.py: Inspect AI benchmark support
    • utils/: Utility functions
      • local_runner.py: Local execution support
      • vm_runner.py: Azure VM execution support
      • weave_utils.py: Weave logging utilities
    • inspect/: Inspect AI specific code
  • agents/: Example agent implementations
  • results/: Evaluation results and logs
