MazeBench

Overview

MazeBench is a benchmarking framework for evaluating the spatial reasoning and navigation capabilities of Large Language Models (LLMs). It assesses LLMs' ability to generate valid, step-by-step solutions to maze-solving tasks of varying complexity.

Features

Multiple Model Backends: Supports HuggingFace Transformers, vLLM, and OpenAI-compatible APIs
Batch Processing: Evaluates prompts in batches; the batch size is set with --batch-size
Detailed Metrics: Reports overall accuracy and per-level accuracy
Extensible Design: New model backends and evaluation metrics can be added
Robust Evaluation: Validates each model solution against the maze's walls, origin, and target

Installation

Prerequisites

Python 3.8+
PyTorch 1.10+
HuggingFace Transformers (for HuggingFace models)
vLLM (for the vLLM backend)

Setup

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Command Line Interface

MazeBench is run from the command line through main.py:

```bash
python main.py \
  --model-type <vllm|hf|openai> \
  --model-name <model_name> \
  --batch-size <batch_size>
```

Examples

Using the vLLM backend:

```bash
python main.py \
  --model-type vllm \
  --model-name "homebrewltd/AlphaMaze-v0.2-1.5B" \
  --temperature 0.6 \
  --batch-size 8 \
  --tensor-parallel-size 4
```

Using the HuggingFace backend:

```bash
python main.py \
  --model-type hf \
  --model-name "homebrewltd/AlphaMaze-v0.2-1.5B" \
  --temperature 0.6 \
  --batch-size 4
```

Using an OpenAI-compatible API:

```bash
python main.py \
  --model-type openai \
  --model-name "gpt-4o" \
  --api-key "your-api-key" \
  --api-base "https://api.openai.com/v1"
```

Python API

You can also use MazeBench programmatically:

```python
from evaluator import MazeBenchEvaluator
from models.vllm_model import VLLMModel
# or from models.hf_model import HuggingFaceModel
# or from models.openai_model import OpenAIModel

# Initialize model
model = VLLMModel(
    model_name="meta-llama/Llama-2-70b-hf",
    batch_size=8
)

# Run evaluation
evaluator = MazeBenchEvaluator(model)
results = evaluator.evaluate()

# Save results
evaluator.save_results("results")
```

Dataset

MazeBench uses the homebrewltd/Maze-Bench-v0.2 dataset, hosted on the HuggingFace Hub, which contains maze navigation tasks at several difficulty levels.

The dataset includes:

Mazes of different sizes and complexities
Wall configurations that must be navigated around
Origin and target positions
Multiple difficulty levels

Maze Format

Mazes are represented as a sequence of tokens, three per cell:

```
<|row-col|><|wall_configuration|><|marker|>
```

Where:

row-col: Grid position of the cell (e.g., 0-0)
wall_configuration: Which sides of the cell have walls (e.g., up_left_wall)
marker: Role of the cell (e.g., origin, target, blank)
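
To make the token layout concrete, here is a minimal parsing sketch. It is not taken from MazeBench: `Cell`, `parse_maze`, and `CELL_RE` are hypothetical names. It assumes each cell appears as exactly three consecutive tokens in the order described above, and that a wall token lists the blocked directions joined by underscores and ends in `wall` (as in `up_left_wall`). The sample maze string is made up for illustration.

```python
import re
from typing import Dict, FrozenSet, NamedTuple, Tuple

DIRECTIONS = ("up", "down", "left", "right")

# One cell = position token, wall token, marker token (assumed order).
CELL_RE = re.compile(r"<\|(\d+)-(\d+)\|><\|([a-z_]+)\|><\|([a-z]+)\|>")


class Cell(NamedTuple):
    walls: FrozenSet[str]  # directions blocked by a wall
    marker: str            # e.g. "origin", "target", "blank"


def parse_maze(text: str) -> Dict[Tuple[int, int], Cell]:
    """Map (row, col) to the walls and marker of that cell."""
    cells = {}
    for row, col, wall_cfg, marker in CELL_RE.findall(text):
        # Assumption: direction words inside the wall token name the walls.
        walls = frozenset(p for p in wall_cfg.split("_") if p in DIRECTIONS)
        cells[(int(row), int(col))] = Cell(walls, marker)
    return cells


# Made-up two-cell maze, for illustration only.
maze_text = "<|0-0|><|up_left_wall|><|origin|><|0-1|><|up_right_wall|><|target|>"
cells = parse_maze(maze_text)
print(sorted(cells[(0, 0)].walls))  # ['left', 'up']
print(cells[(0, 1)].marker)         # target
```

A dictionary keyed by (row, col) keeps lookups simple when a solution is later walked cell by cell.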

Solution Format

A solution is a sequence of directional move tokens, each one of:

```
<|up|> <|down|> <|left|> <|right|>
```

The solution is correct if it navigates from the origin to the target without crossing walls.
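
The correctness rule can be sketched as a small checker. This is an illustration under stated assumptions, not the logic in evaluator.py: `is_valid_solution` and `parse_moves` are hypothetical names, rows are assumed to grow downward (so `up` means row - 1), each cell is assumed to list its own blocked sides, and the path is assumed to have to end exactly on the target.

```python
import re
from typing import Dict, FrozenSet, List, Tuple

Pos = Tuple[int, int]

# Assumption: row index grows downward, column index grows to the right.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
MOVE_RE = re.compile(r"<\|(up|down|left|right)\|>")


def parse_moves(solution: str) -> List[str]:
    """Extract move names from a string of <|direction|> tokens."""
    return MOVE_RE.findall(solution)


def is_valid_solution(
    walls: Dict[Pos, FrozenSet[str]], origin: Pos, target: Pos, solution: str
) -> bool:
    """Walk the moves from origin; fail on a wall or on leaving the grid."""
    pos = origin
    for move in parse_moves(solution):
        if move in walls[pos]:  # a wall blocks this side of the current cell
            return False
        d_row, d_col = MOVES[move]
        nxt = (pos[0] + d_row, pos[1] + d_col)
        if nxt not in walls:    # stepped outside the grid
            return False
        pos = nxt
    return pos == target        # assumed: the path must end on the target


# Made-up 2x2 maze with a wall between (0, 1) and (1, 1).
walls = {
    (0, 0): frozenset({"up", "left"}),
    (0, 1): frozenset({"up", "right", "down"}),
    (1, 0): frozenset({"down", "left"}),
    (1, 1): frozenset({"down", "right", "up"}),
}
print(is_valid_solution(walls, (0, 0), (1, 1), "<|down|> <|right|>"))  # True
print(is_valid_solution(walls, (0, 0), (1, 1), "<|right|> <|down|>"))  # False
```

The second path fails because the move down from (0, 1) crosses a wall, even though both paths have the same length.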

Output Format

The evaluation results are saved as a JSON file with the following structure:

```json
{
  "model_name": "model-name",
  "timestamp": "2025-02-25T12:00:00.000000",
  "overall_accuracy": 75.0,
  "level_accuracies": {
    "1": {
      "accuracy": 90.0,
      "correct": 9,
      "total": 10
    },
    "2": {
      "accuracy": 60.0,
      "correct": 6,
      "total": 10
    }
  },
  "detailed_results": [
    {
      "level": "1",
      "prompt": "...",
      "solution": "...",
      "is_correct": true
    }
  ]
}
```
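
As a rough illustration of how the summary fields relate to `detailed_results`, the sketch below recomputes them from the per-item records. It assumes accuracies are percentages of correct items, which matches the sample above (15 of 20 correct gives 75.0). `level_accuracies` and `overall_accuracy` are hypothetical helper names, and the inline JSON is a made-up stand-in for a saved results file.

```python
import json
from collections import defaultdict


def level_accuracies(results: dict) -> dict:
    """Recompute per-level counts and percentage accuracy."""
    counts = defaultdict(lambda: {"correct": 0, "total": 0})
    for item in results["detailed_results"]:
        entry = counts[item["level"]]
        entry["total"] += 1
        entry["correct"] += int(item["is_correct"])
    return {
        level: {"accuracy": 100.0 * c["correct"] / c["total"], **c}
        for level, c in counts.items()
    }


def overall_accuracy(results: dict) -> float:
    """Percentage of all items marked correct (assumed definition)."""
    items = results["detailed_results"]
    return 100.0 * sum(int(i["is_correct"]) for i in items) / len(items)


# Made-up records in the documented shape; a real file would be read with
# json.load() from wherever save_results() wrote it.
raw = """
{"detailed_results": [
  {"level": "1", "prompt": "...", "solution": "...", "is_correct": true},
  {"level": "1", "prompt": "...", "solution": "...", "is_correct": false},
  {"level": "2", "prompt": "...", "solution": "...", "is_correct": true}
]}
"""
results = json.loads(raw)
print(level_accuracies(results)["1"])  # {'accuracy': 50.0, 'correct': 1, 'total': 2}
print(level_accuracies(results)["2"]["accuracy"])  # 100.0
```

Recomputing the summary this way is a quick consistency check on a results file before comparing models.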

License

This project is licensed under the MIT License - see the LICENSE file for details.
