This directory implements integration tests that were previously run in CI.
PR 3985 introduced LLM-based editing, which requires access to an LLM to perform edits. Hence, we removed the integration tests from CI and intend to run them as a nightly evaluation to ensure the quality of the OpenHands software.
Each test is a file named like `tXX_testname.py`, where `XX` is a number. Make sure each test's file name starts with `t` and ends with `.py`.
Each test should be structured as a subclass of `BaseIntegrationTest`, where you need to implement `initialize_runtime`, which sets up the runtime environment before the test, and `verify_result`, which takes a `Runtime` and the history of `Event`s and returns a `TestResult`. See `t01_fix_simple_typo.py` and `t05_simple_browsing.py` for two representative examples; a minimal sketch also follows the base class definition below.
```python
from abc import ABC, abstractmethod

from pydantic import BaseModel

from openhands.events.event import Event
from openhands.runtime.base import Runtime


class TestResult(BaseModel):
    success: bool
    reason: str | None = None


class BaseIntegrationTest(ABC):
    """Base class for integration tests."""

    INSTRUCTION: str

    @classmethod
    @abstractmethod
    def initialize_runtime(cls, runtime: Runtime) -> None:
        """Initialize the runtime for the test to run."""
        pass

    @classmethod
    @abstractmethod
    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
        """Verify the result of the test.

        This method will be called after the agent performs the task on the runtime.
        """
        pass
```
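For illustration, a new test (say, a hypothetical `t99_create_file.py`) could be sketched as follows. The `runtime.run_action`/`CmdRunAction` pattern mirrors existing tests such as `t01_fix_simple_typo.py`, but the file name, instruction, paths, and commands here are assumptions made for the example, not part of the test suite:

```python
from evaluation.integration_tests.tests.base import BaseIntegrationTest, TestResult
from openhands.events.action import CmdRunAction
from openhands.events.event import Event
from openhands.runtime.base import Runtime


class Test(BaseIntegrationTest):
    """Hypothetical test: the agent should create /workspace/hello.txt."""

    INSTRUCTION = 'Create a file named hello.txt in /workspace containing the word "hello".'

    @classmethod
    def initialize_runtime(cls, runtime: Runtime) -> None:
        # Prepare a clean working directory before the agent starts.
        runtime.run_action(CmdRunAction(command='mkdir -p /workspace'))

    @classmethod
    def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
        # Inspect the runtime after the agent has finished the task.
        obs = runtime.run_action(CmdRunAction(command='cat /workspace/hello.txt'))
        if 'hello' in obs.content:
            return TestResult(success=True)
        return TestResult(success=False, reason=f'hello.txt missing or wrong: {obs.content}')
```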
Please follow the instructions here to set up your local development environment and LLM.
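For reference, an LLM config group in `config.toml` might look like the sketch below; the group name and values are placeholders, not a required configuration:

```toml
# Hypothetical config group; pass its name (e.g. "llm.claude-35-sonnet-eval")
# as [model_config] to the run script below.
[llm.claude-35-sonnet-eval]
model = "anthropic/claude-3-5-sonnet-20241022"
api_key = "your-api-key"
temperature = 0.0
```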
Run the evaluation with:

```bash
./evaluation/integration_tests/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [eval-num-workers] [eval_ids]
```

where:
- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.9.0`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates the entire Exercism test set (133 issues). Note: in order to use `eval_limit`, you must also set `agent`.
- `eval-num-workers`: the number of workers to use for evaluation. Default: `1`.
- `eval_ids`, e.g. `"1,3,10"`, limits the evaluation to instances with the given IDs (comma separated).
Example:

```bash
./evaluation/integration_tests/scripts/run_infer.sh llm.claude-35-sonnet-eval HEAD CodeActAgent
```
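To restrict a run, the remaining positional arguments can be supplied in the order shown above; the values below are purely illustrative:

```bash
# Hypothetical: cap the run at 10 instances, use 2 workers, and only run instance IDs 1 and 3.
./evaluation/integration_tests/scripts/run_infer.sh llm.claude-35-sonnet-eval HEAD CodeActAgent 10 2 "1,3"
```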