Testbed folders (microsoft#792)
* Re-added completion logging when using older versions of autogen.

* Extended scenario definitions and templating to include folders.

* Prepare collate_human_eval.py for working with group chat scenarios.

* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.

* Fixed the default termination message.

* Fixed another termination condition.

* Updated compatible autogen versions.

* Fixed a bug in executing the finalize scripts.

* Generalized the template further to support multiple folder copy operations.

* Add tests from AutoGPT.

* Update README.md

* Fix typo

* Update samples/tools/testbed/README.md

---------

Co-authored-by: LeoLjl <[email protected]>
Co-authored-by: Qingyun Wu <[email protected]>
3 people authored Nov 30, 2023
1 parent ae7066b commit 45c2a78
Showing 37 changed files with 1,048 additions and 165 deletions.
134 changes: 87 additions & 47 deletions samples/tools/testbed/README.md
@@ -2,7 +2,7 @@

The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts (see the HumanEval example later in this README). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety.

-This Testbed sample has been tested in, and is known to work with, Autogen versions 0.1.14 and 0.2.0b5
+This Testbed sample has been tested in, and is known to work with, Autogen versions 0.1.14 and 0.2.0

## Setup

@@ -17,11 +17,10 @@ The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ l
## Running the Testbed

To run the Testbed, simply execute
-``python run_scenarios.py``

-The default it to repeat this scenario 10 times. This can be costly. To run each scenario only once, use:
-``python run_scenarios.py --repeat 1``
+``python run_scenarios.py scenarios/Examples``

+The default is to run each scenario one time. To run each scenario 10 times, use:
+``python run_scenarios.py --repeat 10 scenarios/Examples``

The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options:

@@ -58,36 +57,37 @@ By default, the Testbed stores results in a folder hierarchy with the following

For example, consider the following folders:

-``./results/default_two_agents/two_agent_stocks_gpt4/0``
-``./results/default_two_agents/two_agent_stocks_gpt4/1``
+``./results/default_two_agents_gpt35/two_agent_stocks/0``
+``./results/default_two_agents_gpt35/two_agent_stocks/1``

...

-``./results/default_two_agents/two_agent_stocks_gpt4/9``
+``./results/default_two_agents_gpt35/two_agent_stocks/9``

-This folder holds the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. You can think of the _instance_ as mapping to a prompt, or a unique set of parameters, while the _scenario_ defines the template in which those parameters are input.
+This folder holds the results for the ``two_agent_stocks`` instance of the ``default_two_agents_gpt35`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. You can think of the _instance_ as mapping to a prompt, or a unique set of parameters, while the _scenario_ defines the template in which those parameters are input.

Within each folder, you will find the following files:

- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
- *console_log.txt*: all console output produced by Docker when running autogen. Read this like you would a regular console.
-- *chat_completions.json*: a log of all OpenAI ChatCompletions, as logged by ``autogen.ChatCompletion.start_logging(compact=False)``
+- *chat_completions.json*: a log of all OpenAI ChatCompletions, as logged by `autogen.ChatCompletion.start_logging(compact=False)`
- *[agent]_messages.json*: for each Agent, a log of their messages dictionaries
- *./coding*: A directory containing all code written by Autogen, and all artifacts produced by that code.
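
As a quick illustration of this layout (not part of the Testbed itself), the following sketch walks the results tree and reports which runs produced a chat log; the `./results` root and the file names follow the conventions described above:

```
import os

# Walk ./results/[scenario]/[instance_id]/[repetition] and report whether each
# run produced a chat_completions.json log. Illustrative sketch only.
results_root = "./results"
for scenario in sorted(os.listdir(results_root)):
    scenario_dir = os.path.join(results_root, scenario)
    if not os.path.isdir(scenario_dir):
        continue
    for instance in sorted(os.listdir(scenario_dir)):
        instance_dir = os.path.join(scenario_dir, instance)
        if not os.path.isdir(instance_dir):
            continue
        for repetition in sorted(os.listdir(instance_dir)):
            run_dir = os.path.join(instance_dir, repetition)
            log_file = os.path.join(run_dir, "chat_completions.json")
            status = "present" if os.path.isfile(log_file) else "missing"
            print(f"{scenario}/{instance}/{repetition}: chat log {status}")
```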

## Scenario Templating

-All scenarios are stored in JSONL files in the ``./scenarios'' directory. Each line of a scenario file is a JSON object with the following schema:
+All scenarios are stored in JSONL files (in subdirectories under `./scenarios`). Each line of a scenario file is a JSON object. The schema varies slightly based on whether "template" specifies a _file_ or a _directory_.

If "template" points to a _file_, the format is:
```
{
    "id": string,
    "template": filename,
-   "values" {
-       "field_name1": string,
-       "field_name2": string,
+   "substitutions": {
+       "find_string1": replace_string1,
+       "find_string2": replace_string2,
        ...
-       "field_nameN": string
+       "find_stringN": replace_stringN
    }
}
```
@@ -98,48 +98,88 @@ For example:
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents.py",
-   "values": {
+   "substitutions": {
        "\__MODEL\__": "gpt-4",
        "\__PROMPT\__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
    }
}
```

Where the ``id`` is the instance id used when saving results, ``template`` points to a Python file that contains the scenario logic, and ``substitutions`` contains a set of strings to find and replace when expanding the template.
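
The expansion itself is handled by the Testbed (see the algorithm sections below); purely as an illustration of the find-and-replace idea, and not the actual implementation, a scenario line of this form could be expanded roughly as follows:

```
import json

# Rough sketch of file-based template expansion (the Testbed's real logic
# lives in run_scenarios.py). Reads one JSONL line, applies the substitutions
# to the template file, and writes the result out as scenario.py.
line = '{"id": "two_agent_stocks_gpt4", "template": "default_two_agents.py", "substitutions": {"__MODEL__": "gpt-4", "__PROMPT__": "Plot NVDA and TESLA stock price YTD."}}'
scenario = json.loads(line)

with open(scenario["template"], "rt") as fh:
    content = fh.read()

for find_str, replace_str in scenario["substitutions"].items():
    content = content.replace(find_str, replace_str)

with open("scenario.py", "wt") as fh:
    fh.write(content)
```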

-An example templated python file is:
+If "template" points to a _directory_, the format is:

```
-from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
-import os
-import json
-import testbed_utils
-testbed_utils.init()
-##############################
-config_list = config_list_from_json(
-    "OAI_CONFIG_LIST", filter_dict={"model": ["\__MODEL\__"]},
-)
-assistant = AssistantAgent("assistant", llm_config={
-    "request_timeout": 180,
-    "config_list": config_list}
-)
-user_proxy = UserProxyAgent("user_proxy",
-    human_input_mode="NEVER",
-    code_execution_config={
-        "work_dir": "coding",
-        "use_docker": False,
-    },
-    max_consecutive_auto_reply=10)
-user_proxy.initiate_chat(assistant, message="\__PROMPT\__")
-##############################
-testbed_utils.finalize(assistant, user_proxy)
+{
+    "id": string,
+    "template": dirname,
+    "substitutions": {
+        "filename1": {
+            "find_string1_1": replace_string1_1,
+            "find_string1_2": replace_string1_2,
+            ...
+            "find_string1_M": replace_string1_M
+        },
+        "filename2": {
+            "find_string2_1": replace_string2_1,
+            "find_string2_2": replace_string2_2,
+            ...
+            "find_string2_N": replace_string2_N
+        }
+    }
+}
```

For example:

```
{
    "id": "two_agent_stocks_gpt4",
    "template": "default_two_agents",
    "substitutions": {
        "scenario.py": {
            "\__MODEL\__": "gpt-4"
        },
        "prompt.txt": {
            "\__PROMPT\__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
        }
    }
}
```

In this example, the string `__MODEL__` will be replaced in the file `scenario.py`, while the string `__PROMPT__` will be replaced in the `prompt.txt` file.
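
Again as a rough sketch rather than the Testbed's actual code, the per-file substitutions of a directory-based template could be applied like this (the destination path is hypothetical, and the template directory is assumed to have already been copied into it):

```
import json
import os

# Apply per-file substitutions for a directory-based template that has already
# been copied into dest_dir. Illustrative sketch only; dest_dir is hypothetical.
dest_dir = "./dest_folder"
substitutions = json.loads('''{
    "scenario.py": {"__MODEL__": "gpt-4"},
    "prompt.txt": {"__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."}
}''')

for filename, replacements in substitutions.items():
    path = os.path.join(dest_dir, filename)
    with open(path, "rt") as fh:
        content = fh.read()
    for find_str, replace_str in replacements.items():
        content = content.replace(find_str, replace_str)
    with open(path, "wt") as fh:
        fh.write(content)
```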


## Scenario Expansion Algorithm

When the Testbed runs a scenario, it creates a local folder to share with Docker. As noted above, each instance and repetition gets its own folder along the path: ``./results/[scenario]/[instance_id]/[repetition]``

For the sake of brevity we will refer to this folder as the `DEST_FOLDER`.

The algorithm for populating the `DEST_FOLDER` is as follows:

1. Recursively copy the contents of `./includes` to DEST_FOLDER. This folder contains all the basic starter files for running a scenario, including an ENV file which will set the Docker environment variables.
2. Append the OAI_CONFIG_LIST to the ENV file so that autogen may access these secrets.
3. Recursively copy the scenario folder (if `template` in the JSON scenario definition points to a folder) to DEST_FOLDER. If the `template` instead points to a file, copy the file, but rename it to `scenario.py`.
4. Apply any templating, as outlined in the prior section.
5. Write a run.sh file to DEST_FOLDER that will be executed by Docker when it is loaded.
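
The authoritative implementation of these steps lives in run_scenarios.py; the following is only a simplified sketch of the same sequence, in which the function name, the ENV file format, and the omitted details are assumptions made for illustration:

```
import os
import shutil

# Simplified sketch of the expansion steps above. The helper name and the ENV
# line format are assumptions; the real logic is in run_scenarios.py.
def expand_scenario(scenario, dest_folder, oai_config_list):
    # 1. Copy the basic starter files, including the ENV file.
    shutil.copytree("./includes", dest_folder, dirs_exist_ok=True)

    # 2. Append the OAI_CONFIG_LIST to the ENV file (exact format assumed here).
    with open(os.path.join(dest_folder, "ENV"), "at") as fh:
        fh.write(f"export OAI_CONFIG_LIST='{oai_config_list}'\n")

    # 3. Copy the template folder, or copy a single template file as scenario.py.
    template = scenario["template"]
    if os.path.isdir(template):
        shutil.copytree(template, dest_folder, dirs_exist_ok=True)
    else:
        shutil.copyfile(template, os.path.join(dest_folder, "scenario.py"))

    # 4. Apply the templating described in the previous section (not shown).
    # 5. Write the run.sh that Docker will execute when loaded (not shown).
```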


## Scenario Execution Algorithm

Once the scenario has been expanded, it is run (via run.sh). This script will execute the following steps:

1. Read and set the ENV environment variables
2. If a file named `global_init.sh` is present, run it.
3. If a file named `scenario_init.sh` is present, run it.
4. Install the requirements file (if running in Docker)
5. Run the Autogen scenario via `python scenario.py`
6. Clean up (delete cache, etc.)
7. If a file named `scenario_finalize.sh` is present, run it.
8. If a file named `global_finalize.sh` is present, run it.
9. echo "SCENARIO COMPLETE !#!#", signaling that all steps completed.

Notably, this means that scenarios can add custom init and teardown logic by including `scenario_init.sh` and `scenario_finalize.sh` files.
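
The real sequencing is performed by the generated run.sh inside the container; the Python-flavoured sketch below only illustrates the hook ordering (reading the ENV file, installing requirements, and cleanup are omitted):

```
import os
import subprocess

# Illustrates the ordering of the init/finalize hooks around the scenario run.
# The actual steps are carried out by run.sh, not by Python code like this.
def run_if_present(script):
    if os.path.isfile(script):
        subprocess.run(["bash", script], check=True)

run_if_present("global_init.sh")
run_if_present("scenario_init.sh")
subprocess.run(["python", "scenario.py"], check=True)
run_if_present("scenario_finalize.sh")
run_if_present("global_finalize.sh")
```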


## (Example) Running HumanEval

@@ -149,7 +189,7 @@ Accessing this scenario-type requires downloading and converting the HumanEval d

```
python utils/download_humaneval.py
-python ./run_scenarios.py --repeat 3 scenarios/human_eval_two_agents_gpt35.jsonl
+python ./run_scenarios.py scenarios/HumanEval/human_eval_two_agents_gpt35.jsonl
python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv
cat human_eval_results_gpt35.csv
```
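
The collation and metrics scripts above do the real analysis; as a purely illustrative sanity check (not part of the Testbed tooling), one can first confirm how many repetitions ran to completion by looking for the completion marker echoed by run.sh in each run's console log:

```
import glob

# Count runs whose console_log.txt contains the completion marker.
complete, total = 0, 0
for log_path in glob.glob("./results/human_eval_two_agents_gpt35/*/*/console_log.txt"):
    total += 1
    with open(log_path, "rt", errors="ignore") as fh:
        if "SCENARIO COMPLETE !#!#" in fh.read():
            complete += 1
print(f"{complete} of {total} runs completed")
```
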
1 change: 1 addition & 0 deletions samples/tools/testbed/includes/global_finalize.sh
@@ -0,0 +1 @@
# Global finalize.
1 change: 1 addition & 0 deletions samples/tools/testbed/includes/global_init.sh
@@ -0,0 +1 @@
echo AUTOGEN_TESTBED_SETTING: [$AUTOGEN_TESTBED_SETTING]
2 changes: 1 addition & 1 deletion samples/tools/testbed/includes/requirements.txt
@@ -1 +1 @@
-pyautogen
+git+https://github.com/microsoft/autogen.git