agent doc
vyokky committed Jul 1, 2024
1 parent ef519f1 commit 0f58762
Showing 17 changed files with 686 additions and 8 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -28,7 +28,7 @@
- <b>AppAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. It's essential that the targeted controls are compatible with the Windows **UI Automation** or **Win32** API.

- Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939).
+ Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939) and [Documentation](https://microsoft.github.io/UFO/).
<h1 align="center">
<img src="./assets/framework_v2.png"/>
</h1>
151 changes: 151 additions & 0 deletions documents/docs/agents/app_agent.md
@@ -0,0 +1,151 @@
# AppAgent 👾

An `AppAgent` is responsible for iteratively executing actions on the selected application until the task is successfully concluded within that application. An `AppAgent` is created by the `HostAgent` to fulfill a sub-task within a `Round`, and executes the necessary actions within the application to fulfill the user's request. The `AppAgent` has the following features:

1. **[ReAct](https://arxiv.org/abs/2210.03629) with the Application** - The `AppAgent` iteratively interacts with the application in a workflow of observation->thought->action, leveraging the multi-modal capabilities of Visual Language Models (VLMs) to comprehend the application UI and fulfill the user's request.
2. **Comprehension Enhancement** - The `AppAgent` is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including external knowledge bases and demonstration libraries, making the agent an application "expert".
3. **Versatile Skill Set** - The `AppAgent` is equipped with a diverse set of skills to support comprehensive automation, such as mouse, keyboard, native APIs, and "Copilot".

We show the framework of the `AppAgent` in the following diagram:

<h1 align="center">
<img src="../../img/appagent.png" alt="AppAgent Image" width="80%">
</h1>

## AppAgent Input

To interact with the application, the `AppAgent` receives the following inputs:

| Input | Description | Type |
| --- | --- | --- |
| User Request | The user's request in natural language. | String |
| Sub-Task | The sub-task description to be executed by the `AppAgent`, assigned by the `HostAgent`. | String |
| Current Application | The name of the application to be interacted with. | String |
| Control Information | Index, name and control type of available controls in the application. | List of Dictionaries |
| Application Screenshots | Screenshots of the application to provide context to the `AppAgent`. | Image |
| Previous Sub-Tasks | The previous sub-tasks and their completion status. | List of Strings |
| Previous Plan | The previous plan for the following steps. | List of Strings |
| HostAgent Message | The message from the `HostAgent` for the completion of the sub-task. | String |
| Retrieved Information | The retrieved information from external knowledge bases or demonstration libraries. | String |
| Blackboard | The shared memory space for storing and sharing information among the agents. | Dictionary |

By processing these inputs, the `AppAgent` determines the necessary actions to fulfill the user's request within the application.

## AppAgent Output

With the inputs provided, the `AppAgent` generates the following outputs:

| Output | Description | Type |
| --- | --- | --- |
| Observation | The observation of the current application screenshots. | String |
| Thought | The logical reasoning process of the `AppAgent`. | String |
| ControlLabel | The index of the selected control to interact with. | String |
| ControlText | The name of the selected control to interact with. | String |
| Function | The function to be executed on the selected control. | String |
| Args | The arguments required for the function execution. | List of Strings |
| Status | The status of the agent, mapped to the `AgentState`. | String |
| Plan | The plan for the following steps after the current action. | List of Strings |
| Comment | Additional comments or information provided to the user. | String |
| SaveScreenshot | The flag to save the screenshot of the application to the `blackboard` for future reference. | Boolean |

Below is an example of the `AppAgent` output:

```json
{
"Observation": "Application screenshot",
"Thought": "Logical reasoning process",
"ControlLabel": "Control index",
"ControlText": "Control name",
"Function": "Function name",
"Args": ["arg1", "arg2"],
"Status": "AgentState",
"Plan": ["Step 1", "Step 2"],
"Comment": "Additional comments",
"SaveScreenshot": true
}
```

!!! info
The `AppAgent` output is formatted as a JSON object by LLMs and can be parsed by the `json.loads` method in Python.
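As noted, the response text can be parsed with `json.loads`. A minimal sketch of that step; the field names follow the example above, while the field values here are purely illustrative:

```python
import json

# Example response text produced by the LLM (field values are illustrative).
response_text = """
{
    "Observation": "Application screenshot",
    "Thought": "Logical reasoning process",
    "ControlLabel": "36",
    "ControlText": "Save button",
    "Function": "click_input",
    "Args": ["left", "single"],
    "Status": "CONTINUE",
    "Plan": ["Step 1", "Step 2"],
    "Comment": "Additional comments",
    "SaveScreenshot": true
}
"""

# Parse the JSON string into a Python dictionary; JSON true becomes Python True.
output = json.loads(response_text)
print(output["Function"], output["Args"])  # → click_input ['left', 'single']
```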


## AppAgent State
The `AppAgent` state is managed by a state machine that determines the next action to be executed based on the current state, as defined in the `ufo/agents/states/app_agent_states.py` module. The states include:

| State | Description |
| --- | --- |
| `CONTINUE` | The `AppAgent` continues executing the current action. |
| `FINISH` | The `AppAgent` has completed the current sub-task. |
| `ERROR` | The `AppAgent` encountered an error during execution. |
| `FAIL` | The `AppAgent` believes the current sub-task is unachievable. |
| `PENDING` | The `AppAgent` is waiting for user input or external information to proceed. |
| `CONFIRM` | The `AppAgent` is awaiting user confirmation before executing the current action. |
| `SCREENSHOT` | The `AppAgent` believes the current screenshot does not clearly annotate the controls and requests a new screenshot. |

The `AppAgent` progresses through these states to execute the necessary actions within the application and fulfill the sub-task assigned by the `HostAgent`.
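The states in the table above can be pictured as a small state machine. The following is an illustrative sketch only, not the actual `ufo/agents/states/app_agent_states.py` implementation; the grouping of terminal states is an assumption:

```python
from enum import Enum


class AgentState(Enum):
    """The AppAgent states listed in the table above."""
    CONTINUE = "CONTINUE"
    FINISH = "FINISH"
    ERROR = "ERROR"
    FAIL = "FAIL"
    PENDING = "PENDING"
    CONFIRM = "CONFIRM"
    SCREENSHOT = "SCREENSHOT"


def is_round_finished(state: AgentState) -> bool:
    """A sub-task round ends when the agent finishes, fails, or errors out
    (an assumed grouping for illustration)."""
    return state in {AgentState.FINISH, AgentState.FAIL, AgentState.ERROR}


print(is_round_finished(AgentState.FINISH))  # → True
```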


## Knowledge Enhancement
The `AppAgent` is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including external knowledge bases and demonstration libraries. The `AppAgent` leverages this knowledge to enhance its comprehension of the application and learn from demonstrations to improve its performance.

### Learning from Help Documents
Users can provide help documents to the `AppAgent`, configured in the `config.yaml` file, to enhance its comprehension of the application and improve its performance.

!!! tip
    Please find the detailed configuration in the [documentation](../configurations/user_configuration.md).
!!! tip
    You may also refer to [here]() for how to provide help documents to the `AppAgent`.


The `AppAgent` calls `build_offline_docs_retriever` to build a help document retriever, and uses `retrived_documents_prompt_helper` to construct the prompt for the `AppAgent`.
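Conceptually, the retrieval step selects the document snippets most relevant to the current request and feeds them into the prompt. A toy sketch of that idea using keyword-overlap scoring; the real retriever differs, and this function name is hypothetical:

```python
def retrieve_documents(request: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the request (toy scoring,
    standing in for a real embedding-based retriever)."""
    request_words = set(request.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(request_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


docs = [
    "How to insert a table in Word",
    "How to change the slide theme in PowerPoint",
    "How to add a chart to an Excel sheet",
]
print(retrieve_documents("insert a table into my Word document", docs, top_k=1))
```

The retrieved snippets would then be concatenated into the user prompt before the LLM call.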



### Learning from Bing Search
Since help documents may not cover all the information or the information may be outdated, the `AppAgent` can also leverage Bing search to retrieve the latest information. You can activate Bing search and configure the search engine in the `config.yaml` file.

!!! tip
    Please find the detailed configuration in the [documentation](../configurations/user_configuration.md).
!!! tip
    You may also refer to [here]() for the implementation of Bing search in the `AppAgent`.

The `AppAgent` calls `build_online_search_retriever` to build a Bing search retriever, and uses `retrived_documents_prompt_helper` to construct the prompt for the `AppAgent`.


### Learning from Self-Demonstrations
The `AppAgent` can learn from self-demonstrations by saving successful action trajectories. After a `session` completes, the `AppAgent` asks the user whether to save the action trajectories for future reference. You may configure the use of self-demonstrations in the `config.yaml` file.

!!! tip
    You can find details of the configuration in the [documentation](../configurations/user_configuration.md).

!!! tip
    You may also refer to [here]() for the implementation of self-demonstrations in the `AppAgent`.

The `AppAgent` calls `build_experience_retriever` to build a self-demonstration retriever, and uses `rag_experience_retrieve` to retrieve the demonstrations.

### Learning from Human Demonstrations
In addition to self-demonstrations, you can provide human demonstrations to the `AppAgent` using the [Steps Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) tool built into Windows. The `AppAgent` learns from these human demonstrations to improve its performance and achieve better personalization. The use of human demonstrations can be configured in the `config.yaml` file.

!!! tip
    You can find details of the configuration in the [documentation](../configurations/user_configuration.md).
!!! tip
    You may also refer to [here]() for the implementation of human demonstrations in the `AppAgent`.

The `AppAgent` calls `build_human_demonstration_retriever` to build a human demonstration retriever, and uses `rag_experience_retrieve` to retrieve the demonstrations.


## Skill Set for Automation
The `AppAgent` is equipped with a versatile skill set to support comprehensive automation within the application by calling the `create_puppteer_interface` method. The skills include:

| Skill | Description |
| --- | --- |
| UI Automation | Mimicking user interactions with the application UI controls using the `UI Automation` and `Win32` APIs. |
| Native API | Accessing the application's native API to execute specific functions and actions. |
| In-App Agent | Leveraging the in-app agent to interact with the application's internal functions and features. |

By utilizing these skills, the `AppAgent` can efficiently interact with the application and fulfill the user's request. You can find more details in the [Automator](../automator/overview.md) documentation and the code in the `ufo/automator` module.
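The `Function` and `Args` fields of the agent's output can then be dispatched to the matching skill. A hedged sketch of that dispatch pattern; the registry, decorator, and skill names here are illustrative, not the actual `ufo/automator` API:

```python
from typing import Callable

# Illustrative skill registry mapping function names to callables.
SKILLS: dict[str, Callable[..., str]] = {}


def register_skill(name: str):
    """Decorator registering a skill under the given name."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        SKILLS[name] = func
        return func
    return decorator


@register_skill("click_input")
def click_input(button: str, click_type: str) -> str:
    """A stand-in for a real mouse-click skill."""
    return f"clicked {button} ({click_type})"


def execute(function: str, args: list[str]) -> str:
    """Dispatch an LLM-selected function name to the registered skill."""
    if function not in SKILLS:
        raise ValueError(f"Unknown skill: {function}")
    return SKILLS[function](*args)


print(execute("click_input", ["left", "single"]))  # → clicked left (single)
```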


## Reference

:::agents.agent.app_agent.AppAgent
55 changes: 55 additions & 0 deletions documents/docs/agents/design/blackboard.md
@@ -0,0 +1,55 @@
# Agent Blackboard

The `Blackboard` is a shared memory space that is visible to all agents in the UFO framework. It stores information required for agents to interact with the user and applications at every step. The `Blackboard` is a key component of the UFO framework, enabling agents to share information and collaborate to fulfill user requests. The `Blackboard` is implemented as a class in the `ufo/agents/memory/blackboard.py` file.

## Components

The `Blackboard` consists of the following data components:

| Component | Description |
| --- | --- |
| `questions` | A list of questions that UFO asks the user, along with their corresponding answers. |
| `requests` | A list of historical user requests received in previous `Rounds`. |
| `trajectories` | A list of step-wise trajectories that record the agent's actions and decisions at each step. |
| `screenshots` | A list of screenshots taken by the agent when it believes the current state is important for future reference. |

!!! tip
The keys stored in the `trajectories` are configured as `HISTORY_KEYS` in the `config_dev.yaml` file. You can customize the keys based on your requirements and the agent's logic.

!!! tip
Whether to save the screenshots is determined by the `AppAgent`. You can enable or disable screenshot capture by setting the `SCREENSHOT_TO_MEMORY` flag in the `config_dev.yaml` file.

## Blackboard to Prompt

Data in the `Blackboard` is stored as `MemoryItem` instances. The `Blackboard` provides a `blackboard_to_prompt` method that converts the stored information into a string prompt; agents call this method to construct the prompt for the LLM's inference. The `blackboard_to_prompt` method is defined as follows:

```python
def blackboard_to_prompt(self) -> List[str]:
    """
    Convert the blackboard to a prompt.
    :return: The prompt.
    """
    prefix = [
        {
            "type": "text",
            "text": "[Blackboard:]",
        }
    ]

    blackboard_prompt = (
        prefix
        + self.texts_to_prompt(self.questions, "[Questions & Answers:]")
        + self.texts_to_prompt(self.requests, "[Request History:]")
        + self.texts_to_prompt(self.trajectories, "[Step Trajectories:]")
        + self.screenshots_to_prompt()
    )

    return blackboard_prompt
```
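The `texts_to_prompt` helper used above can be pictured as wrapping a list of texts under a section title. The following is a toy sketch under that assumption, not the actual implementation:

```python
def texts_to_prompt(texts: list[str], title: str) -> list[dict[str, str]]:
    """Wrap a list of texts under a section title, as chat message fragments
    (an assumed shape for illustration)."""
    return [{"type": "text", "text": title}] + [
        {"type": "text", "text": text} for text in texts
    ]


fragments = texts_to_prompt(["What file format?", "PDF"], "[Questions & Answers:]")
print(len(fragments))  # → 3
```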

## Reference

:::agents.memory.blackboard.Blackboard

!!!note
You can customize the class to tailor the `Blackboard` to your requirements.
24 changes: 24 additions & 0 deletions documents/docs/agents/design/memory.md
@@ -0,0 +1,24 @@
# Agent Memory

The `Memory` manages the memory of the agent and stores the information required for the agent to interact with the user and applications at every step. Part of the information in the `Memory` is visible to the agent for decision-making.


## MemoryItem
A `MemoryItem` is a `dataclass` that represents a single step in the agent's memory. The fields of a `MemoryItem` are flexible and can be customized based on the requirements of the agent. The `MemoryItem` class is defined as follows:

::: agents.memory.memory.MemoryItem

!!!info
At each step, an instance of `MemoryItem` is created and stored in the `Memory` to record the information of the agent's interaction with the user and applications.


## Memory
The `Memory` class is responsible for managing the memory of the agent. It stores a list of `MemoryItem` instances that represent the agent's memory at each step. The `Memory` class is defined as follows:

::: agents.memory.memory.Memory

!!!info
Each agent has its own `Memory` instance to store its information.

!!!info
Not all information in the `Memory` is provided to the agent for decision-making. The agent can access parts of the memory based on the requirements of its logic.
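The relationship between the two classes can be sketched as follows. This is a minimal illustration; the actual classes in `ufo/agents/memory` differ in detail, and the fields shown here are assumptions:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MemoryItem:
    """One step of the agent's memory; fields are flexible (assumed here)."""
    step: int
    observation: str = ""
    extra: dict[str, Any] = field(default_factory=dict)


@dataclass
class Memory:
    """An ordered list of MemoryItem instances, one per step."""
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, item: MemoryItem) -> None:
        self.items.append(item)

    @property
    def length(self) -> int:
        return len(self.items)


memory = Memory()
memory.add(MemoryItem(step=0, observation="App launched"))
print(memory.length)  # → 1
```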
29 changes: 29 additions & 0 deletions documents/docs/agents/design/processor.md
@@ -0,0 +1,29 @@
# Agents Processor

The `Processor` is a key component of an agent, implementing the core logic that processes the user's request. The `Processor` is implemented as a class in the `ufo/agents/processors` folder, and each agent has its own `Processor` class within the folder.

## Core Process
Once called, an agent processes the user's request by calling the `process` method defined in its `Processor` class. The workflow of `process` is as follows:

| Step | Description | Function |
| --- | --- | --- |
| 1 | Print the step information. | `print_step_info` |
| 2 | Capture the screenshot of the application. | `capture_screenshot` |
| 3 | Get the control information of the application. | `get_control_info` |
| 4 | Get the prompt message for the LLM. | `get_prompt_message` |
| 5 | Generate the response from the LLM. | `get_response` |
| 6 | Update the cost of the step. | `update_cost` |
| 7 | Parse the response from the LLM. | `parse_response` |
| 8 | Execute the action based on the response. | `execute_action` |
| 9 | Update the memory and blackboard. | `update_memory` |
| 10 | Update the status of the agent. | `update_status` |
| 11 | Update the step information. | `update_step` |

At each step, the `Processor` processes the user's request by invoking the corresponding method sequentially to execute the necessary actions.


The process may be paused and later resumed via the `resume` method, based on the agent's logic and the user's request.
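The step table above can be sketched as a sequential pipeline. This is an illustrative skeleton only, not the real `BaseProcessor`; a real processor would call the named method at each step instead of logging it:

```python
class ProcessorSketch:
    """Illustrative processor running the documented steps in order."""

    def __init__(self) -> None:
        self.log: list[str] = []
        # The eleven steps from the table above, in order.
        self.steps = [
            "print_step_info", "capture_screenshot", "get_control_info",
            "get_prompt_message", "get_response", "update_cost",
            "parse_response", "execute_action", "update_memory",
            "update_status", "update_step",
        ]

    def process(self) -> list[str]:
        for step in self.steps:
            self.log.append(step)  # A real processor would invoke the method here.
        return self.log


print(ProcessorSketch().process()[0])  # → print_step_info
```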

## Reference
Below is the basic structure of the `Processor` class:
:::agents.processors.basic.BaseProcessor
47 changes: 47 additions & 0 deletions documents/docs/agents/design/prompter.md
@@ -0,0 +1,47 @@
# Agent Prompter

The `Prompter` is a key component of the UFO framework, responsible for constructing prompts for the LLM to generate responses. The `Prompter` is implemented in the `ufo/prompts` folder. Each agent has its own `Prompter` class that defines the structure of the prompt and the information to be fed to the LLM.

## Components

A prompt fed to the LLM is usually a list of dictionaries, where each dictionary contains the following keys:

| Key | Description |
| --- | --- |
| `role` | The role of the text in the prompt, can be `system`, `user`, or `assistant`. |
| `content` | The content of the text for the specific role. |

!!!tip
You may find the [official documentation](https://help.openai.com/en/articles/7042661-moving-from-completions-to-chat-completions-in-the-openai-api) helpful for constructing the prompt.

In the `__init__` method of the `Prompter` class, you can define the template of the prompt for each component, and the final prompt message is constructed by combining the templates of each component using the `prompt_construction` method.
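As a hedged sketch, the combination step might look like the following; the `prompt_construction` signature here is an assumption for illustration, not the actual API:

```python
def prompt_construction(system_prompt: str, user_contents: list[str]) -> list[dict[str, str]]:
    """Combine system and user components into a chat message list
    using the role/content keys described above."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": "user", "content": content} for content in user_contents]
    return messages


prompt = prompt_construction(
    "You are an AppAgent operating a Windows application.",
    ["[Observation:] the Save dialog is open", "[Blackboard:] ..."],
)
print(prompt[0]["role"])  # → system
```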

### System Prompt
The system prompt uses the template configured in the `config_dev.yaml` file for each agent. It usually contains instructions on the agent's role, actions, tips, response format, etc.
Use the `system_prompt_construction` method to construct the system prompt.

Prompts on the API instructions and demonstration examples are also included in the system prompt, constructed by the `api_prompt_helper` and `examples_prompt_helper` methods respectively. Below are the sub-components of the system prompt:

| Component | Description | Method |
| --- | --- | --- |
| `apis` | The API instructions for the agent. | `api_prompt_helper` |
| `examples` | The demonstration examples for the agent. | `examples_prompt_helper` |

### User Prompt
The user prompt is constructed based on the information from the agent's observation, external knowledge, and `Blackboard`. You can use the `user_prompt_construction` method to construct the user prompt. Below are the sub-components of the user prompt:

| Component | Description | Method |
| --- | --- | --- |
| `observation` | The observation of the agent. | `user_content_construction` |
| `retrieved_docs` | The knowledge retrieved from the external knowledge base. | `retrived_documents_prompt_helper` |
| `blackboard` | The information stored in the `Blackboard`. | `blackboard_to_prompt` |


## Reference
You can find the implementation of the `Prompter` in the `ufo/prompts` folder. Below is the basic structure of the `Prompter` class:

:::prompter.basic.BasicPrompter


!!!tip
You can customize the `Prompter` class to tailor the prompt to your requirements.
