agent doc
vyokky committed Jul 1, 2024
1 parent ef519f1 commit 0f58762
Showing 17 changed files with 686 additions and 8 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -28,7 +28,7 @@
- <b>AppAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. It's essential that the targeted controls are compatible with the Windows **UI Automation** or **Win32** API.

- Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939).
+ Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939) and [Documentation](https://microsoft.github.io/UFO/).
<h1 align="center">
<img src="./assets/framework_v2.png"/>
</h1>
151 changes: 151 additions & 0 deletions documents/docs/agents/app_agent.md
@@ -0,0 +1,151 @@
# AppAgent 👾

An `AppAgent` is responsible for iteratively executing actions on the selected application until the task is successfully concluded within that application. An `AppAgent` is created by the `HostAgent` to fulfill a sub-task within a `Round`, and executes the necessary actions within the application to fulfill the user's request. The `AppAgent` has the following features:

1. **[ReAct](https://arxiv.org/abs/2210.03629) with the Application** - The `AppAgent` iteratively interacts with the application in a workflow of observation->thought->action, leveraging the multi-modal capabilities of Visual Language Models (VLMs) to comprehend the application UI and fulfill the user's request.
2. **Comprehension Enhancement** - The `AppAgent` is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including external knowledge bases and demonstration libraries, making the agent an application "expert".
3. **Versatile Skill Set** - The `AppAgent` is equipped with a diverse set of skills to support comprehensive automation, such as mouse, keyboard, native APIs, and "Copilot".

We show the framework of the `AppAgent` in the following diagram:

<h1 align="center">
<img src="../../img/appagent.png" alt="AppAgent Image" width="80%">
</h1>

## AppAgent Input

To interact with the application, the `AppAgent` receives the following inputs:

| Input | Description | Type |
| --- | --- | --- |
| User Request | The user's request in natural language. | String |
| Sub-Task | The sub-task description to be executed by the `AppAgent`, assigned by the `HostAgent`. | String |
| Current Application | The name of the application to be interacted with. | String |
| Control Information | Index, name and control type of available controls in the application. | List of Dictionaries |
| Application Screenshots | Screenshots of the application to provide context to the `AppAgent`. | Image |
| Previous Sub-Tasks | The previous sub-tasks and their completion status. | List of Strings |
| Previous Plan | The previous plan for the following steps. | List of Strings |
| HostAgent Message | The message from the `HostAgent` for the completion of the sub-task. | String |
| Retrieved Information | The retrieved information from external knowledge bases or demonstration libraries. | String |
| Blackboard | The shared memory space for storing and sharing information among the agents. | Dictionary |

By processing these inputs, the `AppAgent` determines the necessary actions to fulfill the user's request within the application.

## AppAgent Output

With the inputs provided, the `AppAgent` generates the following outputs:

| Output | Description | Type |
| --- | --- | --- |
| Observation | The observation of the current application screenshots. | String |
| Thought | The logical reasoning process of the `AppAgent`. | String |
| ControlLabel | The index of the selected control to interact with. | String |
| ControlText | The name of the selected control to interact with. | String |
| Function | The function to be executed on the selected control. | String |
| Args | The arguments required for the function execution. | List of Strings |
| Status | The status of the agent, mapped to the `AgentState`. | String |
| Plan | The plan for the following steps after the current action. | List of Strings |
| Comment | Additional comments or information provided to the user. | String |
| SaveScreenshot | The flag to save the screenshot of the application to the `blackboard` for future reference. | Boolean |

Below is an example of the `AppAgent` output:

```json
{
"Observation": "Application screenshot",
"Thought": "Logical reasoning process",
"ControlLabel": "Control index",
"ControlText": "Control name",
"Function": "Function name",
"Args": ["arg1", "arg2"],
"Status": "AgentState",
"Plan": ["Step 1", "Step 2"],
"Comment": "Additional comments",
"SaveScreenshot": true
}
```

!!! info
The `AppAgent` output is formatted as a JSON object by LLMs and can be parsed by the `json.loads` method in Python.
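As noted, the response text can be parsed with `json.loads`. A minimal sketch of that step; the field names follow the example above, while the field values here are purely illustrative:

```python
import json

# Example response text produced by the LLM (field values are illustrative).
response_text = """
{
    "Observation": "Application screenshot",
    "Thought": "Logical reasoning process",
    "ControlLabel": "36",
    "ControlText": "Save button",
    "Function": "click_input",
    "Args": ["left", "single"],
    "Status": "CONTINUE",
    "Plan": ["Step 1", "Step 2"],
    "Comment": "Additional comments",
    "SaveScreenshot": true
}
"""

# Parse the JSON string into a Python dictionary; JSON true becomes Python True.
output = json.loads(response_text)
print(output["Function"], output["Args"])  # → click_input ['left', 'single']
```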


## AppAgent State
The `AppAgent` state is managed by a state machine that determines the next action to be executed based on the current state, as defined in the `ufo/agents/states/app_agent_states.py` module. The states include:

| State | Description |
| --- | --- |
| `CONTINUE` | The `AppAgent` continues executing the current action. |
| `FINISH` | The `AppAgent` has completed the current sub-task. |
| `ERROR` | The `AppAgent` encountered an error during execution. |
| `FAIL` | The `AppAgent` believes the current sub-task is unachievable. |
| `PENDING` | The `AppAgent` is waiting for user input or external information to proceed. |
| `CONFIRM` | The `AppAgent` is awaiting user confirmation before executing the current action. |
| `SCREENSHOT` | The `AppAgent` believes the current screenshot does not clearly annotate the controls and requests a new screenshot. |

The `AppAgent` progresses through these states to execute the necessary actions within the application and fulfill the sub-task assigned by the `HostAgent`.
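The states in the table above can be pictured as a small state machine. The following is an illustrative sketch only, not the actual `ufo/agents/states/app_agent_states.py` implementation; the grouping of terminal states is an assumption:

```python
from enum import Enum


class AgentState(Enum):
    """The AppAgent states listed in the table above."""
    CONTINUE = "CONTINUE"
    FINISH = "FINISH"
    ERROR = "ERROR"
    FAIL = "FAIL"
    PENDING = "PENDING"
    CONFIRM = "CONFIRM"
    SCREENSHOT = "SCREENSHOT"


def is_round_finished(state: AgentState) -> bool:
    """A sub-task round ends when the agent finishes, fails, or errors out
    (an assumed grouping for illustration)."""
    return state in {AgentState.FINISH, AgentState.FAIL, AgentState.ERROR}


print(is_round_finished(AgentState.FINISH))  # → True
```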


## Knowledge Enhancement
The `AppAgent` is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including external knowledge bases and demonstration libraries. The `AppAgent` leverages this knowledge to enhance its comprehension of the application and learn from demonstrations to improve its performance.

### Learning from Help Documents
Users can provide help documents to the `AppAgent`, configured in the `config.yaml` file, to enhance its comprehension of the application and improve its performance.

!!! tip
    Please find the detailed configuration in the [documentation](../configurations/user_configuration.md).
!!! tip
    You may also refer to [here]() for how to provide help documents to the `AppAgent`.


The `AppAgent` calls `build_offline_docs_retriever` to build a help document retriever, and uses `retrived_documents_prompt_helper` to construct the prompt for the `AppAgent`.
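Conceptually, the retrieval step selects the document snippets most relevant to the current request and feeds them into the prompt. A toy sketch of that idea using keyword-overlap scoring; the real retriever differs, and this function name is hypothetical:

```python
def retrieve_documents(request: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the request (toy scoring,
    standing in for a real embedding-based retriever)."""
    request_words = set(request.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(request_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


docs = [
    "How to insert a table in Word",
    "How to change the slide theme in PowerPoint",
    "How to add a chart to an Excel sheet",
]
print(retrieve_documents("insert a table into my Word document", docs, top_k=1))
```

The retrieved snippets would then be concatenated into the user prompt before the LLM call.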



### Learning from Bing Search
Since help documents may not cover all the information or the information may be outdated, the `AppAgent` can also leverage Bing search to retrieve the latest information. You can activate Bing search and configure the search engine in the `config.yaml` file.

!!! tip
    Please find the detailed configuration in the [documentation](../configurations/user_configuration.md).
!!! tip
    You may also refer to [here]() for the implementation of Bing search in the `AppAgent`.

The `AppAgent` calls `build_online_search_retriever` to build a Bing search retriever, and uses `retrived_documents_prompt_helper` to construct the prompt for the `AppAgent`.


### Learning from Self-Demonstrations
The `AppAgent` can learn from self-demonstrations by saving successful action trajectories. After a `session` completes, the `AppAgent` asks the user whether to save the action trajectories for future reference. You may configure the use of self-demonstrations in the `config.yaml` file.

!!! tip
    You can find details of the configuration in the [documentation](../configurations/user_configuration.md).

!!! tip
    You may also refer to [here]() for the implementation of self-demonstrations in the `AppAgent`.

The `AppAgent` calls `build_experience_retriever` to build a self-demonstration retriever, and uses `rag_experience_retrieve` to retrieve the demonstrations.

### Learning from Human Demonstrations
In addition to self-demonstrations, you can provide human demonstrations to the `AppAgent` using the [Steps Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) tool built into Windows. The `AppAgent` learns from these human demonstrations to improve its performance and achieve better personalization. The use of human demonstrations can be configured in the `config.yaml` file.

!!! tip
    You can find details of the configuration in the [documentation](../configurations/user_configuration.md).
!!! tip
    You may also refer to [here]() for the implementation of human demonstrations in the `AppAgent`.

The `AppAgent` calls `build_human_demonstration_retriever` to build a human demonstration retriever, and uses `rag_experience_retrieve` to retrieve the demonstrations.


## Skill Set for Automation
The `AppAgent` is equipped with a versatile skill set to support comprehensive automation within the application by calling the `create_puppteer_interface` method. The skills include:

| Skill | Description |
| --- | --- |
| UI Automation | Mimicking user interactions with the application UI controls using the `UI Automation` and `Win32` APIs. |
| Native API | Accessing the application's native API to execute specific functions and actions. |
| In-App Agent | Leveraging the in-app agent to interact with the application's internal functions and features. |

By utilizing these skills, the `AppAgent` can efficiently interact with the application and fulfill the user's request. You can find more details in the [Automator](../automator/overview.md) documentation and the code in the `ufo/automator` module.
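The `Function` and `Args` fields of the agent's output can then be dispatched to the matching skill. A hedged sketch of that dispatch pattern; the registry, decorator, and skill names here are illustrative, not the actual `ufo/automator` API:

```python
from typing import Callable

# Illustrative skill registry mapping function names to callables.
SKILLS: dict[str, Callable[..., str]] = {}


def register_skill(name: str):
    """Decorator registering a skill under the given name."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        SKILLS[name] = func
        return func
    return decorator


@register_skill("click_input")
def click_input(button: str, click_type: str) -> str:
    """A stand-in for a real mouse-click skill."""
    return f"clicked {button} ({click_type})"


def execute(function: str, args: list[str]) -> str:
    """Dispatch an LLM-selected function name to the registered skill."""
    if function not in SKILLS:
        raise ValueError(f"Unknown skill: {function}")
    return SKILLS[function](*args)


print(execute("click_input", ["left", "single"]))  # → clicked left (single)
```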


## Reference

:::agents.agent.app_agent.AppAgent
55 changes: 55 additions & 0 deletions documents/docs/agents/design/blackboard.md
@@ -0,0 +1,55 @@
# Agent Blackboard

The `Blackboard` is a shared memory space that is visible to all agents in the UFO framework. It stores information required for agents to interact with the user and applications at every step. The `Blackboard` is a key component of the UFO framework, enabling agents to share information and collaborate to fulfill user requests. The `Blackboard` is implemented as a class in the `ufo/agents/memory/blackboard.py` file.

## Components

The `Blackboard` consists of the following data components:

| Component | Description |
| --- | --- |
| `questions` | A list of questions that UFO asks the user, along with their corresponding answers. |
| `requests` | A list of historical user requests received in previous `Rounds`. |
| `trajectories` | A list of step-wise trajectories that record the agent's actions and decisions at each step. |
| `screenshots` | A list of screenshots taken by the agent when it believes the current state is important for future reference. |

!!! tip
The keys stored in the `trajectories` are configured as `HISTORY_KEYS` in the `config_dev.yaml` file. You can customize the keys based on your requirements and the agent's logic.

!!! tip
Whether to save the screenshots is determined by the `AppAgent`. You can enable or disable screenshot capture by setting the `SCREENSHOT_TO_MEMORY` flag in the `config_dev.yaml` file.

## Blackboard to Prompt

Data in the `Blackboard` is stored as `MemoryItem` instances. The `Blackboard` provides a `blackboard_to_prompt` method that converts the stored information into a string prompt; agents call this method to construct the prompt for the LLM's inference. The `blackboard_to_prompt` method is defined as follows:

```python
def blackboard_to_prompt(self) -> List[str]:
    """
    Convert the blackboard to a prompt.
    :return: The prompt.
    """
    prefix = [
        {
            "type": "text",
            "text": "[Blackboard:]",
        }
    ]

    blackboard_prompt = (
        prefix
        + self.texts_to_prompt(self.questions, "[Questions & Answers:]")
        + self.texts_to_prompt(self.requests, "[Request History:]")
        + self.texts_to_prompt(self.trajectories, "[Step Trajectories:]")
        + self.screenshots_to_prompt()
    )

    return blackboard_prompt
```
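The `texts_to_prompt` helper used above can be pictured as wrapping a list of texts under a section title. The following is a toy sketch under that assumption, not the actual implementation:

```python
def texts_to_prompt(texts: list[str], title: str) -> list[dict[str, str]]:
    """Wrap a list of texts under a section title, as chat message fragments
    (an assumed shape for illustration)."""
    return [{"type": "text", "text": title}] + [
        {"type": "text", "text": text} for text in texts
    ]


fragments = texts_to_prompt(["What file format?", "PDF"], "[Questions & Answers:]")
print(len(fragments))  # → 3
```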

## Reference

:::agents.memory.blackboard.Blackboard

!!!note
You can customize the class to tailor the `Blackboard` to your requirements.
24 changes: 24 additions & 0 deletions documents/docs/agents/design/memory.md
@@ -0,0 +1,24 @@
# Agent Memory

The `Memory` manages the memory of the agent and stores the information required for the agent to interact with the user and applications at every step. Part of the information in the `Memory` is visible to the agent for decision-making.


## MemoryItem
A `MemoryItem` is a `dataclass` that represents a single step in the agent's memory. The fields of a `MemoryItem` are flexible and can be customized based on the requirements of the agent. The `MemoryItem` class is defined as follows:

::: agents.memory.memory.MemoryItem

!!!info
At each step, an instance of `MemoryItem` is created and stored in the `Memory` to record the information of the agent's interaction with the user and applications.


## Memory
The `Memory` class is responsible for managing the memory of the agent. It stores a list of `MemoryItem` instances that represent the agent's memory at each step. The `Memory` class is defined as follows:

::: agents.memory.memory.Memory

!!!info
Each agent has its own `Memory` instance to store its information.

!!!info
Not all information in the `Memory` is provided to the agent for decision-making. The agent can access parts of the memory based on the requirements of its logic.
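The relationship between the two classes can be sketched as follows. This is a minimal illustration; the actual classes in `ufo/agents/memory` differ in detail, and the fields shown here are assumptions:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MemoryItem:
    """One step of the agent's memory; fields are flexible (assumed here)."""
    step: int
    observation: str = ""
    extra: dict[str, Any] = field(default_factory=dict)


@dataclass
class Memory:
    """An ordered list of MemoryItem instances, one per step."""
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, item: MemoryItem) -> None:
        self.items.append(item)

    @property
    def length(self) -> int:
        return len(self.items)


memory = Memory()
memory.add(MemoryItem(step=0, observation="App launched"))
print(memory.length)  # → 1
```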
29 changes: 29 additions & 0 deletions documents/docs/agents/design/processor.md
@@ -0,0 +1,29 @@
# Agents Processor

The `Processor` is a key component of an agent, implementing the core logic that processes the user's request. The `Processor` is implemented as a class in the `ufo/agents/processors` folder, and each agent has its own `Processor` class within the folder.

## Core Process
Once called, an agent processes the user's request by calling the `process` method defined in its `Processor` class. The workflow of `process` is as follows:

| Step | Description | Function |
| --- | --- | --- |
| 1 | Print the step information. | `print_step_info` |
| 2 | Capture the screenshot of the application. | `capture_screenshot` |
| 3 | Get the control information of the application. | `get_control_info` |
| 4 | Get the prompt message for the LLM. | `get_prompt_message` |
| 5 | Generate the response from the LLM. | `get_response` |
| 6 | Update the cost of the step. | `update_cost` |
| 7 | Parse the response from the LLM. | `parse_response` |
| 8 | Execute the action based on the response. | `execute_action` |
| 9 | Update the memory and blackboard. | `update_memory` |
| 10 | Update the status of the agent. | `update_status` |
| 11 | Update the step information. | `update_step` |

At each step, the `Processor` processes the user's request by invoking the corresponding method sequentially to execute the necessary actions.


The process may be paused and later resumed via the `resume` method, based on the agent's logic and the user's request.
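The step table above can be sketched as a sequential pipeline. This is an illustrative skeleton only, not the real `BaseProcessor`; a real processor would call the named method at each step instead of logging it:

```python
class ProcessorSketch:
    """Illustrative processor running the documented steps in order."""

    def __init__(self) -> None:
        self.log: list[str] = []
        # The eleven steps from the table above, in order.
        self.steps = [
            "print_step_info", "capture_screenshot", "get_control_info",
            "get_prompt_message", "get_response", "update_cost",
            "parse_response", "execute_action", "update_memory",
            "update_status", "update_step",
        ]

    def process(self) -> list[str]:
        for step in self.steps:
            self.log.append(step)  # A real processor would invoke the method here.
        return self.log


print(ProcessorSketch().process()[0])  # → print_step_info
```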

## Reference
Below is the basic structure of the `Processor` class:
:::agents.processors.basic.BaseProcessor
47 changes: 47 additions & 0 deletions documents/docs/agents/design/prompter.md
@@ -0,0 +1,47 @@
# Agent Prompter

The `Prompter` is a key component of the UFO framework, responsible for constructing prompts for the LLM to generate responses. The `Prompter` is implemented in the `ufo/prompts` folder. Each agent has its own `Prompter` class that defines the structure of the prompt and the information to be fed to the LLM.

## Components

A prompt fed to the LLM is usually a list of dictionaries, where each dictionary contains the following keys:

| Key | Description |
| --- | --- |
| `role` | The role of the text in the prompt, can be `system`, `user`, or `assistant`. |
| `content` | The content of the text for the specific role. |

!!!tip
You may find the [official documentation](https://help.openai.com/en/articles/7042661-moving-from-completions-to-chat-completions-in-the-openai-api) helpful for constructing the prompt.

In the `__init__` method of the `Prompter` class, you can define the template of the prompt for each component, and the final prompt message is constructed by combining the templates of each component using the `prompt_construction` method.
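As a hedged sketch, the combination step might look like the following; the `prompt_construction` signature here is an assumption for illustration, not the actual API:

```python
def prompt_construction(system_prompt: str, user_contents: list[str]) -> list[dict[str, str]]:
    """Combine system and user components into a chat message list
    using the role/content keys described above."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": "user", "content": content} for content in user_contents]
    return messages


prompt = prompt_construction(
    "You are an AppAgent operating a Windows application.",
    ["[Observation:] the Save dialog is open", "[Blackboard:] ..."],
)
print(prompt[0]["role"])  # → system
```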

### System Prompt
The system prompt uses the template configured in the `config_dev.yaml` file for each agent. It usually contains instructions on the agent's role, actions, tips, response format, etc.
Use the `system_prompt_construction` method to construct the system prompt.

Prompts on the API instructions and demonstration examples are also included in the system prompt, constructed by the `api_prompt_helper` and `examples_prompt_helper` methods respectively. Below are the sub-components of the system prompt:

| Component | Description | Method |
| --- | --- | --- |
| `apis` | The API instructions for the agent. | `api_prompt_helper` |
| `examples` | The demonstration examples for the agent. | `examples_prompt_helper` |

### User Prompt
The user prompt is constructed based on the information from the agent's observation, external knowledge, and `Blackboard`. You can use the `user_prompt_construction` method to construct the user prompt. Below are the sub-components of the user prompt:

| Component | Description | Method |
| --- | --- | --- |
| `observation` | The observation of the agent. | `user_content_construction` |
| `retrieved_docs` | The knowledge retrieved from the external knowledge base. | `retrived_documents_prompt_helper` |
| `blackboard` | The information stored in the `Blackboard`. | `blackboard_to_prompt` |


## Reference
You can find the implementation of the `Prompter` in the `ufo/prompts` folder. Below is the basic structure of the `Prompter` class:

:::prompter.basic.BasicPrompter


!!!tip
You can customize the `Prompter` class to tailor the prompt to your requirements.
