Commit

Merge pull request microsoft#109 from microsoft/vyokky/dev
Vyokky/dev
vyokky authored Jul 6, 2024
2 parents 71509ec + f96901f commit b46cb89
Showing 3 changed files with 31 additions and 23 deletions.
24 changes: 12 additions & 12 deletions README.md
@@ -26,29 +26,30 @@
<b>UFO</b> <img src="./assets/ufo_blue.png" alt="UFO Image" width="24"> operates as a multi-agent framework, encompassing:
- <b>HostAgent 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>AppAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, is tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. It's essential that the targeted controls are compatible with the Windows **UI Automation** or **Win32** API.
- <b>Application Automator 🎮</b>, tasked with translating actions from HostAgent and AppAgent into interactions with the application through UI controls, native APIs, or AI tools. Check out more details [here](https://microsoft.github.io/UFO/automator/overview/).

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939) and [Documentation](https://microsoft.github.io/UFO/).
Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939) and [documentation](https://microsoft.github.io/UFO/).
<h1 align="center">
<img src="./assets/framework_v2.png"/>
</h1>


## 📢 News
- 📅 2024-06-28: We are thrilled to announce that our official introduction video is now available on [YouTube](https://www.youtube.com/watch?v=QT_OhygMVXU)! Additionally, you can check out the early version of our [documentation](https://microsoft.github.io/UFO/). We welcome your contributions and feedback!
- 📅 2024-07-06: We have a **New Release for v1.0.0!** You can check out our [documentation](https://microsoft.github.io/UFO/). We welcome your contributions and feedback!
- 📅 2024-06-28: We are thrilled to announce that our official introduction video is now available on [YouTube](https://www.youtube.com/watch?v=QT_OhygMVXU)!
- 📅 2024-06-25: **New Release for v0.2.1!** We are excited to announce the release of version 0.2.1! This update includes several new features and improvements:
1. **HostAgent Refactor:** We've refactored the HostAgent to enhance its efficiency in managing AppAgents within UFO.
2. **Evaluation Agent:** Introducing an evaluation agent that assesses task completion and provides real-time feedback.
3. **Google Gemini Support:** UFO now supports Google Gemini as the inference engine. Refer to our detailed guide in [Documentation](https://microsoft.github.io/UFO/supported_models/gemini/).
3. **Google Gemini Support:** UFO now supports Google Gemini as the inference engine. Refer to our detailed guide in [documentation](https://microsoft.github.io/UFO/supported_models/gemini/).
4. **Customized User Agents:** Users can now create customized agents by simply answering a few questions.
- 📅 2024-05-21: We have reached 5K stars!✨
- 📅 2024-05-08: **New Release for v0.1.1!** We've made some significant updates! Previously known as AppAgent and ActAgent, we've rebranded them to HostAgent and AppAgent to better align with their functionalities. Explore the latest enhancements:
1. **Learning from Human Demonstration:** UFO now supports learning from human demonstration! Utilize the [Windows Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to record your steps and demonstrate them for UFO. Refer to our detailed guide in [README.md](/record_processor/README.md) for more information.
1. **Learning from Human Demonstration:** UFO now supports learning from human demonstration! Utilize the [Windows Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to record your steps and demonstrate them for UFO. Refer to our detailed guide in [README.md](https://microsoft.github.io/UFO/creating_app_agent/demonstration_provision/) for more information.
2. **Win32 Support:** We've incorporated support for [Win32](https://learn.microsoft.com/en-us/windows/win32/controls/window-controls) as a control backend, enhancing our UI automation capabilities.
3. **Extended Application Interaction:** UFO now goes beyond UI controls, allowing interaction with your application through keyboard inputs and native APIs! Presently, we support Word ([examples](/ufo/prompts/apps/word/api.yaml)), with more to come soon. Customize and build your own interactions.
4. **Control Filtering:** Streamline LLM's action process by using control filters to remove irrelevant control items. Enable them in [config_dev.yaml](/ufo/config/config_dev.yaml) under the `control filtering` section at the bottom.
- 📅 2024-03-25: **New Release for v0.0.1!** Check out our exciting new features.
1. We now support creating your help documents for each Windows application to become an app expert. Check the [README](./learner/README.md) for more details!
1. We now support creating your help documents for each Windows application to become an app expert. Check the [README](https://microsoft.github.io/UFO/creating_app_agent/help_document_provision/) for more details!
2. UFO now supports RAG from offline documents and online Bing search.
3. You can save the task completion trajectory into its memory for UFO's reference, improving its future success rate!
4. You can customize different GPT models for AppAgent and ActAgent. Text-only models (e.g., GPT-4) are now supported!
@@ -99,7 +100,7 @@ pip install -r requirements.txt
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individually for HostAgent and AppAgent**. You can create your own config file `ufo/config/config.yaml`, by copying the `ufo/config/config.yaml.template` and editing config for **APP_AGENT** and **ACTION_AGENT** as follows:
Before running UFO, you need to provide your LLM configurations **individually for HostAgent and AppAgent**. You can create your own config file `ufo/config/config.yaml`, by copying the `ufo/config/config.yaml.template` and editing config for **HOST_AGENT** and **APP_AGENT** as follows:


#### OpenAI
@@ -140,10 +141,10 @@ UFO also supports other LLMs and advanced configurations, such as customize your
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.

We provide the following options for RAG to enhance UFO's capabilities:
- **[Offline Help Document](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_help_document/)**: Enable UFO to retrieve information from offline help documents.
- **[Online Bing Search Engine](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_bing_search/)**: Enhance UFO's capabilities by utilizing the most up-to-date online search results.
- **[Self-Experience](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/experience_learning/)**: Save task completion trajectories into UFO's memory for future reference.
- **[User-Demonstration](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_demonstration/)**: Boost UFO's capabilities through user demonstration.
- [Offline Help Document](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_help_document/): Enable UFO to retrieve information from offline help documents.
- [Online Bing Search Engine](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_bing_search/): Enhance UFO's capabilities by utilizing the most up-to-date online search results.
- [Self-Experience](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/experience_learning/): Save task completion trajectories into UFO's memory for future reference.
- [User-Demonstration](https://microsoft.github.io/UFO/advanced_usage/reinforce_appagent/learning_from_demonstration/): Boost UFO's capabilities through user demonstration.

Consult their respective documentation for more information on how to configure these settings.

@@ -256,7 +257,6 @@ https://github.com/microsoft/UFO/assets/11352048/aa41ad47-fae7-4334-8e0b-ba71c4f




## 📊 Evaluation

Please consult the [WindowsBench](https://arxiv.org/pdf/2402.07939.pdf) provided in Section A of the Appendix within our technical report. Here are some tips (and requirements) to aid in completing your request:
10 changes: 9 additions & 1 deletion documents/docs/automator/overview.md
@@ -51,7 +51,7 @@ You can find the reference for a basic `Command` class below:
...


## Invoker
## Invoker (AppPuppeteer)

The `AppPuppeteer` plays the role of the invoker in the Automator application. It triggers the commands to be executed by the receivers. The `AppPuppeteer` equips the `AppAgent` with the capability to interact with the application's UI controls. It provides functionalities to translate action strings into specific actions and execute them. All available actions are registered in the `Puppeteer` with the `ReceiverManager` class.

@@ -61,4 +61,12 @@ You can find the implementation of the `AppPuppeteer` class in the `ufo/automato

<br>


## Receiver Manager
The `ReceiverManager` manages all the receivers and commands in the Automator application. It provides functionalities to register and retrieve receivers and commands. It is a complementary component to the `AppPuppeteer`.

::: automator.puppeteer.ReceiverManager

<br>

For further details, refer to the specific documentation for each component and class in the Automator module.
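The invoker, command, and receiver roles described above can be illustrated with a minimal sketch. It is not UFO's implementation: `TextReceiver`, `TypeTextCommand`, and `SketchInvoker` are hypothetical stand-ins, and the registry is a plain dictionary rather than the actual `ReceiverManager`.

```python
from typing import Any, Dict, Type


class TextReceiver:
    """Hypothetical receiver: owns the object that actually performs the work."""

    def __init__(self) -> None:
        self.buffer = ""

    def type_text(self, text: str) -> str:
        # Append the text and return the full buffer as the action result.
        self.buffer += text
        return self.buffer


class TypeTextCommand:
    """Hypothetical command: binds a receiver to the parameters of one action."""

    def __init__(self, receiver: TextReceiver, params: Dict[str, Any]) -> None:
        self.receiver = receiver
        self.params = params

    def execute(self) -> str:
        return self.receiver.type_text(self.params["text"])


class SketchInvoker:
    """Hypothetical invoker: maps action strings to commands and runs them."""

    def __init__(self, receiver: TextReceiver) -> None:
        self.receiver = receiver
        self._commands: Dict[str, Type[TypeTextCommand]] = {}

    def register(self, name: str, command_class: Type[TypeTextCommand]) -> None:
        self._commands[name] = command_class

    def execute_command(self, name: str, params: Dict[str, Any]) -> str:
        # Translate the action string into a command, then execute it.
        if name not in self._commands:
            raise ValueError(f"Unknown action: {name}")
        return self._commands[name](self.receiver, params).execute()
```

In this sketch the invoker never touches the receiver's methods directly; it only looks up a command by its action string and calls `execute()`, which is the separation the command pattern is meant to provide.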
20 changes: 10 additions & 10 deletions documents/docs/creating_app_agent/warpping_app_native_api.md
@@ -12,16 +12,16 @@ The `Receiver` is a class that receives the native API calls from the `AppAgent`

To create a `Receiver` class, follow these steps:

**1. Create a Folder for Your Application:**
#### 1. Create a Folder for Your Application

- Navigate to the `ufo/automator/app_api/` directory.
- Create a folder named after your application.

**2. Create a Python File:**
#### 2. Create a Python File

- Inside the folder you just created, add a Python file named after your application, for example, `{your_application}_client.py`.

**3. Define the Receiver Class:**
#### 3. Define the Receiver Class

- In the Python file, define a class named `{Your_Receiver}`, inheriting from the `ReceiverBasic` class located in `ufo/automator/basic.py`.
- Initialize the `Your_Receiver` class with the object that executes the native API calls. For example, if your API is based on a `com` object, initialize the `com` object in the `__init__` method of the `Your_Receiver` class.
@@ -52,7 +52,7 @@ class WinCOMReceiverBasic(ReceiverBasic):
```
---

**4. Define Methods to Execute Native API Calls:**
#### 4. Define Methods to Execute Native API Calls

- Define the methods in the `Your_Receiver` class to execute the native API calls.

@@ -77,7 +77,7 @@ def table2markdown(self, sheet_name: str) -> str:
---


**5. Create a Factory Class:**
#### 5. Create a Factory Class

- Create your Factory class inheriting from the `APIReceiverFactory` class to manage multiple `Receiver` classes that share the same API type.
- Implement the `create_receiver` and `name` methods in the `ReceiverFactory` class. The `create_receiver` method should return the `Receiver` class.
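As a rough illustration of the factory step, the sketch below defines a stand-in abstract factory with `create_receiver` and `name` methods. The base class `SketchReceiverFactory`, the `ExcelLikeReceiver` class, and the constructor arguments are all hypothetical; the real base class in `ufo/automator/` may have a different signature. It also assumes that `create_receiver` returns a receiver instance.

```python
from abc import ABC, abstractmethod


class SketchReceiverFactory(ABC):
    """Hypothetical stand-in for a receiver factory base class."""

    @abstractmethod
    def create_receiver(self, app_root_name: str, process_name: str):
        """Build the receiver for the given application."""

    @classmethod
    @abstractmethod
    def name(cls) -> str:
        """Identifier under which the factory would be looked up."""


class ExcelLikeReceiver:
    """Hypothetical receiver that would wrap a native API object."""

    def __init__(self, app_root_name: str, process_name: str) -> None:
        self.app_root_name = app_root_name
        self.process_name = process_name


class ExcelLikeFactory(SketchReceiverFactory):
    def create_receiver(self, app_root_name: str, process_name: str) -> ExcelLikeReceiver:
        # Assumption: the factory hands back a ready-to-use receiver instance.
        return ExcelLikeReceiver(app_root_name, process_name)

    @classmethod
    def name(cls) -> str:
        return "ExcelLike"
```

Keeping construction in a factory means several receivers that share one API type can be created through a single, named entry point.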
@@ -134,7 +134,7 @@ The `Receiver` class is now ready to receive the native API calls from the `AppA

Commands are the actions that the `AppAgent` can execute on the application. To create a command for the native API, you need to create a `Command` class that contains the method to execute the native API calls.

**1. Create a Command Class:**
#### 1. Create a Command Class

- Create a `Command` class in the same Python file where the `Receiver` class is located. The `Command` class should inherit from the `CommandBasic` class located in `ufo/automator/basic.py`.

@@ -167,7 +167,7 @@ class WinCOMCommand(CommandBasic):
```
---

**2. Define the Execute Method:**
#### 2. Define the Execute Method

- Define the `execute` method in the `Command` class to call the receiver to execute the native API calls.
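A minimal sketch of a command whose `execute` method delegates to its receiver is shown here. `SketchCommand`, `SketchReceiver`, and the `register` decorator are hypothetical stand-ins written for this example, not the classes in `ufo/automator/basic.py`, and the receiver method returns a placeholder string instead of calling a real native API.

```python
from typing import Any, Dict, Type


class SketchCommand:
    """Hypothetical command base: holds a receiver and the call parameters."""

    def __init__(self, receiver: "SketchReceiver", params: Dict[str, Any]) -> None:
        self.receiver = receiver
        self.params = params

    def execute(self) -> Any:
        raise NotImplementedError

    @classmethod
    def name(cls) -> str:
        raise NotImplementedError


class SketchReceiver:
    """Hypothetical receiver with a class-level command registry."""

    _command_registry: Dict[str, Type[SketchCommand]] = {}

    @classmethod
    def register(cls, command_class: Type[SketchCommand]) -> Type[SketchCommand]:
        # Store the command under its declared name and return it unchanged,
        # so the method can be used as a class decorator.
        cls._command_registry[command_class.name()] = command_class
        return command_class

    def table2markdown(self, sheet_name: str) -> str:
        # Placeholder result; a real receiver would call the native API here.
        return f"| sheet |\n| --- |\n| {sheet_name} |"


@SketchReceiver.register
class Table2MarkdownCommand(SketchCommand):
    def execute(self) -> str:
        # Delegate to the receiver; the command itself holds no API logic.
        return self.receiver.table2markdown(self.params.get("sheet_name", "Sheet1"))

    @classmethod
    def name(cls) -> str:
        return "table2markdown"
```

With this layout, a command can be looked up by name in `_command_registry` and executed without the caller knowing which receiver method it maps to.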

@@ -204,11 +204,11 @@ The `Command` class is now registered in the `Receiver` class and available for

To let the `AppAgent` know the usage of the native API calls, you need to provide prompt descriptions.

**1. Create an api.yaml File:**
#### 1. Create an api.yaml File

- Create an `api.yaml` file in the `ufo/prompts/apps/{your_app_name}` directory.

**2. Define Prompt Descriptions:**
#### 2. Define Prompt Descriptions

- Define the prompt descriptions for the native API calls in the `api.yaml` file.

@@ -234,7 +234,7 @@ usage: |-
The `table2markdown` is the name of the native API call. It `MUST` match the `name()` defined in the corresponding `Command` class!
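Because a mismatch between the prompt name and the command name is easy to introduce, a small consistency check can help. The sketch below is an illustration only: it assumes the `api.yaml` content has already been loaded into a dictionary keyed by API name, and `find_unmatched_prompts` and the `insert_table` entry are hypothetical.

```python
from typing import Dict, Iterable, List


def find_unmatched_prompts(
    prompt_entries: Dict[str, dict], command_names: Iterable[str]
) -> List[str]:
    """Return prompt names that have no command with the same name."""
    known = set(command_names)
    return sorted(name for name in prompt_entries if name not in known)


# Stand-in for parsed api.yaml content (hypothetical entries).
prompts = {
    "table2markdown": {"summary": "Convert a sheet to a markdown table."},
    "insert_table": {"summary": "Hypothetical entry with no matching command."},
}

# Stand-in for the names returned by each Command class's name() method.
registered_commands = ["table2markdown"]

unmatched = find_unmatched_prompts(prompts, registered_commands)
```

Any name left in `unmatched` points at a prompt description that the agent could select but that no command would handle.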


**3. Register the Prompt Address in config_dev.yaml:**
#### 3. Register the Prompt Address in `config_dev.yaml`

- Register the prompt address by adding an entry to the `APP_API_PROMPT_ADDRESS` field of the `config_dev.yaml` file, with the application program name as the key and the prompt file address as the value.
