Merge pull request microsoft#70 from microsoft/pre-release
New release for v0.2.0
vyokky authored May 8, 2024
2 parents bb57247 + 4c633bc commit d29ace8
Showing 91 changed files with 10,077 additions and 2,330 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,6 +10,7 @@
__pycache__/
**/__pycache__/
*.pyc
/.VSCodeCounter

# Ignore the config file
ufo/config/config.yaml
@@ -21,8 +22,10 @@ ufo/rag/app_docs/*
learner/records.json
vectordb/docs/*
vectordb/experience/*
vectordb/demonstration/*

# Don't ignore the example files
!vectordb/docs/example/
!vectordb/demonstration/example.yaml

.vscode
35 changes: 26 additions & 9 deletions README.md
@@ -22,17 +22,22 @@

## 🕌 Framework
<b>UFO</b> <img src="./assets/ufo_blue.png" alt="UFO Image" width="24"> operates as a dual-agent framework, encompassing:
- <b>AppAgent 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>ActAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, tasked with translating actions from AppAgent and ActAgent into interactions with the application and its UI controls. The targeted controls must be compatible with the Windows **UI Automation** API.
- <b>HostAgent (Previously AppAgent) 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>AppAgent (Previously ActAgent) 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. The targeted controls must be compatible with the Windows **UI Automation** API.

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939).
<h1 align="center">
<img src="./assets/framework.png"/>
<img src="./assets/framework_v2.png"/>
</h1>
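
To make the dual-agent loop concrete, here is a purely illustrative sketch of how the HostAgent and AppAgent roles fit together; every class, method, and stopping condition below is hypothetical and does not correspond to the actual UFO code.

```python
# Illustrative only: hypothetical names, not the real UFO implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    request: str                              # the user's natural-language request
    history: List[str] = field(default_factory=list)
    done: bool = False


class HostAgent:
    """Chooses (or switches) the application that should handle the request."""

    def select_application(self, task: Task) -> str:
        # In UFO this decision is made by an LLM looking at the desktop state;
        # here we simply return a placeholder application name.
        return "placeholder_app"


class AppAgent:
    """Iteratively executes actions inside the selected application."""

    def step(self, task: Task, app: str) -> None:
        # In UFO each step is an LLM-proposed UI action executed through the
        # control-interaction layer (UI Automation / Win32); this is a stand-in.
        task.history.append(f"acted on {app}")
        task.done = len(task.history) >= 3    # toy stopping condition


def run(request: str) -> Task:
    task = Task(request)
    host, app_agent = HostAgent(), AppAgent()
    while not task.done:
        app = host.select_application(task)   # HostAgent picks or switches the app
        app_agent.step(task, app)             # AppAgent acts within that app
    return task


if __name__ == "__main__":
    print(run("Summarize the open document").history)
```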


## 📢 News
- 📅 2024-05-08: **New Release for v0.1.1!** We've made some significant updates! The agents previously known as AppAgent and ActAgent have been rebranded as HostAgent and AppAgent to better align with their functionalities. Explore the latest enhancements:
1. **Learning from Human Demonstration:** UFO now supports learning from human demonstration! Utilize the [Windows Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to record your steps and demonstrate them for UFO. Refer to our detailed guide in [README.md](/record_processor/README.md) for more information.
2. **Win32 Support:** We've incorporated support for [Win32](https://learn.microsoft.com/en-us/windows/win32/controls/window-controls) as a control backend, enhancing our UI automation capabilities.
3. **Extended Application Interaction:** UFO now goes beyond UI controls, allowing interaction with your application through keyboard inputs and native APIs! Presently, we support Word ([examples](/ufo/prompts/apps/word/api.yaml)), with more to come soon. Customize and build your own interactions.
4. **Control Filtering:** Streamline LLM's action process by using control filters to remove irrelevant control items. Enable them in [config_dev.yaml](/ufo/config/config_dev.yaml) under the `control filtering` section at the bottom.
- 📅 2024-03-25: **New Release for v0.0.1!** Check out our exciting new features:
  1. We now support creating your own help documents for each Windows application, turning UFO into an app expert. Check the [README](./learner/README.md) for more details!
2. UFO now supports RAG from offline documents and online Bing search.
@@ -80,10 +85,11 @@ git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt
# If you want to use Qwen as your LLM, uncomment the related libraries.
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individually for AppAgent and ActAgent**. You can create your own config file `ufo/config/config.yaml` by copying `ufo/config/config.yaml.template` and editing the config for **APP_AGENT** and **ACTION_AGENT** as follows:
Before running UFO, you need to provide your LLM configurations **individually for HostAgent and AppAgent**. You can create your own config file `ufo/config/config.yaml` by copying `ufo/config/config.yaml.template` and editing the config for **APP_AGENT** and **ACTION_AGENT** as follows:

#### OpenAI
```bash
@@ -105,17 +111,19 @@ API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input
API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API
```
You can also use a non-visual model (e.g., GPT-4) for each agent by setting `VISUAL_MODE: True` and a proper `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI). You can also optionally set a backup LLM engine in the `BACKUP_AGENT` field in case the above engines fail during inference.
You can also use a non-visual model (e.g., GPT-4) for each agent by setting `VISUAL_MODE: False` and a proper `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI). You can also optionally set a backup LLM engine in the `BACKUP_AGENT` field in case the above engines fail during inference.


#### Non-Visual Model Configuration
You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the config.yaml file:
You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the `config.yaml` file:

- ```VISUAL_MODE: False # To enable non-visual mode.```
- Specify the appropriate `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI) for each agent.

Optionally, you can set a backup language model (LLM) engine in the `BACKUP_AGENT` field to handle cases where the primary engines fail during inference. Ensure you configure these settings accurately to leverage non-visual models effectively.
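
As a rough illustration of how a backup engine can be used, the sketch below tries a primary completion call and falls back to a second one on failure; the callables and the message format are placeholders for whatever engines you configure, not UFO's actual dispatch code.

```python
# Hypothetical sketch of primary/backup LLM dispatch; not UFO's actual code.
from typing import Callable, Dict, List

Message = Dict[str, str]


def complete_with_fallback(
    messages: List[Message],
    primary: Callable[[List[Message]], str],
    backup: Callable[[List[Message]], str],
) -> str:
    """Try the primary engine first; fall back to a BACKUP_AGENT-style engine on error."""
    try:
        return primary(messages)
    except Exception as exc:  # e.g., rate limit, timeout, deployment error
        print(f"Primary engine failed ({exc}); falling back to the backup engine.")
        return backup(messages)
```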

#### NOTE
💡 UFO also supports other LLMs and advanced configurations, such as customizing your own model; please check the [documents](./model_worker/readme.md) for more details. Because of model input limitations, a lite version of the prompt is provided so that users can try it out, which is configured in `config_dev.yaml`.

### 📔 Step 3: Additional Setting for RAG (optional).
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.
@@ -157,6 +165,15 @@ RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the offline retrieved documents
```

#### RAG from User-Demonstration
Boost UFO's capabilities through user demonstration! Utilize Microsoft Steps Recorder to record step-by-step processes for achieving specific tasks. With a simple command processed by the record_processor (refer to the [README](./record_processor/README.md)), UFO can store these trajectories in its memory for future reference, enhancing its learning from user interactions.

You can enable this function by setting the following configuration:
```bash
## RAG Configuration for demonstration
RAG_DEMONSTRATION: True # Whether to use the RAG from its user demonstration.
RAG_DEMONSTRATION_RETRIEVED_TOPK: 5 # The topk for the demonstration examples.
```


### 🎉 Step 4: Start UFO
@@ -232,7 +249,7 @@ Please consult the [WindowsBench](https://arxiv.org/pdf/2402.07939.pdf) provided


## 📚 Citation
Our technical report paper can be found [here](https://arxiv.org/abs/2402.07939).
Our technical report paper can be found [here](https://arxiv.org/abs/2402.07939). Note that the AppAgent and ActAgent from the paper have been renamed to HostAgent and AppAgent in the code base to better reflect their functions.
If you use UFO in your research, please cite our paper:
```
@article{ufo,
@@ -245,9 +262,9 @@

## 📝 Todo List
- [x] RAG enhanced UFO.
- [x] Support more control using Win32 API.
- [ ] Documentation.
- [ ] Support local host GUI interaction model.
- [ ] Support more control using Win32 API.
- [ ] Chatbox GUI for UFO.


Binary file added assets/framework_v2.png
Binary file added assets/record_processor/add_comment.png
11 changes: 11 additions & 0 deletions learner/README.md
@@ -30,3 +30,14 @@ Replace `app_name` with the name of the application, such as PowerPoint or WeCha
Replace `path_of_the_docs` with the full path to the folder containing all your documents.

This command will create an offline indexer for all documents in the `path_of_the_docs` folder using FAISS and sentence-transformer embeddings (more embedding models will be supported soon). By default, the created index will be placed [here](../vectordb/docs/).
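
For reference, the same indexer can also be invoked programmatically from the repository root, mirroring what `learner/learn.py` does; the application name and document path below are placeholders.

```python
# Build an offline FAISS index for an app's help documents (placeholder values).
from learner import indexer

db_path = indexer.create_indexer(
    app="PowerPoint",              # name of the application
    docs="path_of_the_docs",       # folder containing the help documents
    format="xml",                  # currently the supported format
    incremental=False,             # set True to merge with an existing index
    save_path="./vectordb/docs/",  # where the FAISS index is stored
)
print(f"Index saved to {db_path}")
```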



## How to Enable RAG from Help Documents during Online Inference ❓
To enable this in online inference, you can set the following configuration in the `ufo/config/config.yaml` file:
```bash
## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: True # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
```
Adjust `RAG_OFFLINE_DOCS_RETRIEVED_TOPK` to optimize performance.
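
Under the hood, online retrieval amounts to loading the saved FAISS index and querying it with the configured top-k. A minimal sketch using the same `langchain_community` classes as `learner/indexer.py` (the index path and query are placeholders):

```python
# Load a saved index and retrieve the top-k most relevant help-document chunks.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
db = FAISS.load_local("./vectordb/docs/PowerPoint", embeddings)  # placeholder path

top_k = 1  # mirrors RAG_OFFLINE_DOCS_RETRIEVED_TOPK
results = db.similarity_search("How do I insert a chart?", k=top_k)
for doc in results:
    print(doc.page_content)
```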
6 changes: 0 additions & 6 deletions learner/basic.py
@@ -16,15 +16,13 @@ def __init__(self, extensions: str = None, directory: str = None):
        self.extensions = extensions
        self.directory = directory


    def load_file_name(self):
        """
        Load the documents from the given directory.
        :param directory: The directory to load from.
        :return: The list of loaded documents.
        """
        return utils.find_files_with_extension(self.directory, self.extensions)


    def construct_document_list(self):
        """
@@ -33,7 +31,3 @@ def construct_document_list(self):
        :return: The list of metadata for the loaded documents.
        """
        pass




30 changes: 18 additions & 12 deletions learner/indexer.py
@@ -6,8 +6,8 @@
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"


def create_indexer(app: str, docs: str, format: str, incremental: bool, save_path: str):
@@ -31,35 +31,41 @@ def create_indexer(app: str, docs: str, format: str, incremental: bool, save_pat
    loader = xml_loader.XMLLoader(docs)
    documents = loader.construct_document()

    print_with_color("Creating indexer for {num} documents for {app}...".format(num=len(documents), app=app), "yellow")
    print_with_color(
        "Creating indexer for {num} documents for {app}...".format(
            num=len(documents), app=app
        ),
        "yellow",
    )

    if format == "xml":
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2"
        )
    else:
        raise ValueError("Invalid format: " + format)

    db = FAISS.from_documents(documents, embeddings)

    if incremental:
        if app in records:
            print_with_color("Merging with previous indexer...", "yellow")
            prev_db = FAISS.load_local(records[app], embeddings)
            db.merge_from(prev_db)

    db_file_path = os.path.join(save_path, app)
    db_file_path = os.path.abspath(db_file_path)
    db.save_local(db_file_path)

    records[app] = db_file_path


    save_json_file("./learner/records.json", records)

    print_with_color("Indexer for {app} created successfully. Save in {path}.".format(app=app, path=db_file_path), "green")
    print_with_color(
        "Indexer for {app} created successfully. Save in {path}.".format(
            app=app, path=db_file_path
        ),
        "green",
    )

    return db_file_path





43 changes: 27 additions & 16 deletions learner/learn.py
@@ -5,32 +5,43 @@
from . import indexer



# configs = load_config()

args = argparse.ArgumentParser()
args.add_argument("--app", help="The name of application to learn.",
type=str, default="./")
args.add_argument("--docs", help="The help application of the app.", type=str,
default="./")
args.add_argument("--format", help="The format of the help doc.", type=str,
default="xml")
args.add_argument('--incremental', action='store_true', help='Enable incremental update.')
args.add_argument("--save_path", help="The format of the help doc.", type=str,
default="./vectordb/docs/")


args.add_argument(
    "--app", help="The name of application to learn.", type=str, default="./"
)
args.add_argument(
    "--docs", help="The help application of the app.", type=str, default="./"
)
args.add_argument(
    "--format", help="The format of the help doc.", type=str, default="xml"
)
args.add_argument(
    "--incremental", action="store_true", help="Enable incremental update."
)
args.add_argument(
    "--save_path",
    help="The format of the help doc.",
    type=str,
    default="./vectordb/docs/",
)


parsed_args = args.parse_args()


def main():
"""
Main function.
"""

indexer.create_indexer(parsed_args.app, parsed_args.docs, parsed_args.format, parsed_args.incremental, parsed_args.save_path)
indexer.create_indexer(
parsed_args.app,
parsed_args.docs,
parsed_args.format,
parsed_args.incremental,
parsed_args.save_path,
)


if __name__ == "__main__":
    main()
    main()
63 changes: 63 additions & 0 deletions model_worker/README.md
@@ -0,0 +1,63 @@
### NOTE
The lite version of the prompt is not fully optimized. To achieve better results, it is recommended that users adjust the prompt based on the performance they observe.
### If you use QWen as the Agent

1. QWen (Tongyi Qianwen) is an LLM developed by Alibaba. Go to [QWen](https://dashscope.aliyun.com/), register an account, and get the API key. More details can be found [here](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.7b5749d72j3SYU) (in Chinese).
2. Install the required package `dashscope`, or run `setup.py` with the `-qwen` option.
```bash
pip install dashscope
```
3. Add the following configuration to `config.yaml`:
```json showLineNumbers
{
"API_TYPE": "Qwen" ,
"API_KEY": "YOUR_KEY",
"API_MODEL": "YOUR_MODEL"
}
```
NOTE: `API_MODEL` is the model name of the QWen LLM API.
You can find the model name in the [QWen LLM model list](https://help.aliyun.com/zh/dashscope/developer-reference/model-square/?spm=a2c4g.11186623.0.0.35a36ffdt97ljI). A quick connectivity check is sketched below.
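
As a quick sanity check that your key and model name work, a minimal call with the DashScope SDK might look like the sketch below; the exact interface may differ across SDK versions, so consult the DashScope documentation.

```python
# Hypothetical connectivity check for a QWen model via the dashscope SDK.
import dashscope
from dashscope import Generation

dashscope.api_key = "YOUR_KEY"  # same value as API_KEY in config.yaml

response = Generation.call(
    model="YOUR_MODEL",          # same value as API_MODEL in config.yaml
    messages=[{"role": "user", "content": "Hello, QWen!"}],
    result_format="message",     # return an OpenAI-style message object
)
print(response)
```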

### If you use Ollama as the Agent
1. Go to [Ollama](https://github.com/jmorganca/ollama) and follow the instructions to serve an LLM model in your local environment.
We provide a short example below showing how to configure Ollama, which might change as Ollama is updated.

```bash title="install ollama and serve LLMs locally" showLineNumbers
## Install Ollama on Linux & WSL2, or run the `setup.py` with the `-ollama` option
curl https://ollama.ai/install.sh | sh
## Run the serving
ollama serve
```
Open another terminal and run:
```bash
ollama run YOUR_MODEL
```

***Info:*** when serving LLMs via Ollama, a server is started at `http://localhost:11434` by default, which will later be used as the API base in `config.yaml`.


2. Add the following configuration to `config.yaml`:
```json showLineNumbers
{
"API_TYPE": "Ollama" ,
"API_BASE": "YOUR_ENDPOINT",
"API_MODEL": "YOUR_MODEL"
}
```
NOTE: `API_BASE` is the URL of the Ollama server and `API_MODEL` is the name of the Ollama model; it should be the same as the one you served earlier. In addition, due to model limitations, you can use the lite version of the prompt to get a taste of UFO, which can be configured in `config_dev.yaml`. Pay attention to the ***note*** at the top. A minimal endpoint check is sketched below.
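
To verify the local endpoint before wiring it into `config.yaml`, you can hit Ollama's HTTP API directly; a minimal sketch (the model name is a placeholder, and the API may evolve with Ollama releases):

```python
# Quick check against a locally served Ollama model (placeholder model name).
import json
import urllib.request

payload = {
    "model": "YOUR_MODEL",   # the model you started with `ollama run YOUR_MODEL`
    "prompt": "Say hello in one sentence.",
    "stream": False,         # return a single JSON response instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```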

### If you use your custom model as the Agent
1. Start a server with your model, which will later be used as the API base in `config.yaml`.

2. Add the following configuration to `config.yaml`:
```json showLineNumbers
{
"API_TYPE": "custom_model" ,
"API_BASE": "YOUR_ENDPOINT",
"API_KEY": "YOUR_KEY",
"API_MODEL": "YOUR_MODEL"
}
```

NOTE: You should create a new Python script `<custom_model>.py` in the `ufo/llm` folder, following the format of `<placeholder>.py`. It needs to inherit from `BaseService` as the parent class and implement the `__init__` and `chat_completion` methods. You also need to add a dynamic import of your file in the `get_service` method of `BaseService`. A hedged skeleton is sketched below.
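
As a starting point, a custom service might look like the skeleton below; the import path, constructor arguments, and `chat_completion` signature are assumptions based on the description above, so align them with the actual `<placeholder>.py` in the code base.

```python
# ufo/llm/custom_model.py -- hypothetical skeleton; match the real placeholder.py.
from ufo.llm.base import BaseService  # assumed import path


class CustomModelService(BaseService):
    def __init__(self, config, agent_type: str):
        # Store whatever your endpoint needs (base URL, key, model name, ...).
        self.config = config
        self.agent_type = agent_type

    def chat_completion(self, messages, **kwargs):
        """Send `messages` to your endpoint (API_BASE) and return the reply."""
        # Issue an HTTP request to your own server here and return the completion
        # in the format the agents expect.
        raise NotImplementedError("Wire this up to your custom model endpoint.")
```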
