Merge pull request microsoft#70 from microsoft/pre-release
New release for v0.2.0
vyokky authored May 8, 2024
2 parents bb57247 + 4c633bc commit d29ace8
Showing 91 changed files with 10,077 additions and 2,330 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,6 +10,7 @@
__pycache__/
**/__pycache__/
*.pyc
/.VSCodeCounter

# Ignore the config file
ufo/config/config.yaml
@@ -21,8 +22,10 @@ ufo/rag/app_docs/*
learner/records.json
vectordb/docs/*
vectordb/experience/*
vectordb/demonstration/*

# Don't ignore the example files
!vectordb/docs/example/
!vectordb/demonstration/example.yaml

.vscode
35 changes: 26 additions & 9 deletions README.md
@@ -22,17 +22,22 @@

## 🕌 Framework
<b>UFO</b> <img src="./assets/ufo_blue.png" alt="UFO Image" width="24"> operates as a dual-agent framework, encompassing:
- <b>AppAgent 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>ActAgent 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, tasked with translating actions from AppAgent and ActAgent into interactions with the application and its UI controls. The targeted controls must be compatible with the Windows **UI Automation** API.
- <b>HostAgent (Previously AppAgent) 🤖</b>, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
- <b>AppAgent (Previously ActAgent) 👾</b>, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.
- <b>Control Interaction 🎮</b>, tasked with translating actions from HostAgent and AppAgent into interactions with the application and its UI controls. The targeted controls must be compatible with the Windows **UI Automation** API.

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our [technical report](https://arxiv.org/abs/2402.07939).
<h1 align="center">
<img src="./assets/framework.png"/>
<img src="./assets/framework_v2.png"/>
</h1>
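
To make the dual-agent loop concrete, here is a purely illustrative sketch of how the HostAgent and AppAgent roles fit together; every class, method, and stopping condition below is hypothetical and does not correspond to the actual UFO code.

```python
# Illustrative only: hypothetical names, not the real UFO implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    request: str                              # the user's natural-language request
    history: List[str] = field(default_factory=list)
    done: bool = False


class HostAgent:
    """Chooses (or switches) the application that should handle the request."""

    def select_application(self, task: Task) -> str:
        # In UFO this decision is made by an LLM looking at the desktop state;
        # here we simply return a placeholder application name.
        return "placeholder_app"


class AppAgent:
    """Iteratively executes actions inside the selected application."""

    def step(self, task: Task, app: str) -> None:
        # In UFO each step is an LLM-proposed UI action executed through the
        # control-interaction layer (UI Automation / Win32); this is a stand-in.
        task.history.append(f"acted on {app}")
        task.done = len(task.history) >= 3    # toy stopping condition


def run(request: str) -> Task:
    task = Task(request)
    host, app_agent = HostAgent(), AppAgent()
    while not task.done:
        app = host.select_application(task)   # HostAgent picks or switches the app
        app_agent.step(task, app)             # AppAgent acts within that app
    return task


if __name__ == "__main__":
    print(run("Summarize the open document").history)
```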


## 📢 News
- 📅 2024-05-08: **New Release for v0.1.1!** We've made some significant updates! The agents previously known as AppAgent and ActAgent have been rebranded as HostAgent and AppAgent to better align with their functionalities. Explore the latest enhancements:
1. **Learning from Human Demonstration:** UFO now supports learning from human demonstration! Utilize the [Windows Step Recorder](https://support.microsoft.com/en-us/windows/record-steps-to-reproduce-a-problem-46582a9b-620f-2e36-00c9-04e25d784e47) to record your steps and demonstrate them for UFO. Refer to our detailed guide in [README.md](/record_processor/README.md) for more information.
2. **Win32 Support:** We've incorporated support for [Win32](https://learn.microsoft.com/en-us/windows/win32/controls/window-controls) as a control backend, enhancing our UI automation capabilities.
3. **Extended Application Interaction:** UFO now goes beyond UI controls, allowing interaction with your application through keyboard inputs and native APIs! Presently, we support Word ([examples](/ufo/prompts/apps/word/api.yaml)), with more to come soon. Customize and build your own interactions.
4. **Control Filtering:** Streamline LLM's action process by using control filters to remove irrelevant control items. Enable them in [config_dev.yaml](/ufo/config/config_dev.yaml) under the `control filtering` section at the bottom.
- 📅 2024-03-25: **New Release for v0.0.1!** Check out our exciting new features:
  1. We now support creating your own help documents for each Windows application, turning UFO into an app expert. Check the [README](./learner/README.md) for more details!
2. UFO now supports RAG from offline documents and online Bing search.
@@ -80,10 +85,11 @@ git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt
# If you want to use Qwen as your LLM, uncomment the related libraries.
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individually for AppAgent and ActAgent**. You can create your own config file `ufo/config/config.yaml` by copying `ufo/config/config.yaml.template` and editing the config for **APP_AGENT** and **ACTION_AGENT** as follows:
Before running UFO, you need to provide your LLM configurations **individually for HostAgent and AppAgent**. You can create your own config file `ufo/config/config.yaml` by copying `ufo/config/config.yaml.template` and editing the config for **APP_AGENT** and **ACTION_AGENT** as follows:

#### OpenAI
```bash
@@ -105,17 +111,19 @@ API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input
API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API
```
You can also use a non-visual model (e.g., GPT-4) for each agent by setting `VISUAL_MODE: True` and a proper `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI). You can also optionally set a backup LLM engine in the `BACKUP_AGENT` field in case the above engines fail during inference.
You can also use a non-visual model (e.g., GPT-4) for each agent by setting `VISUAL_MODE: False` and a proper `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI). You can also optionally set a backup LLM engine in the `BACKUP_AGENT` field in case the above engines fail during inference.


#### Non-Visual Model Configuration
You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the config.yaml file:
You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the `config.yaml` file:

- ```VISUAL_MODE: False # To enable non-visual mode.```
- Specify the appropriate `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI) for each agent.

Optionally, you can set a backup language model (LLM) engine in the `BACKUP_AGENT` field to handle cases where the primary engines fail during inference. Ensure you configure these settings accurately to leverage non-visual models effectively.
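
As a rough illustration of how a backup engine can be used, the sketch below tries a primary completion call and falls back to a second one on failure; the callables and the message format are placeholders for whatever engines you configure, not UFO's actual dispatch code.

```python
# Hypothetical sketch of primary/backup LLM dispatch; not UFO's actual code.
from typing import Callable, Dict, List

Message = Dict[str, str]


def complete_with_fallback(
    messages: List[Message],
    primary: Callable[[List[Message]], str],
    backup: Callable[[List[Message]], str],
) -> str:
    """Try the primary engine first; fall back to a BACKUP_AGENT-style engine on error."""
    try:
        return primary(messages)
    except Exception as exc:  # e.g., rate limit, timeout, deployment error
        print(f"Primary engine failed ({exc}); falling back to the backup engine.")
        return backup(messages)
```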

#### NOTE
💡 UFO also supports other LLMs and advanced configurations, such as customizing your own model; please check the [documents](./model_worker/readme.md) for more details. Because of model input limitations, a lite version of the prompt is provided so that users can try it out, which is configured in `config_dev.yaml`.

### 📔 Step 3: Additional Setting for RAG (optional).
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.
@@ -157,6 +165,15 @@ RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the offline retrieved documents
```

#### RAG from User-Demonstration
Boost UFO's capabilities through user demonstration! Utilize Microsoft Steps Recorder to record step-by-step processes for achieving specific tasks. With a simple command processed by the record_processor (refer to the [README](./record_processor/README.md)), UFO can store these trajectories in its memory for future reference, enhancing its learning from user interactions.

You can enable this function by setting the following configuration:
```bash
## RAG Configuration for demonstration
RAG_DEMONSTRATION: True # Whether to use the RAG from its user demonstration.
RAG_DEMONSTRATION_RETRIEVED_TOPK: 5 # The topk for the demonstration examples.
```


### 🎉 Step 4: Start UFO
@@ -232,7 +249,7 @@ Please consult the [WindowsBench](https://arxiv.org/pdf/2402.07939.pdf) provided


## 📚 Citation
Our technical report paper can be found [here](https://arxiv.org/abs/2402.07939).
Our technical report paper can be found [here](https://arxiv.org/abs/2402.07939). Note that the AppAgent and ActAgent from the paper have been renamed to HostAgent and AppAgent in the code base to better reflect their functions.
If you use UFO in your research, please cite our paper:
```
@article{ufo,
@@ -245,9 +262,9 @@

## 📝 Todo List
- [x] RAG enhanced UFO.
- [x] Support more control using Win32 API.
- [ ] Documentation.
- [ ] Support local host GUI interaction model.
- [ ] Support more control using Win32 API.
- [ ] Chatbox GUI for UFO.


Binary file added assets/framework_v2.png
Binary file added assets/record_processor/add_comment.png
11 changes: 11 additions & 0 deletions learner/README.md
@@ -30,3 +30,14 @@ Replace `app_name` with the name of the application, such as PowerPoint or WeCha
Replace `path_of_the_docs` with the full path to the folder containing all your documents.

This command will create an offline indexer for all documents in the `path_of_the_docs` folder using FAISS and sentence-transformer embeddings (more embedding models will be supported soon). By default, the created index will be placed [here](../vectordb/docs/).
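
For reference, the same indexer can also be invoked programmatically from the repository root, mirroring what `learner/learn.py` does; the application name and document path below are placeholders.

```python
# Build an offline FAISS index for an app's help documents (placeholder values).
from learner import indexer

db_path = indexer.create_indexer(
    app="PowerPoint",              # name of the application
    docs="path_of_the_docs",       # folder containing the help documents
    format="xml",                  # currently the supported format
    incremental=False,             # set True to merge with an existing index
    save_path="./vectordb/docs/",  # where the FAISS index is stored
)
print(f"Index saved to {db_path}")
```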



## How to Enable RAG from Help Documents during Online Inference ❓
To enable this in online inference, you can set the following configuration in the `ufo/config/config.yaml` file:
```bash
## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: True # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
```
Adjust `RAG_OFFLINE_DOCS_RETRIEVED_TOPK` to optimize performance.
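
Under the hood, online retrieval amounts to loading the saved FAISS index and querying it with the configured top-k. A minimal sketch using the same `langchain_community` classes as `learner/indexer.py` (the index path and query are placeholders):

```python
# Load a saved index and retrieve the top-k most relevant help-document chunks.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
db = FAISS.load_local("./vectordb/docs/PowerPoint", embeddings)  # placeholder path

top_k = 1  # mirrors RAG_OFFLINE_DOCS_RETRIEVED_TOPK
results = db.similarity_search("How do I insert a chart?", k=top_k)
for doc in results:
    print(doc.page_content)
```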
6 changes: 0 additions & 6 deletions learner/basic.py
@@ -16,15 +16,13 @@ def __init__(self, extensions: str = None, directory: str = None):
        self.extensions = extensions
        self.directory = directory


    def load_file_name(self):
        """
        Load the documents from the given directory.
        :param directory: The directory to load from.
        :return: The list of loaded documents.
        """
        return utils.find_files_with_extension(self.directory, self.extensions)


    def construct_document_list(self):
        """
@@ -33,7 +31,3 @@ def construct_document_list(self):
        :return: The list of metadata for the loaded documents.
        """
        pass




30 changes: 18 additions & 12 deletions learner/indexer.py
@@ -6,8 +6,8 @@
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"


def create_indexer(app: str, docs: str, format: str, incremental: bool, save_path: str):
@@ -31,35 +31,41 @@ def create_indexer(app: str, docs: str, format: str, incremental: bool, save_pat
    loader = xml_loader.XMLLoader(docs)
    documents = loader.construct_document()

    print_with_color("Creating indexer for {num} documents for {app}...".format(num=len(documents), app=app), "yellow")
    print_with_color(
        "Creating indexer for {num} documents for {app}...".format(
            num=len(documents), app=app
        ),
        "yellow",
    )

    if format == "xml":
        embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2"
        )
    else:
        raise ValueError("Invalid format: " + format)

    db = FAISS.from_documents(documents, embeddings)

    if incremental:
        if app in records:
            print_with_color("Merging with previous indexer...", "yellow")
            prev_db = FAISS.load_local(records[app], embeddings)
            db.merge_from(prev_db)

    db_file_path = os.path.join(save_path, app)
    db_file_path = os.path.abspath(db_file_path)
    db.save_local(db_file_path)

    records[app] = db_file_path


    save_json_file("./learner/records.json", records)

    print_with_color("Indexer for {app} created successfully. Save in {path}.".format(app=app, path=db_file_path), "green")
    print_with_color(
        "Indexer for {app} created successfully. Save in {path}.".format(
            app=app, path=db_file_path
        ),
        "green",
    )

    return db_file_path





43 changes: 27 additions & 16 deletions learner/learn.py
@@ -5,32 +5,43 @@
from . import indexer



# configs = load_config()

args = argparse.ArgumentParser()
args.add_argument("--app", help="The name of application to learn.",
type=str, default="./")
args.add_argument("--docs", help="The help application of the app.", type=str,
default="./")
args.add_argument("--format", help="The format of the help doc.", type=str,
default="xml")
args.add_argument('--incremental', action='store_true', help='Enable incremental update.')
args.add_argument("--save_path", help="The format of the help doc.", type=str,
default="./vectordb/docs/")


args.add_argument(
    "--app", help="The name of application to learn.", type=str, default="./"
)
args.add_argument(
    "--docs", help="The help application of the app.", type=str, default="./"
)
args.add_argument(
    "--format", help="The format of the help doc.", type=str, default="xml"
)
args.add_argument(
    "--incremental", action="store_true", help="Enable incremental update."
)
args.add_argument(
    "--save_path",
    help="The format of the help doc.",
    type=str,
    default="./vectordb/docs/",
)


parsed_args = args.parse_args()


def main():
"""
Main function.
"""

indexer.create_indexer(parsed_args.app, parsed_args.docs, parsed_args.format, parsed_args.incremental, parsed_args.save_path)
indexer.create_indexer(
parsed_args.app,
parsed_args.docs,
parsed_args.format,
parsed_args.incremental,
parsed_args.save_path,
)


if __name__ == "__main__":
    main()
    main()
63 changes: 63 additions & 0 deletions model_worker/README.md
@@ -0,0 +1,63 @@
### NOTE
The lite version of the prompt is not fully optimized. To achieve better results, it is recommended that users adjust the prompt based on the performance they observe.
### If you use QWen as the Agent

1. QWen (Tongyi Qianwen) is an LLM developed by Alibaba. Go to [QWen](https://dashscope.aliyun.com/), register an account, and get the API key. More details can be found [here](https://help.aliyun.com/zh/dashscope/developer-reference/activate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.0.7b5749d72j3SYU) (in Chinese).
2. Install the required package `dashscope`, or run `setup.py` with the `-qwen` option.
```bash
pip install dashscope
```
3. Add the following configuration to `config.yaml`:
```json showLineNumbers
{
"API_TYPE": "Qwen" ,
"API_KEY": "YOUR_KEY",
"API_MODEL": "YOUR_MODEL"
}
```
NOTE: `API_MODEL` is the model name of the QWen LLM API.
You can find the model name in the [QWen LLM model list](https://help.aliyun.com/zh/dashscope/developer-reference/model-square/?spm=a2c4g.11186623.0.0.35a36ffdt97ljI). A quick connectivity check is sketched below.
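
As a quick sanity check that your key and model name work, a minimal call with the DashScope SDK might look like the sketch below; the exact interface may differ across SDK versions, so consult the DashScope documentation.

```python
# Hypothetical connectivity check for a QWen model via the dashscope SDK.
import dashscope
from dashscope import Generation

dashscope.api_key = "YOUR_KEY"  # same value as API_KEY in config.yaml

response = Generation.call(
    model="YOUR_MODEL",          # same value as API_MODEL in config.yaml
    messages=[{"role": "user", "content": "Hello, QWen!"}],
    result_format="message",     # return an OpenAI-style message object
)
print(response)
```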

### If you use Ollama as the Agent
1. Go to [Ollama](https://github.com/jmorganca/ollama) and follow the instructions to serve an LLM model in your local environment.
We provide a short example below showing how to configure Ollama, which might change as Ollama is updated.

```bash title="install ollama and serve LLMs locally" showLineNumbers
## Install Ollama on Linux & WSL2, or run the `setup.py` with the `-ollama` option
curl https://ollama.ai/install.sh | sh
## Run the serving
ollama serve
```
Open another terminal and run:
```bash
ollama run YOUR_MODEL
```

***Info:*** when serving LLMs via Ollama, a server is started at `http://localhost:11434` by default, which will later be used as the API base in `config.yaml`.


2. Add the following configuration to `config.yaml`:
```json showLineNumbers
{
"API_TYPE": "Ollama" ,
"API_BASE": "YOUR_ENDPOINT",
"API_MODEL": "YOUR_MODEL"
}
```
NOTE: `API_BASE` is the URL of the Ollama server and `API_MODEL` is the name of the Ollama model; it should be the same as the one you served earlier. In addition, due to model limitations, you can use the lite version of the prompt to get a taste of UFO, which can be configured in `config_dev.yaml`. Pay attention to the ***note*** at the top. A minimal endpoint check is sketched below.
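
To verify the local endpoint before wiring it into `config.yaml`, you can hit Ollama's HTTP API directly; a minimal sketch (the model name is a placeholder, and the API may evolve with Ollama releases):

```python
# Quick check against a locally served Ollama model (placeholder model name).
import json
import urllib.request

payload = {
    "model": "YOUR_MODEL",   # the model you started with `ollama run YOUR_MODEL`
    "prompt": "Say hello in one sentence.",
    "stream": False,         # return a single JSON response instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```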

### If you use your custom model as the Agent
1. Start a server with your model, which will later be used as the API base in `config.yaml`.

2. Add the following configuration to `config.yaml`:
```json showLineNumbers
{
"API_TYPE": "custom_model" ,
"API_BASE": "YOUR_ENDPOINT",
"API_KEY": "YOUR_KEY",
"API_MODEL": "YOUR_MODEL"
}
```

NOTE: You should create a new Python script `<custom_model>.py` in the `ufo/llm` folder, following the format of `<placeholder>.py`. It needs to inherit from `BaseService` as the parent class and implement the `__init__` and `chat_completion` methods. You also need to add a dynamic import of your file in the `get_service` method of `BaseService`. A hedged skeleton is sketched below.
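
As a starting point, a custom service might look like the skeleton below; the import path, constructor arguments, and `chat_completion` signature are assumptions based on the description above, so align them with the actual `<placeholder>.py` in the code base.

```python
# ufo/llm/custom_model.py -- hypothetical skeleton; match the real placeholder.py.
from ufo.llm.base import BaseService  # assumed import path


class CustomModelService(BaseService):
    def __init__(self, config, agent_type: str):
        # Store whatever your endpoint needs (base URL, key, model name, ...).
        self.config = config
        self.agent_type = agent_type

    def chat_completion(self, messages, **kwargs):
        """Send `messages` to your endpoint (API_BASE) and return the reply."""
        # Issue an HTTP request to your own server here and return the completion
        # in the format the agents expect.
        raise NotImplementedError("Wire this up to your custom model endpoint.")
```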
