Merge pull request microsoft#38 from microsoft/pre-release
release v0.0.1
vyokky authored Mar 25, 2024
2 parents 82c41d8 + 1db2fe4 commit 0f0a1c7
Showing 51 changed files with 4,726 additions and 897 deletions.
20 changes: 20 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,8 +1,28 @@
# Ignore login file
*.bin

# Ignore Jupyter Notebook checkpoints
.ipynb_checkpoints
/test/*
/deprecated/*
/test/*.ipynb
/logs/*
__pycache__/
**/__pycache__/
*.pyc

# Ignore the config file
ufo/config/config.yaml
ufo/config/config_llm.yaml


# Ignore the helper files
ufo/rag/app_docs/*
learner/records.json
vectordb/docs/*
vectordb/experience/*

# Don't ignore the example files
!vectordb/docs/example/

.vscode
102 changes: 84 additions & 18 deletions README.md
@@ -33,29 +33,38 @@ Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend th


## 📢 News
- 📅 2024-03-25: **New Release for v0.0.1!** Check out our exciting new features:
  1. We now support creating help documents for each Windows application, turning UFO into an app expert. Check the [README](./learner/README.md) for more details!
2. UFO now supports RAG from offline documents and online Bing search.
3. You can save the task completion trajectory into its memory for UFO's reference, improving its future success rate!
4. You can customize different GPT models for AppAgent and ActAgent. Text-only models (e.g., GPT-4) are now supported!
- 📅 2024-02-14: Our [technical report](https://arxiv.org/abs/2402.07939) is online!
- 📅 2024-02-10: UFO is released on GitHub🎈. Happy Chinese New year🐉!


## 🌐 Media Coverage

UFO sightings have garnered attention from various media outlets, including:
- [Microsoft's UFO abducts traditional user interfaces for a smarter Windows experience](https://the-decoder.com/microsofts-ufo-abducts-traditional-user-interfaces-for-a-smarter-windows-experience/)
- [🚀 UFO & GPT-4-V: Sit back and relax, mientras GPT lo hace todo🌌](https://www.linkedin.com/posts/gutierrezfrancois_ai-ufo-microsoft-activity-7176819900399652865-pLoo?utm_source=share&utm_medium=member_desktop)
- [The AI PC - The Future of Computers? - Microsoft UFO](https://www.youtube.com/watch?v=1k4LcffCq3E)
- [下一代Windows系统曝光:基于GPT-4V,Agent跨应用调度,代号UFO](https://www.qbitai.com/2024/02/121048.html)
- [下一代智能版 Windows 要来了?微软推出首个 Windows Agent,命名为 UFO!](https://blog.csdn.net/csdnnews/article/details/136161570)
- [Microsoft発のオープンソース版「UFO」登場! Windowsを自動操縦するAIエージェントを試す](https://internet.watch.impress.co.jp/docs/column/shimizu/1570581.html)
- ...

These sources provide insights into the evolving landscape of technology and the implications of UFO phenomena on various platforms.


## 💥 Highlights

- [x] **First Windows Agent** - UFO is the pioneering agent framework capable of translating user requests in natural language into actionable operations on Windows OS.
- [x] **RAG Enhanced** - UFO is enhanced by Retrieval Augmented Generation (RAG) from heterogeneous sources, including offline help documents and online search engines, to strengthen its capabilities.
- [x] **Interactive Mode** - UFO facilitates multiple sub-requests from users within the same session, enabling the completion of complex tasks seamlessly.
- [x] **Action Safeguard** - UFO incorporates safeguards to prompt user confirmation for sensitive actions, enhancing security and preventing inadvertent operations.
- [x] **Easy Extension** - UFO offers extensibility, allowing for the integration of additional functionalities and control types to tackle diverse and intricate tasks with ease.



## ✨ Getting Started


@@ -74,26 +83,83 @@ pip install -r requirements.txt
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individually for AppAgent and ActAgent**. Create your own config file `ufo/config/config.yaml` by copying `ufo/config/config.yaml.template` and editing the **APP_AGENT** and **ACTION_AGENT** sections as follows:

#### OpenAI
```bash
VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "openai", # The API type, "openai" for the OpenAI API
API_BASE: "https://api.openai.com/v1/chat/completions", # The OpenAI API endpoint
API_KEY: "sk-", # The OpenAI API key, which begins with "sk-"
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # Currently the only OpenAI model that accepts visual input
```

#### Azure OpenAI (AOAI)
```bash
VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "aoai", # The API type, "aoai" for the Azure OpenAI API
API_BASE: "YOUR_ENDPOINT", # The AOAI API address. Format: https://{your-resource-name}.openai.azure.com
API_KEY: "YOUR_KEY", # The AOAI API key
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # Currently the only OpenAI model that accepts visual input
API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment ID for the AOAI API
```
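Once both agent sections are filled in, the file is ordinary YAML. As an illustration of how such a config could be consumed, here is a hedged sketch — `load_agent_config` and the validation logic are assumptions for this example, not UFO's actual loader:

```python
import yaml  # PyYAML


def load_agent_config(path: str, agent: str) -> dict:
    """Read the YAML config file and return the section for one agent."""
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    agent_cfg = config[agent]
    # Basic sanity checks on the fields described above.
    assert agent_cfg["API_TYPE"] in ("openai", "aoai"), "unknown API_TYPE"
    if agent_cfg["API_TYPE"] == "aoai":
        assert "API_DEPLOYMENT_ID" in agent_cfg, "AOAI requires a deployment ID"
    return agent_cfg
```

A loader like this would be called once per agent (e.g., for `APP_AGENT` and `ACTION_AGENT`) so each can point at a different model.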


#### Non-Visual Model Configuration
You can utilize non-visual models (e.g., GPT-4) for each agent by configuring the following settings in the config.yaml file:

- `VISUAL_MODE: False` # To enable non-visual mode.
- Specify the appropriate `API_MODEL` (OpenAI) and `API_DEPLOYMENT_ID` (AOAI) for each agent.

Optionally, you can set a backup LLM engine in the `BACKUP_AGENT` field to handle cases where the primary engine fails during inference. Ensure you configure these settings accurately to leverage non-visual models effectively.
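The fallback behavior behind `BACKUP_AGENT` can be pictured as a try/except around the primary engine. A minimal sketch — the function and engine callables here are hypothetical, not UFO's actual interfaces:

```python
def complete_with_fallback(prompt: str, primary, backup):
    """Call the primary LLM engine; on any failure, retry once with the backup."""
    try:
        return primary(prompt)
    except Exception:
        # Primary engine failed (e.g., rate limit or outage): fall back.
        return backup(prompt)
```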


### 📔 Step 3: Additional Settings for RAG (optional)
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.

#### RAG from Offline Help Document
Before enabling this function, you need to create an offline indexer for your help document. Please refer to the [README](./learner/README.md) to learn how to create an offline vectored database for retrieval. You can enable this function by setting the following configuration:
```bash
## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: True # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
```
Adjust `RAG_OFFLINE_DOCS_RETRIEVED_TOPK` to optimize performance.


#### RAG from Online Bing Search Engine
Enhance UFO's ability by utilizing the most up-to-date online search results! To use this function, you need to obtain a Bing search API key. Activate this feature by setting the following configuration:
```bash
## RAG Configuration for the Bing search
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: True # Whether to use the online search for the RAG.
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents
```
Adjust `RAG_ONLINE_SEARCH_TOPK` and `RAG_ONLINE_RETRIEVED_TOPK` to get better performance.
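Under the hood, this step amounts to calling the Bing Web Search v7 REST API and keeping the top-k result snippets. A hedged sketch — the endpoint and `Ocp-Apim-Subscription-Key` header follow Bing's public API, but the function names are assumptions, not UFO's actual code:

```python
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"


def build_search_request(query: str, api_key: str, topk: int):
    """Build the (headers, params) pair for a Bing web search call."""
    headers = {"Ocp-Apim-Subscription-Key": api_key}
    params = {"q": query, "count": topk}
    return headers, params


def search_web_snippets(query: str, api_key: str, topk: int = 5):
    """Fetch the top-k result snippets; requires a valid Bing API key."""
    import requests  # lazy import so the sketch loads without the dependency
    headers, params = build_search_request(query, api_key, topk)
    resp = requests.get(BING_ENDPOINT, headers=headers, params=params, timeout=10)
    resp.raise_for_status()
    pages = resp.json().get("webPages", {}).get("value", [])
    return [page["snippet"] for page in pages]
```

The retrieved snippets would then be re-ranked down to `RAG_ONLINE_RETRIEVED_TOPK` documents before being fed to the agent.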


#### RAG from Self-Demonstration
Save task completion trajectories into UFO's memory for future reference. This can improve its future success rates based on its previous experiences!

After completing a task, you'll see the following message:
```
Would you like to save the current conversation flow for future reference by the agent?
[Y] for yes, any other key for no.
```
Press `Y` to save it into its memory and enable memory retrieval via the following configuration:
```bash
## RAG Configuration for experience
RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the offline retrieved documents
```



### 🎉 Step 4: Start UFO

#### ⌨️ You can execute the following on your Windows Command Line (CLI):

@@ -119,7 +185,7 @@ Please enter your request to be completed🛸:
- GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to [DISCLAIMER.md](./DISCLAIMER.md).


### Step 5 🎥: Execution Logs

You can find the screenshots taken and request & response logs in the following folder:
```
@@ -178,11 +244,11 @@ If you use UFO in your research, please cite our paper:
```

## 📝 Todo List
- [x] RAG enhanced UFO.
- [ ] Documentation.
- [ ] Support local host GUI interaction model.
- [ ] Support more control using Win32 API.
- [ ] Chatbox GUI for UFO.



32 changes: 32 additions & 0 deletions learner/README.md
@@ -0,0 +1,32 @@

# Enhancing UFO with RAG using Offline Help Documents


## How to Prepare Your Help Documents ❓

### Step 1: Prepare Your Help Doc and Metadata

UFO currently supports processing help documents in XML format, as this is the default format for official help documents of Microsoft apps. More formats will be supported in the future.

You can write a dedicated document for a specific task of an app in a file named, for example, `task.xml`. Note that it should be accompanied by a metadata file with the same prefix, but with the `.meta` extension, i.e., `task.xml.meta`. This metadata file should have a `title` describing the task at a high level and a `Content-Summary` field summarizing the content of the help document. These two files are used for similarity search with user requests, so please write them carefully. The [ppt-copilot.xml](./doc_example/ppt-copilot.xml) and [ppt-copilot.xml.meta](./doc_example/ppt-copilot.xml.meta) are examples of a help document and its metadata.

### Step 2: Prepare Your Help Document Set

Once you have all help documents and metadata ready, put all of them into a folder. There can be sub-folders for the help documents, but please ensure that each help document and its corresponding metadata **are placed in the same directory**.
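As an illustration, a valid layout might look like this (folder and file names are hypothetical):

```
path_of_the_docs/
├── powerpoint/
│   ├── create_slide.xml
│   └── create_slide.xml.meta
└── word/
    ├── insert_table.xml
    └── insert_table.xml.meta
```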


## How to Create an Indexer for Your Help Document Set ❓


Once you have all documents ready in a folder named `path_of_the_docs`, you can easily create an offline indexer to support RAG for UFO. Follow these steps:

```console
# assume you are in the cloned UFO folder
python -m learner --app <app_name> --docs <path_of_the_docs>
```
Replace `app_name` with the name of the application, such as PowerPoint or WeChat, and `path_of_the_docs` with the full path to the folder containing all your documents.
> Note: Ensure the `app_name` is accurately defined, as it is used to match the offline indexer during online RAG.

This command will create an offline indexer for all documents in the `path_of_the_docs` folder using Faiss, with embeddings from a sentence transformer (more embedding models will be supported soon). By default, the created index will be placed [here](../vectordb/docs/).
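Conceptually, the indexer maps each document to an embedding vector, and retrieval returns the documents nearest to an embedded query. The sketch below substitutes plain NumPy cosine similarity for the actual Faiss + sentence-transformer pipeline, just to make the idea concrete; all names are assumptions:

```python
import numpy as np


def build_index(doc_vectors: np.ndarray) -> np.ndarray:
    """Normalize document embeddings so dot products equal cosine similarity."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return doc_vectors / np.clip(norms, 1e-12, None)


def retrieve_topk(index: np.ndarray, query_vec: np.ndarray, k: int = 1) -> list:
    """Return the indices of the k documents most similar to the query."""
    q = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    scores = index @ q
    return np.argsort(-scores)[:k].tolist()
```

In the real pipeline, the embeddings come from a sentence-transformer model and Faiss replaces the brute-force dot product for scalable search.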
2 changes: 2 additions & 0 deletions learner/__init__.py
@@ -0,0 +1,2 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
8 changes: 8 additions & 0 deletions learner/__main__.py
@@ -0,0 +1,8 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

from . import learn

if __name__ == "__main__":
# Execute the main script
learn.main()
39 changes: 39 additions & 0 deletions learner/basic.py
@@ -0,0 +1,39 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
from . import utils


class BasicDocumentLoader:
    """
    A class to load documents from a list of files with a given extension list.
    """

    def __init__(self, extensions: str = None, directory: str = None):
        """
        Create a new BasicDocumentLoader.
        :param extensions: The file extensions to load.
        :param directory: The directory to load from.
        """
        self.extensions = extensions
        self.directory = directory

    def load_file_name(self):
        """
        Load the document file names from the configured directory.
        :return: The list of matching file paths.
        """
        return utils.find_files_with_extension(self.directory, self.extensions)

    def construct_document_list(self):
        """
        Construct the list of documents with their metadata.
        :return: The list of loaded documents with metadata.
        """
        pass



