init commit

Optimose · Feb 6, 2024 · bb935bc · bb935bc
1 parent 7e08311
commit bb935bc
Showing 26 changed files with 2,012 additions and 432 deletions.
diff --git a/.gitignore b/.gitignore
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,14 @@
+# Contributing
+
+This project welcomes contributions and suggestions. Most contributions require you to
+agree to a Contributor License Agreement (CLA) declaring that you have the right to,
+and actually do, grant us the rights to use your contribution. For details, visit
+https://cla.microsoft.com.
+
+When you submit a pull request, a CLA-bot will automatically determine whether you need
+to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the
+instructions provided by the bot. You will only need to do this once across all repositories using our CLA.
+
+This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
+For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
+or contact [[email protected]](mailto:[email protected]) with any additional questions or comments.
diff --git a/DISCLAIMER.md b/DISCLAIMER.md
@@ -0,0 +1,35 @@
+# Disclaimer: Code Execution and Data Handling Notice
+
+By choosing to run the provided code, you acknowledge and agree to the following terms and conditions regarding the functionality and data handling practices:
+
+## 1. Code Functionality:
+The code you are about to execute has the capability to capture screenshots of your working desktop environment and active applications. These screenshots will be processed and sent to the GPT model for inference.
+
+## 2. Data Transmission:
+Upon execution, the captured screenshots will be transmitted to external servers hosting the GPT model. This transmission is necessary for the inference process to generate relevant outputs based on the visual information provided.
+
+## 3. Data Privacy and Storage:
+It is crucial to note that Microsoft, the provider of this code, explicitly states that it does not collect or save any of the transmitted data. The captured screenshots are processed in real-time for the purpose of inference, and no permanent storage or record of this data is retained by Microsoft.
+
+## 4. User Responsibility:
+By running the code, you understand and accept the responsibility for the content and nature of the data present on your desktop during the execution period. It is your responsibility to ensure that no sensitive or confidential information is visible or captured during this process.
+
+## 5. Security Measures:
+Microsoft has implemented security measures to safeguard the data transmission process. However, it is recommended that you run the code in a secure and controlled environment to minimize potential risks. Ensure that you are running the latest security updates on your system.
+
+## 6. Consent for Inference:
+You explicitly provide consent for the GPT model to analyze the captured screenshots for the purpose of generating relevant outputs. This consent is inherent in the act of executing the code.
+
+## 7. No Guarantee of Accuracy:
+The outputs generated by the GPT model are based on patterns learned during training and may not always be accurate or contextually relevant. Microsoft does not guarantee the accuracy or suitability of the inferences made by the model.
+
+## 8. Indemnification:
+Users agree to defend, indemnify, and hold Microsoft harmless from and against all damages, costs, and attorneys' fees in connection with any claims arising from the use of this Repo.
+
+## 9. Reporting Infringements:
+If anyone believes that this Repo infringes on their rights, please notify the project owner via the provided project owner email. Microsoft will investigate and take appropriate actions as necessary.
+
+## 10. Modifications to the Disclaimer:
+Microsoft reserves the right to update or modify this disclaimer at any time without prior notice. It is your responsibility to review the disclaimer periodically for any changes.
+
+By proceeding to execute the code, you acknowledge that you have read, understood, and agreed to the terms outlined in this disclaimer. If you do not agree with these terms, refrain from running the provided code.
diff --git a/LICENSE b/LICENSE
@@ -1,21 +1,21 @@
-    MIT License
+Copyright (c) Microsoft Corporation.
 
-    Copyright (c) Microsoft Corporation.
+MIT License
 
-    Permission is hereby granted, free of charge, to any person obtaining a copy
-    of this software and associated documentation files (the "Software"), to deal
-    in the Software without restriction, including without limitation the rights
-    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-    copies of the Software, and to permit persons to whom the Software is
-    furnished to do so, subject to the following conditions:
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
 
-    The above copyright notice and this permission notice shall be included in all
-    copies or substantial portions of the Software.
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
 
-    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-    SOFTWARE
+THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -1,33 +1,130 @@
-# Project
+<!-- <h1 align="center">
+    UFO<img src="./assets/ufo.png" width="40px"/> :A <strong>U</strong>I-<strong>F</strong>ocused Multimodal Agent for Windows <strong>O</strong>S
+</h1> -->
 
-> This repo has been populated by an initial template to help get you started. Please
-> make sure to update the content to build a great experience for community-building.
+# **UFO** ![ufo](./assets/ufo_blue.png =x24): A **U**I-**F**ocused Agent for Windows **O**S Interaction
 
-As the maintainer of this project, please make a few updates:
+<div align="center">
 
-- Improving this README.MD file to provide a great experience
-- Updating SUPPORT.MD with content about this project's support experience
-- Understanding the security reporting process in SECURITY.MD
-- Remove this section from the README
+![Python Version](https://img.shields.io/badge/Python-3776AB?&logo=python&logoColor=white-blue&label=3.10%20%7C%203.11)&ensp;
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)&ensp;
+![Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)
 
-## Contributing
+</div>
 
-This project welcomes contributions and suggestions.  Most contributions require you to agree to a
-Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
-the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+**UFO** is a **UI-Focused** dual-agent framework that fulfills user requests on **Windows OS** by navigating and operating within a single application or across multiple applications.
 
-When you submit a pull request, a CLA bot will automatically determine whether you need to provide
-a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
-provided by the bot. You will only need to do this once across all repos using our CLA.
+<h1 align="center">
+    <img src="./assets/overview.png"/> 
+</h1>
+
+
+## 🆕 News
+- 📅 2024-02-30 UFO is released on GitHub🎈.
+
+
+## 💥 Highlights
+
+- [x] **First Windows Agent Framework** - UFO is the first agent framework that can translate user requests in natural language into grounded operations on Windows OS.
+- [x] **Interactive Mode** - UFO allows multiple sub-requests from users in the same session to complete complex tasks.
+- [x] **Action Safeguard** - UFO supports a safeguard that prompts for user confirmation when an action is sensitive.
+- [x] **Easy Extension** - UFO is easy to extend to accomplish more complex tasks with different operations.
+
+
+## ✨ Getting Started
+
+
+### 🛠️ Step 1: Installation
+UFO requires **Python >= 3.10** running on **Windows OS >= 10**. It can be installed by running the following commands:
+```bash
+# [optional] create a conda environment
+# conda create -n ufo python=3.10
+# conda activate ufo
+
+# clone the repository
+git clone https://github.com/microsoft/UFO.git
+cd UFO
+# install the requirements
+pip install -r requirements.txt
+```
+
+### 🖊️ Step 2: Configure the LLMs
+Before running UFO, you need to provide your LLM configurations. Taking OpenAI as an example, you can configure the `ufo/config/config.yaml` file as follows.
+
+#### OpenAI
+```
+OPENAI_API_BASE: Your OpenAI Endpoint # The base URL for the OpenAI API
+OPENAI_API_KEY: Your OpenAI Key  # Set the value to sk-xxx if you host an OpenAI-compatible interface for an open LLM
+OPENAI_API_MODEL: GPT Model Name  # Currently the only OpenAI model that accepts visual input
+```
+
+### 🚩 Step 3: Start UFO
+
+#### ⌨️ Command Line (CLI)
+
+```bash
+# assume you are in the cloned UFO folder
+python -m ufo --task <your_task_name>
+```
+
+This will start the UFO process and you can interact with it through the command line interface. 
+If everything goes well, you will see the following message:
+
+```bash
+Welcome to use UFO🛸, A UI-focused Multimodal Agent for Windows OS. 
+ _   _  _____   ___
+| | | ||  ___| / _ \
+| | | || |_   | | | |
+| |_| ||  _|  | |_| |
+ \___/ |_|     \___/
+Please enter your request to be completed🛸:
+```
+#### <**Reminder: Before inputting your request, please make sure the targeted applications are active on the system.**>
+
+
+### 🎥 Step 4: Execution Logs
+
+You can find the captured screenshots and the request and response logs in the following folder:
+```
+ufo/logs/<your_task_name>/
+```
+You may use them to debug, replay, or analyze the agent output.
+
+
+## ❓Get help 
+* ❔GitHub Issues (preferred)
+* For other communications, please contact [email protected]
+---
+
+## 🎬 Demo Examples
+
+We present two demo videos in which UFO completes user requests on Windows OS.
+
+#### 1️⃣🗑️ Example 1: Deleting all notes on a PowerPoint presentation.
+In this example, we will show you how to use UFO to delete all notes on a PowerPoint presentation in just a few simple steps. Explore it to work smarter, not harder!
+
+
+#### 2️⃣📧 Example 2: Composing an email using text from multiple sources.
+In this example, we will show you how to use UFO to gather text from Word documents and a description of an image, then compose and send an email. Enjoy your cross-application experiment with UFO!
+
+
+## 📚 Citation
+Our paper can be found [here](http://export.arxiv.org/abs/2311.17541).
+If you use UFO in your research, please cite our paper:
+```
+@article{ufo,
+  title={UFO: A UI-Focused Agent for Windows OS Interaction},
+  author={Chaoyun Zhang and Liqun Li and Shilin He and Xu Zhang and Bo Qiao and Si Qin and Minghua Ma and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang},
+  journal={arXiv preprint arXiv:2311.17541},
+  year={2024}
+}
+```
 
-This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
-For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
-contact [[email protected]](mailto:[email protected]) with any additional questions or comments.
 
 ## Trademarks
 
 This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
 trademarks or logos is subject to and must follow 
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
-Any use of third-party trademarks or logos are subject to those third-party's policies.
+Any use of third-party trademarks or logos are subject to those third-party's policies.
diff --git a/assets/overview.png b/assets/overview.png
diff --git a/assets/ufo_blue.png b/assets/ufo_blue.png
diff --git a/assets/ufo_rv.png b/assets/ufo_rv.png
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,8 @@
+art==6.1
+colorama==0.4.6
+msal==1.25.0
+openai==1.11.1
+Pillow==10.2.0
+pywinauto==0.6.8
+PyYAML==6.0.1
+Requests==2.31.0
diff --git a/ufo/__init__.py b/ufo/__init__.py
diff --git a/ufo/__main__.py b/ufo/__main__.py
@@ -0,0 +1,5 @@
+from . import ufo
+
+if __name__ == "__main__":
+    # Execute the main script
+    ufo.main()
diff --git a/ufo/config/__init__.py b/ufo/config/__init__.py
diff --git a/ufo/config/config.py b/ufo/config/config.py
@@ -0,0 +1,24 @@
+import os
+import yaml
+
+
+def load_config(config_path="ufo/config/config.yaml"):
+    """
+    Load the configuration from a YAML file and environment variables.
+
+    :param config_path: The path to the YAML config file. Defaults to "ufo/config/config.yaml".
+    :return: Merged configuration from environment variables and YAML file.
+    """
+    # Copy environment variables to avoid modifying them directly
+    configs = dict(os.environ)
+
+    try:
+        with open(config_path, "r") as file:
+            yaml_data = yaml.safe_load(file)
+        # Update configs with YAML data
+        if yaml_data:
+            configs.update(yaml_data)
+    except FileNotFoundError:
+        print(f"Warning: Config file not found at {config_path}. Using only environment variables.")
+
+    return configs
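A note on the merge order in `load_config`: the YAML data is applied with `configs.update(...)` after the environment variables are copied, so a key present in both places takes its value from the YAML file. The sketch below illustrates that precedence with plain dictionaries. `merge_configs` and the sample values are hypothetical stand-ins for illustration, not part of this commit.

```python
def merge_configs(env, yaml_data):
    """Hypothetical helper mirroring the merge order used by load_config."""
    configs = dict(env)            # copy, so the caller's mapping is not modified
    if yaml_data:                  # an empty YAML file parses to None
        configs.update(yaml_data)  # YAML values overwrite same-named env values
    return configs


# Sample inputs, assumed for illustration only
env = {"MAX_RETRY": "5", "HOME_DIR": "C:/Users/demo"}
yaml_data = {"MAX_RETRY": 3, "TEMPERATURE": 0.0}

merged = merge_configs(env, yaml_data)

# With no YAML data, the environment value is kept, and it is a string
env_only_type = type(merge_configs(env, None)["MAX_RETRY"]).__name__
```

Environment values are always strings, so a numeric setting such as `MAX_RETRY` supplied only through the environment would need converting before it is used in `range(...)`.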
diff --git a/ufo/config/config.yaml b/ufo/config/config.yaml
@@ -0,0 +1,40 @@
+version: 0.1
+
+OPENAI_API_BASE: "https://cloudgpt-swc.openai.azure.com/openai/deployments/gpt-4-visual-preview/chat/completions?api-version=2023-12-01-preview" # The base URL for the OpenAI API
+OPENAI_API_KEY: ""  # Set the value to sk-xxx if you host an OpenAI-compatible interface for an open LLM
+OPENAI_API_MODEL: "gpt-4-visual-preview"  # Currently the only OpenAI model that accepts visual input
+CONTROL_BACKEND: "uia"  # The backend for control action
+MAX_TOKENS: 2000  # The max token limit for the response completion
+MAX_RETRY: 3  # The max retry limit for the response completion
+MAX_STEP: 30  # The max step limit for completing the user request
+SLEEP_TIME: 5  # The sleep time between each step to wait for the window to be ready
+TEMPERATURE: 0.0  # The temperature of the model: the lower the value, the more consistent the output of the model
+TOP_P: 0.0  # The top_p of the model: the lower the value, the more conservative the output of the model
+SAFE_GUARD: True  # Whether to use the safeguard to prevent the model from performing sensitive operations.
+CONTROL_TYPE_LIST: ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton"]  # The list of control types that are allowed to be selected 
+HISTORY_KEYS: ["Step", "Thought", "ControlText", "Action", "Comment", "Results"]  # The keys of the action history for the next step.
+ANNOTATION_COLORS: {
+        "Button": "#FFF68F",
+        "Edit": "#A5F0B5",
+        "TabItem": "#A5E7F0",
+        "Document": "#FFD18A",
+        "ListItem": "#D9C3FE",
+        "MenuItem": "#E7FEC3",
+        "ScrollBar": "#FEC3F8",
+        "TreeItem": "#D6D6D6",
+        "Hyperlink": "#91FFEB",
+        "ComboBox": "#D8B6D4"
+    }
+
+PRINT_LOG: False  # Whether to print the log
+CONCAT_SCREENSHOT: True  # Whether to concat the screenshot for the control item
+LOG_LEVEL: "DEBUG"  # The log level
+INCLUDE_LAST_SCREENSHOT: True  # Whether to include the last screenshot in the observation
+REQUEST_TIMEOUT: 250  # The call timeout for the GPT-V model
+APP_SELECTION_PROMPT: "ufo/prompts/base/app_selection.yaml"  # The prompt for the app selection
+ACTION_SELECTION_PROMPT: "ufo/prompts/base/action_selection.yaml"  # The prompt for the action selection
+INPUT_TEXT_API: "type_keys" # The input text API
+
+
+
+
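One detail worth noting in this config: `CONTROL_TYPE_LIST` includes `RadioButton`, but `ANNOTATION_COLORS` has no entry for it. How the annotation code handles that case is not visible in this excerpt. One defensive option, sketched here purely as an assumption, is a lookup with a fallback colour. The name `annotation_color` and the default `#FFFFFF` are hypothetical.

```python
# Subset of the values from ufo/config/config.yaml
CONTROL_TYPE_LIST = ["Button", "Edit", "ComboBox", "RadioButton"]
ANNOTATION_COLORS = {
    "Button": "#FFF68F",
    "Edit": "#A5F0B5",
    "ComboBox": "#D8B6D4",
}

DEFAULT_COLOR = "#FFFFFF"  # hypothetical fallback, not defined in the config


def annotation_color(control_type, colors=ANNOTATION_COLORS, default=DEFAULT_COLOR):
    """Return the configured colour, or the fallback when none is configured."""
    return colors.get(control_type, default)


# Control types that are selectable but have no configured colour
missing = [t for t in CONTROL_TYPE_LIST if t not in ANNOTATION_COLORS]
```

Listing the `missing` types at start-up is a cheap way to catch a mismatch between the two settings before it surfaces as a `KeyError`.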
diff --git a/ufo/llm/__init__.py b/ufo/llm/__init__.py
diff --git a/ufo/llm/llm_call.py b/ufo/llm/llm_call.py
@@ -0,0 +1,59 @@
+import requests
+import time
+from ..config.config import load_config
+from ..utils import print_with_color
+
+configs = load_config()
+
+
+def get_gptv_completion(messages, headers):
+    """
+    Get a GPT-V completion for the given messages.
+
+    The endpoint, model, max_tokens, temperature, top_p, and retry limit are
+    read from the loaded configuration rather than passed as arguments.
+
+    :param messages: The messages to be sent to GPT-V.
+    :param headers: The headers of the request.
+    :return: A (response JSON, estimated cost) tuple, or None if every
+        attempt fails.
+    """
+
+    payload = {
+        "messages": messages,
+        "temperature": configs["TEMPERATURE"],
+        "max_tokens": configs["MAX_TOKENS"],
+        "top_p": configs["TOP_P"],
+        "model": configs["OPENAI_API_MODEL"]
+    }
+
+
+    for _ in range(configs["MAX_RETRY"]):
+        # Reset per attempt so the except block never reads an unbound name
+        response = None
+        try:
+            response = requests.post(configs["OPENAI_API_BASE"], headers=headers, json=payload)
+            response.raise_for_status()  # Raises an HTTPError on an unsuccessful status code
+            response_json = response.json()
+
+            if "choices" not in response_json:
+                print_with_color("GPT Error: No Reply", "red")
+                continue
+
+            # Estimate the cost from the token usage; missing fields count as zero
+            usage = response_json.get("usage", {})
+            prompt_tokens = usage.get("prompt_tokens", 0)
+            completion_tokens = usage.get("completion_tokens", 0)
+
+            cost = prompt_tokens / 1000 * 0.01 + completion_tokens / 1000 * 0.03
+
+            return response_json, cost
+        except requests.RequestException as e:
+            print_with_color(f"Error making API request: {e}", "red")
+            if response is not None:
+                # The raw body is printable even when it is not valid JSON
+                print_with_color(response.text, "red")
+            # Back off briefly before the next attempt
+            time.sleep(3)
+            continue
+
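The cost estimate in `get_gptv_completion` charges 0.01 per 1,000 prompt tokens and 0.03 per 1,000 completion tokens. These are the rates hard-coded in the function; the source does not state the currency. A small worked example follows, with `estimate_cost` as a hypothetical stand-alone helper and made-up token counts:

```python
PROMPT_RATE_PER_1K = 0.01      # rate taken from the formula in llm_call.py
COMPLETION_RATE_PER_1K = 0.03  # rate taken from the formula in llm_call.py


def estimate_cost(usage):
    """Hypothetical helper applying the same formula to a usage dictionary."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    return (prompt_tokens / 1000 * PROMPT_RATE_PER_1K
            + completion_tokens / 1000 * COMPLETION_RATE_PER_1K)


# Made-up usage: 1.5 * 0.01 + 0.4 * 0.03 = 0.015 + 0.012, about 0.027
cost = estimate_cost({"prompt_tokens": 1500, "completion_tokens": 400})

# A response without usage data is costed at zero
no_usage_cost = estimate_cost({})
```

Keeping the rates as named constants makes them easier to update when the model or its pricing changes.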