UFO is a UI-Focused dual-agent framework that fulfills user requests on Windows OS by seamlessly navigating and operating within a single application or across multiple applications.
UFO operates as a dual-agent framework, encompassing:
- AppAgent 🤖, tasked with choosing an application to fulfill user requests. This agent may also switch to a different application when a request spans multiple applications and the task is only partially completed in the preceding application.
- ActAgent 👾, responsible for iteratively executing actions on the selected application until the task is successfully concluded within that application.
Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our technical report.
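Conceptually, the two agents cooperate in a simple loop: AppAgent picks (and, for cross-application requests, switches) the application, while ActAgent executes the UI actions inside it. The toy sketch below illustrates that control flow only; all class and method names are illustrative stand-ins, not UFO's actual API.

```python
# Toy sketch of UFO's dual-agent loop. All names here are illustrative
# stand-ins for the concepts in the framework, not UFO's real API.

class AppAgent:
    """Chooses which application should handle the (remaining) request."""

    def __init__(self, plan):
        self.plan = list(plan)  # e.g. ["Word", "Outlook"] for a cross-app task

    def select_application(self):
        # Switch to the next application when the previous one is done.
        return self.plan.pop(0) if self.plan else None


class ActAgent:
    """Iteratively executes grounded UI actions inside the selected app."""

    def run(self, app, steps):
        log = []
        for step in steps:  # each step is one grounded UI operation
            log.append(f"{app}: {step}")
        return log


def fulfill(request_plan):
    """Drive AppAgent/ActAgent until every application's sub-task is done."""
    app_agent = AppAgent([app for app, _ in request_plan])
    act_agent = ActAgent()
    trace = []
    for _, steps in request_plan:
        app = app_agent.select_application()
        trace.extend(act_agent.run(app, steps))
    return trace
```

In the real framework, both agents consult GPT-Vision at each step rather than following a fixed plan.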
- 📅 2024-02-08 UFO is released on GitHub🎈.
- First Windows Agent - UFO is the first agent framework that can translate user requests in natural language into grounded operations on Windows OS.
- Interactive Mode - UFO allows multiple sub-requests from users in the same session for completing complex tasks.
- Action Safeguard - UFO supports a safeguard that prompts for user confirmation when an action is sensitive.
- Easy Extension - UFO is easy to extend to accomplish more complex tasks with different operations.
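To give a feel for what extending the framework with a new operation might look like, here is a minimal registry-based sketch; the registry and function names below are hypothetical illustrations, not UFO's real extension API (see the repository for the actual mechanism).

```python
# Hypothetical sketch of registering custom operations via a lookup table.
# UFO's real extension mechanism may differ -- consult the repository.

OPERATIONS = {}


def register(name):
    """Decorator that adds an operation to the registry under a name."""
    def wrap(fn):
        OPERATIONS[name] = fn
        return fn
    return wrap


@register("click")
def click(control):
    # A stand-in for clicking a UI control.
    return f"clicked {control}"


@register("set_text")
def set_text(control, text):
    # A stand-in for typing text into a UI control.
    return f"set {control} to {text!r}"


def execute(name, *args):
    """Dispatch an operation chosen by the agent at runtime."""
    return OPERATIONS[name](*args)
```

New operations then become available to the agent simply by registering them, without touching the dispatch logic.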
UFO requires Python >= 3.10 running on Windows OS >= 10. It can be installed by running the following command:
# [optional to create conda environment]
# conda create -n ufo python=3.10
# conda activate ufo
# clone the repository
git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt
Before running UFO, you need to provide your LLM configurations. Taking OpenAI as an example, you can configure ufo/config/config.yaml
file as follows.
OPENAI_API_BASE: Your OpenAI Endpoint # The base URL for the OpenAI API
OPENAI_API_KEY: Your OpenAI Key # Set the value to the openai key for the llm model
OPENAI_API_MODEL: GPT Model Name # Currently the only OpenAI model that accepts visual input
# assume you are in the cloned UFO folder
python -m ufo --task <your_task_name>
This will start the UFO process and you can interact with it through the command line interface. If everything goes well, you will see the following message:
Welcome to use UFO🛸, A UI-focused Agent for Windows OS Interaction.
 _   _  _____   ___
| | | ||  ___| / _ \
| | | || |_   | | | |
| |_| ||  _|  | |_| |
 \___/ |_|     \___/
Please enter your request to be completed🛸:
Reminder❗: Before UFO executes your request, please make sure the targeted applications are active on the system.
You can find the screenshots taken and the request and response logs in the following folder:
./ufo/logs/<your_task_name>/
You may use them to debug, replay, or analyze the agent output.
- ❔GitHub Issues (preferred)
- For other communications, please contact [email protected]
We present two demo videos that complete user requests on Windows OS using UFO. For more cases, please consult our technical report.
In this example, we will show you how to use UFO to delete all notes in a PowerPoint presentation with just a few simple steps. Explore it to work smarter, not harder!
ufo_delete_note.mp4
In this example, we will show you how to use UFO to extract text from a Word document and the description of an image, then compose an email and send it. Enjoy your cross-application experiment with UFO!
ufo_meeting_note_crossed_app_demo_new.mp4
To evaluate, please refer to WindowsBench in Section A of the Appendix of our technical report.
Some tips for completing your request:
Our paper can be found here. If you use UFO in your research, please cite our paper:
@article{ufo,
title={UFO: A UI-Focused Agent for Windows OS Interaction},
  author={Chaoyun Zhang and Liqun Li and Shilin He and Xu Zhang and Bo Qiao and Si Qin and Minghua Ma and Yu Kang and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Qi Zhang},
journal={arXiv preprint arXiv:2311.17541},
year={2024}
}
By choosing to run the provided code, you acknowledge and agree to the terms and conditions regarding functionality and data handling practices described in DISCLAIMER.md.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.