Skip to content
/ UFO Public
forked from microsoft/UFO

A UI-Focused Agent for Windows OS Interaction.

License

Notifications You must be signed in to change notification settings

exceLISA/UFO

Repository files navigation

UFO UFO Image: A UI-Focused Agent for Windows OS Interaction

Python VersionLicense: MITWelcome

UFO is a UI-Focused dual-agent framework to fulfill user requests on Windows OS by seamlessly navigating and operating within individual or spanning multiple applications.

🕌 Framework

UFO UFO Image operates as a dual-agent framework, encompassing:

  • AppAgent 🤖, tasked with choosing an application for fulfilling user requests. This agent may also switch to a different application when a request spans multiple applications, and the task is partially completed in the preceding application.
  • ActAgent 👾, responsible for iteratively executing actions on the selected applications until the task is successfully concluded within a specific application.

Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend the application UI and fulfill the user's request. For more details, please consult our technical report.

🆕 News

  • 📅 2024-02-08 UFO is released on GitHub🎈.

💥 Highlights

  • First Windows Agent - UFO represents the first agent framework that can translate user request in natural language into grounded operation on Windows OS.
  • Interactive Mode - UFO allows for multiple sub-requests from users in the same session for completing complex task.
  • Action Safeguard - UFO supports safeguard to prompt for user confirmation when the action is sensitive.
  • Easy Extension - UFO is easy to extend to accomplish more complex tasks with different operations.

✨ Getting Started

🛠️ Step 1: Installation

UFO requires Python >= 3.10 running on Windows OS >= 10. It can be installed by running the following command:

# [optional to create conda environment]
# conda create -n ufo python=3.10
# conda activate ufo

# clone the repository
git clone https://github.com/microsoft/UFO.git
cd UFO
# install the requirements
pip install -r requirements.txt

⚙️ Step 2: Configure the LLMs

Before running UFO, you need to provide your LLM configurations. Taking OpenAI as an example, you can configure ufo/config/config.yaml file as follows.

OpenAI

API_TYPE: "openai" 
OPENAI_API_BASE: "https://api.openai.com/v1/chat/completions" # The base URL for the OpenAI API
OPENAI_API_KEY: "YOUR_API_KEY"  # Set the value to the openai key for the llm model
OPENAI_API_MODEL: "GPTV_MODEL_NAME"  # The only OpenAI model by now that accepts visual input

Azure OpenAI (AOAI)

API_TYPE: "aoai" 
OPENAI_API_BASE: "YOUR_ENDPOINT" # The AOAI API endpoint
OPENAI_API_KEY: "YOUR_API_KEY"  # Set the value to the openai key for the llm model
OPENAI_API_MODEL: "GPTV_MODEL_NAME"  # The only OpenAI model by now that accepts visual input

🎉 Step 3: Start UFO

⌨️ Command Line (CLI)

# assume you are in the cloned UFO folder
python -m ufo --task <your_task_name>

This will start the UFO process and you can interact with it through the command line interface. If everything goes well, you will see the following message:

Welcome to use UFO🛸, A UI-focused Agent for Windows OS Interaction. 
 _   _  _____   ___
| | | ||  ___| / _ \
| | | || |_   | | | |
| |_| ||  _|  | |_| |
 \___/ |_|     \___/
Please enter your request to be completed🛸:

Reminder❗: Before UFO executing your request, please make sure the targeted applications are active on the system.

Step 4 🎥: Execution Logs

You can find the screenshots taken and request & reponse logs in the following folder:

./ufo/logs/<your_task_name>/

You may use them to debug, replay, or analyze the agent output.

❓Get help


🎬 Demo Examples

We present two demo videos that complete user request on Windows OS using UFO. For more cases, please consult our technical report.

1️⃣🗑️ Example 1: Deleting all notes on a PowerPoint presentation.

In this example, we will show you how to use UFO to deleting all notes on a PowerPoint presentation with just a few simple steps. Explore it to work smarter not harder!

ufo_delete_note.mp4

2️⃣📧 Example 2: Composing an email using text from multiple sources.

In this example, we will show you how to use UFO to extract texts from Word documents, description of an image, to compose an email and send. Enjoy your cross-application experiment with UFO!

ufo_meeting_note_crossed_app_demo_new.mp4

📊 Evaluation

Please consult the WindosBench provided in Section A of the Appendix within our technical report. Here are some tips (and requirements) to aid in completing your request:

  • Prior to UFO execution of your request, ensure that the targeted application is active (though it may be minimized).
  • Occasionally, requests to GPT-V may trigger content safety measures. UFO will attempt to retry regardless, but adjusting the size or scale of the application window may prove helpful. We are actively solving this issue.
  • Please note that the output of GPT-V may not consistently align with the same request. If unsuccessful with your initial attempt, consider trying again.

📚 Citation

Our paper could be found here. If you use UFO in your research, please cite our paper:

@article{ufo,
  title={UFO: A UI-Focused Agent for Windows OS Interaction},
  author={Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang},
  journal={arXiv preprint arXiv:2311.17541},
  year={2024}
}

🎨 Related Project

You may also find TaskWeaver useful, a code-first agent framework for seamlessly planning and executing data analytics tasks.

Disclaimer

By choosing to run the provided code, you acknowledge and agree to the following terms and conditions regarding the functionality and data handling practices in DISCLAIMER.md

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

About

A UI-Focused Agent for Windows OS Interaction.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%