forked from mnotgod96/AppAgent
Commit 6d608cc (0 parents), committed by jiaxuanliu on Dec 21, 2023.
Showing 18 changed files with 1,619 additions and 0 deletions.
LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Jiaxuan Liu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
@@ -0,0 +1,148 @@
# AppAgent

<div align="center">

<a href='https://arxiv.org/abs/2311.16483'><img src='https://img.shields.io/badge/arXiv-2311.16483-b31b1b.svg'></a>
<a href='https://github.com/appagent-official/appagent-official.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://github.com/buaacyw/GaussianEditor/blob/master/LICENSE.txt'><img src='https://img.shields.io/badge/License-MIT-blue'></a>
<br><br>
<!-- [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/listen2you002/ChartLlama-13b)
[![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/listen2you002/ChartLlama-Dataset) -->

[**Chi Zhang***](https://icoz69.github.io/), [**Zhao Yang***](), [**Jiaxuan Liu***](https://www.linkedin.com/in/jiaxuan-liu-9051b7105/), [Yucheng Han](http://tingxueronghua.github.io), [Xin Chen](https://chenxin.tech/), [Zebiao Huang](),
<br>
[Bin Fu](https://openreview.net/profile?id=~BIN_FU2), [Gang Yu (Corresponding Author)](https://www.skicyyu.org/)
<br>
(* equal contributions)
</div>

![](./assets/teaser.png)

## 🔆 Introduction

We introduce a novel LLM-based multimodal agent framework designed to operate smartphone applications.

Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.

Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications.

## 📝 Changelog
- __[2023.12.21]__: 🔥🔥 Open-source the git repository, including the detailed configuration steps to implement our AppAgent!

## ✨ Demo

The demo video shows the process of using AppAgent to follow a user on X (Twitter) in the deployment phase.

https://github.com/mnotgod96/AppAgent/assets/27103154/e99c0e14-f61e-4921-ba20-e9c7aa611c34

## 🚀 Quick Start

This section will guide you on how to quickly use `gpt-4-vision-preview` as an agent to complete specific tasks for you on your Android app.

### ⚙️ Step 1. Prerequisites

1. Get an Android device and enable USB debugging, which can be found in Developer Options in Settings.

2. On your PC, download and install [Android Debug Bridge](https://developer.android.com/tools/adb) (adb), a command-line tool that lets you communicate with your Android device from the PC.

3. Connect your device to your PC using a USB cable. (A quick way to verify the connection is sketched after the install commands below.)

4. Clone this repo and install the dependencies. All scripts in this project are written in Python 3, so make sure you have it installed.
```bash
cd AppAgent
pip install -r requirements.txt
```
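
Not part of the repository: as a quick sanity check before running any scripts, the sketch below (assuming `adb` is installed and on your PATH) shells out to `adb devices` and reports whether a device is attached.

```python
import subprocess

# Hypothetical helper, for illustration only: list devices visible to adb.
# A correctly connected phone appears as "<serial>\tdevice".
result = subprocess.run(["adb", "devices"], capture_output=True, text=True, check=True)
devices = [line for line in result.stdout.strip().splitlines()[1:] if line.strip()]
if devices:
    print("Connected devices:", devices)
else:
    print("No device found - check the USB cable and that USB debugging is enabled.")
```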

### 🤖 Step 2. Configure the Agent

AppAgent needs to be powered by a multimodal model that can receive both text and visual inputs. During our experiments, we used `gpt-4-vision-preview` as the model that makes decisions on how to take actions to complete a task on the smartphone.

To configure your requests to GPT-4V, modify `config.yaml` in the root directory. Two key parameters must be configured to try AppAgent:
1. OpenAI API key: you must purchase an eligible API key from OpenAI so that you have access to GPT-4V.
2. Request interval: the time interval in seconds between consecutive GPT-4V requests, which controls the frequency of your requests to GPT-4V. Adjust this value according to the status of your account.

Other parameters in `config.yaml` are well commented. Modify them as needed.
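
For reference, the key names below match the committed `config.yaml`; the loading code itself is only an illustrative sketch (the repository's scripts have their own config handling), using the `pyyaml` package listed in `requirements.txt`.

```python
import yaml  # provided by the pyyaml dependency

# Illustration only: read the two key parameters from config.yaml.
with open("config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

openai_api_key = cfg["OPENAI_API_KEY"]      # your OpenAI key with GPT-4V access
request_interval = cfg["REQUEST_INTERVAL"]  # seconds to wait between GPT-4V requests
print(f"Using key ending in ...{openai_api_key[-4:]}, interval {request_interval}s")
```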

> Be aware that GPT-4V is not free. Each request/response pair involved in this project costs around $0.03. Use it wisely.

If you want to test AppAgent using your own models, you should modify the `ask_gpt_4v` function in `scripts/model.py` accordingly.
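
For illustration only (the real `ask_gpt_4v` in `scripts/model.py` has its own signature and error handling), a replacement that forwards a text-plus-screenshot prompt to an OpenAI-compatible endpoint might look roughly like this; the function name, parameters, and endpoint used here are assumptions, not the repository's API.

```python
import requests


def ask_custom_model(prompt, images_b64, api_base, api_key, model="my-vlm", max_tokens=300):
    """Hypothetical stand-in for ask_gpt_4v: send text plus base64-encoded
    screenshots to an OpenAI-compatible /chat/completions endpoint."""
    content = [{"type": "text", "text": prompt}]
    for img in images_b64:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img}"}})
    payload = {"model": model,
               "messages": [{"role": "user", "content": content}],
               "max_tokens": max_tokens}
    headers = {"Authorization": f"Bearer {api_key}"}
    resp = requests.post(f"{api_base}/chat/completions", json=payload, headers=headers)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```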

### 🔍 Step 3. Exploration Phase

Our paper proposed a novel solution that involves two phases, exploration and deployment, to turn GPT-4V into a capable agent that can help users operate their Android phones when given a task. The exploration phase starts with a task given by you, and you can choose to let the agent either explore the app on its own or learn from your demonstration. In both cases, the agent generates documentation for the elements interacted with during the exploration/demonstration and saves it for use in the deployment phase.

#### Option 1: Autonomous Exploration

This solution features fully autonomous exploration, which allows the agent to explore the use of the app by attempting the given task without any human intervention.

To start, run `learn.py` in the root directory. Follow the prompted instructions to select `autonomous exploration` as the operating mode and provide the app name and task description. Then, your agent will do the job for you. In this mode, AppAgent reflects on its previous action to make sure the action adheres to the given task, and generates documentation for the elements explored.

```bash
python learn.py
```

#### Option 2: Learning from Human Demonstrations

This solution requires users to demonstrate a similar task first. AppAgent will learn from the demo and generate documentation for the UI elements seen during the demo.

To start a human demonstration, run `learn.py` in the root directory. Follow the prompted instructions to select `human demonstration` as the operating mode and provide the app name and task description. A screenshot of your phone will be captured, and all interactive elements shown on the screen will be labeled with numeric tags. You need to follow the prompts to determine your next action and the target of each action. When you believe the demonstration is finished, type `stop` to end the demo.

```bash
python learn.py
```

![](./assets/demo.png)

### 📱 Step 4. Deployment Phase

After the exploration phase finishes, run `run.py` in the root directory. Follow the prompted instructions to enter the name of the app, select the documentation base you want the agent to use, and provide the task description. Then, your agent will do the job for you. The agent will automatically detect whether a documentation base was previously generated for the app; if no documentation is found, you can still choose to run the agent without any documentation (success rate not guaranteed).

```bash
python run.py
```

## 📖 TO-DO LIST
- [ ] Open source the Benchmark.
- [x] Open source the configuration.

## 😉 Citation
```bib
@misc{AppAgent,
    author = {Chi Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu},
    title = {AppAgent: Multimodal Agents as Smartphone Users},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/mnotgod96/AppAgent}},
}
```
config.yaml
@@ -0,0 +1,14 @@
OPENAI_API_BASE: "https://api.openai.com/v1"
OPENAI_API_KEY: "sk-" # Set the value to sk-xxx if you host an OpenAI-compatible interface for an open LLM model
OPENAI_API_MODEL: "gpt-4-vision-preview" # Currently the only OpenAI model that accepts visual input
MAX_TOKENS: 300 # The maximum token limit for the response completion
TEMPERATURE: 0.0 # The temperature of the model: the lower the value, the more consistent the output of the model
REQUEST_INTERVAL: 10 # Time in seconds between consecutive GPT-4V requests

ANDROID_SCREENSHOT_DIR: "/sdcard/Pictures/Screenshots" # Directory on your Android device for storing the intermediate screenshots. Make sure the directory EXISTS on your phone!
ANDROID_XML_DIR: "/sdcard" # Directory on your Android device for storing the intermediate XML files used to determine the locations of UI elements on the screen. Make sure the directory EXISTS on your phone!

DOC_REFINE: false # Setting this to true makes the agent refine existing documentation based on the latest demonstration; otherwise, the agent will not regenerate documentation for elements with the same resource ID.
MAX_ROUNDS: 20 # The round limit for the agent to complete the task
DARK_MODE: false # Set this to true if your app is in dark mode to enhance element labeling
MIN_DIST: 30 # The minimum distance between elements to prevent overlapping during the labeling process
learn.py
@@ -0,0 +1,44 @@
import argparse
import datetime
import os
import time

from scripts.utils import print_with_color

arg_desc = "AppAgent - exploration phase"
parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter, description=arg_desc)
parser.add_argument("--app")
parser.add_argument("--root_dir", default="./")
args = vars(parser.parse_args())

app = args["app"]
root_dir = args["root_dir"]


print_with_color("Welcome to the exploration phase of AppAgent!\nThe exploration phase aims at generating "
                 "documentations for UI elements through either autonomous exploration or human demonstration. "
                 "Both options are task-oriented, which means you need to give a task description. During "
                 "autonomous exploration, the agent will try to complete the task by interacting with possible "
                 "elements on the UI within limited rounds. Documentations will be generated during the process of "
                 "interacting with the correct elements to proceed with the task. Human demonstration relies on "
                 "the user to show the agent how to complete the given task, and the agent will generate "
                 "documentations for the elements interacted during the human demo. To start, please enter the "
                 "main interface of the app on your phone.", "yellow")
print_with_color("Choose from the following modes:\n1. autonomous exploration\n2. human demonstration\n"
                 "Type 1 or 2.", "blue")
user_input = ""
while user_input != "1" and user_input != "2":
    user_input = input()

if not app:
    print_with_color("What is the name of the target app?", "blue")
    app = input()
    app = app.replace(" ", "")

if user_input == "1":
    os.system(f"python scripts/self_explorer.py --app {app} --root_dir {root_dir}")
else:
    demo_timestamp = int(time.time())
    demo_name = datetime.datetime.fromtimestamp(demo_timestamp).strftime(f"demo_{app}_%Y-%m-%d_%H-%M-%S")
    os.system(f"python scripts/step_recorder.py --app {app} --demo {demo_name} --root_dir {root_dir}")
    os.system(f"python scripts/document_generation.py --app {app} --demo {demo_name} --root_dir {root_dir}")
requirements.txt
@@ -0,0 +1,6 @@
argparse
colorama
opencv-python
pyshine
pyyaml
requests
run.py
@@ -0,0 +1,25 @@
import argparse
import os

from scripts.utils import print_with_color

arg_desc = "AppAgent - deployment phase"
parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter, description=arg_desc)
parser.add_argument("--app")
parser.add_argument("--root_dir", default="./")
args = vars(parser.parse_args())

app = args["app"]
root_dir = args["root_dir"]

print_with_color("Welcome to the deployment phase of AppAgent!\nBefore giving me the task, you should first tell me "
                 "the name of the app you want me to operate and what documentation base you want me to use. I will "
                 "try my best to complete the task without your intervention. First, please enter the main interface "
                 "of the app on your phone and provide the following information.", "yellow")

if not app:
    print_with_color("What is the name of the target app?", "blue")
    app = input()
    app = app.replace(" ", "")

os.system(f"python scripts/task_executor.py --app {app} --root_dir {root_dir}")
Empty file.