
Commit

new release
vyokky committed Mar 25, 2024
1 parent b74b192 commit 1d6d443
Showing 5 changed files with 173 additions and 141 deletions.
56 changes: 46 additions & 10 deletions README.md
@@ -33,11 +33,11 @@ Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend th


## 📢 News
- 📅 2024-03-25: **New Release for v0.1**! Check out our exciting new features:
1. UFO now supports RAG from offline document and online Bing search.
2. You can save the task completion trajectory for UFO's reference. It can improves its future successful rate!
3. We now support creating your help documents for each Windows applications to become an app expert. Check the [README](./learner/README.md) for more details!
4. You cam customize different GPT models for AppAgent and ActAgent. Text-only models (e.g. GPT-4) are now supported!
- 📅 2024-03-25: **New Release for v0.1!** Check out our exciting new features:
1. UFO now supports RAG from offline documents and online Bing search.
2. You can save the task completion trajectory into its memory for UFO's reference, improving its future success rate!
3. We now support creating your help documents for each Windows application to become an app expert. Check the [README](./learner/README.md) for more details!
4. You can customize different GPT models for AppAgent and ActAgent. Text-only models (e.g., GPT-4) are now supported!
- 📅 2024-02-14: Our [technical report](https://arxiv.org/abs/2402.07939) is online!
- 📅 2024-02-10: UFO is released on GitHub🎈. Happy Chinese New year🐉!

@@ -82,10 +82,10 @@
```bash
pip install -r requirements.txt
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individully for AppAgent and ActAgent**. You can create a config file `ufo/config/config_llm.yaml`, by copying the `ufo/config/config_llm.yaml.template` and editing config for "APP_AGENT" and "ACTION_AGENT" as follows:
Before running UFO, you need to provide your LLM configurations **individually for AppAgent and ActAgent**. You can create your own config file `ufo/config/config.yaml` by copying the `ufo/config/config.yaml.template` and editing the config for **APP_AGENT** and **ACTION_AGENT** as follows:

#### OpenAI
```bash
VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint.
@@ -95,7 +95,7 @@ API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts
```

#### Azure OpenAI (AOAI)
```bash
API_TYPE: "aoai" , # The API type, "aoai" for the Azure OpenAI.
API_BASE: "YOUR_ENDPOINT", # The AOAI API address. Format: https://{your-resource-name}.openai.azure.com
API_KEY: "YOUR_KEY", # The aoai API key
@@ -106,8 +106,44 @@ API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API
```
You can optionally set a backup LLM engine in the "BACKUP_AGENT" field, which is used if the engines above fail.
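For reference, a minimal "BACKUP_AGENT" sketch mirroring the structure of the two agents above (based on `ufo/config/config.yaml.template`; the key and model values are placeholders to replace with your own):

```bash
BACKUP_AGENT: {
  VISUAL_MODE: True, # Whether to use the visual mode
  API_TYPE: "openai", # The API type, "openai" for the OpenAI API.
  API_BASE: "https://api.openai.com/v1/chat/completions", # The OpenAI API endpoint.
  API_KEY: "sk-", # The OpenAI API key
  API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
  API_MODEL: "gpt-4-vision-preview", # The model to fall back on when the primary engines fail
}
```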

### 📔 Step 3: Additional Settings for RAG (optional)
If you want to enhance UFO's ability with external knowledge, you can optionallly config it with external database for retrieval augmented generation (RAG).
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.

#### RAG from Offline Help Document
Before enabling this function, you need to create an offline indexer for your help documents. Please refer to the [README](./learner/README.md) to learn how to create an offline vector database for retrieval. You can enable this function by setting the following configuration:
```bash
## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: True # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
```
Adjust `RAG_OFFLINE_DOCS_RETRIEVED_TOPK` to control how many retrieved documents UFO consults.


#### RAG from Online Bing Search Engine
Enhance UFO's ability by utilizing the most up-to-date online search results! To use this function, you need to obtain a Bing search API key. Activate this feature by setting the following configuration:
```bash
## RAG Configuration for the Bing search
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: True # Whether to use the online search for the RAG.
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents
```
Adjust `RAG_ONLINE_SEARCH_TOPK` (the number of search results fetched) and `RAG_ONLINE_RETRIEVED_TOPK` (the number of those results retrieved as documents) to balance quality against speed.


#### RAG from Self-Demonstration
Save task completion trajectories into UFO's memory for future reference. This can improve its future success rates based on its previous experiences!

After completing a task, you'll see the following message:
```
Would you like to save the current conversation flow for future reference by the agent?
[Y] for yes, any other key for no.
```
Press `Y` to save it into its memory and enable memory retrieval via the following configuration:
```bash
## RAG Configuration for experience
RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the retrieved experience records
```
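Putting it all together, a configuration enabling all three knowledge sources at once could look like the sketch below (illustrative values; it assumes you have built the offline indexer and obtained a Bing search API key):

```bash
## Illustrative combined RAG setup
RAG_OFFLINE_DOCS: True # Offline help-document retrieval
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: True # Online Bing search retrieval
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents
RAG_EXPERIENCE: True # Retrieval from saved task trajectories
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the retrieved experience records
```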



@@ -137,7 +173,7 @@ Please enter your request to be completed🛸:
- The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to [DISCLAIMER.md](./DISCLAIMER.md).


### Step 4 🎥: Execution Logs
### Step 5 🎥: Execution Logs

You can find the screenshots taken and request & response logs in the following folder:
```
…
```
6 changes: 3 additions & 3 deletions ufo/config/config.py
@@ -25,11 +25,11 @@ def load_config(config_path="ufo/config/"):
# Update configs with YAML data
if yaml_data:
configs.update(yaml_data)
with open(path + "config_llm.yaml", "r") as file:
yaml_llm_data = yaml.safe_load(file)
with open(path + "config_dev.yaml", "r") as file:
yaml_dev_data = yaml.safe_load(file)
# Update configs with YAML data
if yaml_dev_data:
configs.update(yaml_llm_data)
configs.update(yaml_dev_data)
except FileNotFoundError:
print_with_color(
f"Warning: Config file not found at {config_path}. Using only environment variables.", "yellow")
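This commit splits the configuration into `config.yaml` (LLM settings) and `config_dev.yaml` (development settings), merged into a single flat dict at load time. Below is a minimal sketch of that merge order (a simplified, hypothetical standalone version; the real `load_config` also merges environment variables and prints a colored warning when a file is missing):

```python
import yaml

def load_config_sketch(path: str = "ufo/config/") -> dict:
    """Merge config.yaml, then config_dev.yaml; later keys win."""
    configs: dict = {}
    for name in ("config.yaml", "config_dev.yaml"):
        try:
            with open(path + name, "r") as file:
                data = yaml.safe_load(file)
            if data:  # skip empty files
                configs.update(data)
        except FileNotFoundError:
            pass  # fall back to whatever has been loaded so far
    return configs
```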
130 changes: 86 additions & 44 deletions ufo/config/config.yaml.template
@@ -1,59 +1,101 @@
version: 0.1
APP_AGENT: {
VISUAL_MODE: True, # Whether to use the visual mode

BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API, "aoai" for the AOAI API, 'azure_ad' for the ad authority of the AOAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input


### Comment above and uncomment these if using "aoai".
# API_TYPE: "aoai" , # The API type, "openai" for the OpenAI API, "aoai" for the Azure OpenAI.
# API_BASE: "YOUR_ENDPOINT", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API. As for the aoai, it should be https://{your-resource-name}.openai.azure.com
# API_KEY: "YOUR_KEY", # The aoai API key
# API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
# API_MODEL: "YOUR_MODEL", # The only OpenAI model by now that accepts visual input
# API_DEPLOYMENT_ID: "gpt-4-visual-preview", # The deployment id for the AOAI API

### For Azure_AD
# AAD_TENANT_ID: "YOUR_TENANT_ID", # Set the value to your tenant id for the llm model
# AAD_API_SCOPE: "YOUR_SCOPE", # Set the value to your scope for the llm model
# AAD_API_SCOPE_BASE: "YOUR_SCOPE_BASE" # Set the value to your scope base for the llm model, whose format is API://YOUR_SCOPE_BASE, and the only need is the YOUR_SCOPE_BASE
}

ACTION_AGENT: {
VISUAL_MODE: True, # Whether to use the visual mode

API_TYPE: "openai" , # The API type, "openai" for the OpenAI API, "aoai" for the AOAI API, 'azure_ad' for the ad authority of the AOAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input


### Comment above and uncomment these if using "aoai".
# API_TYPE: "aoai" , # The API type, "openai" for the OpenAI API, "aoai" for the Azure OpenAI.
# API_BASE: "YOUR_ENDPOINT", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API. As for the aoai, it should be https://{your-resource-name}.openai.azure.com
# API_KEY: "YOUR_KEY", # The aoai API key
# API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
# API_MODEL: "YOUR_MODEL", # The only OpenAI model by now that accepts visual input
# API_DEPLOYMENT_ID: "gpt-4-visual-preview", # The deployment id for the AOAI API

### For Azure_AD
# AAD_TENANT_ID: "YOUR_TENANT_ID", # Set the value to your tenant id for the llm model
# AAD_API_SCOPE: "YOUR_SCOPE", # Set the value to your scope for the llm model
# AAD_API_SCOPE_BASE: "YOUR_SCOPE_BASE" # Set the value to your scope base for the llm model, whose format is API://YOUR_SCOPE_BASE, and the only need is the YOUR_SCOPE_BASE
}

BACKUP_AGENT: {
VISUAL_MODE: True, # Whether to use the visual mode

API_TYPE: "openai" , # The API type, "openai" for the OpenAI API, "aoai" for the AOAI API, 'azure_ad' for the ad authority of the AOAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input


### Comment above and uncomment these if using "aoai".
# API_TYPE: "aoai" , # The API type, "openai" for the OpenAI API, "aoai" for the Azure OpenAI.
# API_BASE: "YOUR_ENDPOINT", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API. As for the aoai, it should be https://{your-resource-name}.openai.azure.com
# API_KEY: "YOUR_KEY", # The aoai API key
# API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
# API_MODEL: "YOUR_MODEL", # The only OpenAI model by now that accepts visual input
# API_DEPLOYMENT_ID: "gpt-4-visual-preview", # The deployment id for the AOAI API

### For Azure_AD
# AAD_TENANT_ID: "YOUR_TENANT_ID", # Set the value to your tenant id for the llm model
# AAD_API_SCOPE: "YOUR_SCOPE", # Set the value to your scope for the llm model
# AAD_API_SCOPE_BASE: "YOUR_SCOPE_BASE" # Set the value to your scope base for the llm model, whose format is API://YOUR_SCOPE_BASE, and the only need is the YOUR_SCOPE_BASE
}

CONTROL_BACKEND: "uia" # The backend for control action
MAX_STEP: 30 # The max step limit for completing the user request
SLEEP_TIME: 5 # The sleep time between each step to wait for the window to be ready
SAFE_GUARD: True # Whether to use the safeguard to prevent the model from performing sensitive operations.
CONTROL_TYPE_LIST: ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton"] # The list of control types that are allowed to be selected
HISTORY_KEYS: ["Step", "Thought", "ControlText", "Action", "Comment", "Results"] # The keys of the action history for the next step.
ANNOTATION_COLORS: {
"Button": "#FFF68F",
"Edit": "#A5F0B5",
"TabItem": "#A5E7F0",
"Document": "#FFD18A",
"ListItem": "#D9C3FE",
"MenuItem": "#E7FEC3",
"ScrollBar": "#FEC3F8",
"TreeItem": "#D6D6D6",
"Hyperlink": "#91FFEB",
"ComboBox": "#D8B6D4"
}

PRINT_LOG: False # Whether to print the log
CONCAT_SCREENSHOT: True # Whether to concat the screenshot for the control item
LOG_LEVEL: "DEBUG" # The log level
INCLUDE_LAST_SCREENSHOT: True # Whether to include the last screenshot in the observation
REQUEST_TIMEOUT: 250 # The call timeout for the GPT-V model

APP_SELECTION_PROMPT: "ufo/prompts/base/{mode}/app_selection.yaml" # The prompt for the app selection
ACTION_SELECTION_PROMPT: "ufo/prompts/base/{mode}/action_selection.yaml" # The prompt for the action selection

APP_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/app_example.yaml" # The prompt for the app selection
ACTION_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/action_example.yaml" # The prompt for the action selection

## For experience learning
EXPERIENCE_PROMPT: "ufo/prompts/experience/{mode}/experience_summary.yaml"
EXPERIENCE_SAVED_PATH: "vectordb/experience/"


API_PROMPT: "ufo/prompts/base/{mode}/api.yaml" # The prompt for the API
INPUT_TEXT_API: "type_keys" # The input text API
INPUT_TEXT_ENTER: True # Whether to press Enter after typing the text

### For GPT parameters
MAX_TOKENS: 2000 # The max token limit for the response completion
MAX_RETRY: 3 # The max retry limit for the response completion
TEMPERATURE: 0.0 # The temperature of the model: the lower the value, the more consistent the output of the model
TOP_P: 0.0 # The top_p of the model: the lower the value, the more conservative the output of the model
TIMEOUT: 60 # The call timeout in seconds


### For RAG

## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: False # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents


## RAG Configuration for the Bing search
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: False # Whether to use the online search for the RAG.
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents


## RAG Configuration for experience
RAG_EXPERIENCE: True # Whether to use the offline RAG.
RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the retrieved experience records





38 changes: 38 additions & 0 deletions ufo/config/config_dev.yaml
@@ -0,0 +1,38 @@
CONTROL_BACKEND: "uia" # The backend for control action
MAX_STEP: 30 # The max step limit for completing the user request
SLEEP_TIME: 5 # The sleep time between each step to wait for the window to be ready
SAFE_GUARD: True # Whether to use the safeguard to prevent the model from performing sensitive operations.
CONTROL_TYPE_LIST: ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton"] # The list of control types that are allowed to be selected
HISTORY_KEYS: ["Step", "Thought", "ControlText", "Action", "Comment", "Results"] # The keys of the action history for the next step.
ANNOTATION_COLORS: {
"Button": "#FFF68F",
"Edit": "#A5F0B5",
"TabItem": "#A5E7F0",
"Document": "#FFD18A",
"ListItem": "#D9C3FE",
"MenuItem": "#E7FEC3",
"ScrollBar": "#FEC3F8",
"TreeItem": "#D6D6D6",
"Hyperlink": "#91FFEB",
"ComboBox": "#D8B6D4"
}

PRINT_LOG: False # Whether to print the log
CONCAT_SCREENSHOT: True # Whether to concat the screenshot for the control item
LOG_LEVEL: "DEBUG" # The log level
INCLUDE_LAST_SCREENSHOT: True # Whether to include the last screenshot in the observation
REQUEST_TIMEOUT: 250 # The call timeout for the GPT-V model

APP_SELECTION_PROMPT: "ufo/prompts/base/{mode}/app_selection.yaml" # The prompt for the app selection
ACTION_SELECTION_PROMPT: "ufo/prompts/base/{mode}/action_selection.yaml" # The prompt for the action selection

APP_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/app_example.yaml" # The prompt for the app selection
ACTION_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/action_example.yaml" # The prompt for the action selection

## For experience learning
EXPERIENCE_PROMPT: "ufo/prompts/experience/{mode}/experience_summary.yaml"
EXPERIENCE_SAVED_PATH: "vectordb/experience/"

API_PROMPT: "ufo/prompts/base/{mode}/api.yaml" # The prompt for the API
INPUT_TEXT_API: "type_keys" # The input text API
INPUT_TEXT_ENTER: True # whether to press enter after typing the text
