
Commit

new release
vyokky committed Mar 25, 2024
1 parent b74b192 commit 1d6d443
Showing 5 changed files with 173 additions and 141 deletions.
56 changes: 46 additions & 10 deletions README.md
@@ -33,11 +33,11 @@ Both agents leverage the multi-modal capabilities of GPT-Vision to comprehend th


## 📢 News
- 📅 2024-03-25: **New Release for v0.1**! Check out our exciting new features:
1. UFO now supports RAG from offline document and online Bing search.
2. You can save the task completion trajectory for UFO's reference. It can improves its future successful rate!
3. We now support creating your help documents for each Windows applications to become an app expert. Check the [README](./learner/README.md) for more details!
4. You cam customize different GPT models for AppAgent and ActAgent. Text-only models (e.g. GPT-4) are now supported!
- 📅 2024-03-25: **New Release for v0.1!** Check out our exciting new features:
1. UFO now supports RAG from offline documents and online Bing search.
2. You can save the task completion trajectory into its memory for UFO's reference, improving its future success rate!
3. We now support creating your help documents for each Windows application to become an app expert. Check the [README](./learner/README.md) for more details!
4. You can customize different GPT models for AppAgent and ActAgent. Text-only models (e.g., GPT-4) are now supported!
- 📅 2024-02-14: Our [technical report](https://arxiv.org/abs/2402.07939) is online!
- 📅 2024-02-10: UFO is released on GitHub🎈. Happy Chinese New year🐉!

@@ -82,10 +82,10 @@
```bash
pip install -r requirements.txt
```

### ⚙️ Step 2: Configure the LLMs
Before running UFO, you need to provide your LLM configurations **individully for AppAgent and ActAgent**. You can create a config file `ufo/config/config_llm.yaml`, by copying the `ufo/config/config_llm.yaml.template` and editing config for "APP_AGENT" and "ACTION_AGENT" as follows:
Before running UFO, you need to provide your LLM configurations **individually for AppAgent and ActAgent**. You can create your own config file `ufo/config/config.yaml` by copying the `ufo/config/config.yaml.template` and editing the config for **APP_AGENT** and **ACTION_AGENT** as follows:

#### OpenAI
```bash
VISUAL_MODE: True, # Whether to use the visual mode
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint.
@@ -95,7 +95,7 @@ API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts
```

#### Azure OpenAI (AOAI)
```bash
API_TYPE: "aoai" , # The API type, "aoai" for the Azure OpenAI.
API_BASE: "YOUR_ENDPOINT", # The AOAI API address. Format: https://{your-resource-name}.openai.azure.com
API_KEY: "YOUR_KEY", # The aoai API key
@@ -106,8 +106,44 @@ API_DEPLOYMENT_ID: "YOUR_AOAI_DEPLOYMENT", # The deployment id for the AOAI API
```
You can optionally set a backup LLM engine in the "BACKUP_AGENT" field, which is used if the engines above fail.
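For reference, a minimal "BACKUP_AGENT" sketch mirroring the structure of the two agents above (based on `ufo/config/config.yaml.template`; the key and model values are placeholders to replace with your own):

```bash
BACKUP_AGENT: {
  VISUAL_MODE: True, # Whether to use the visual mode
  API_TYPE: "openai", # The API type, "openai" for the OpenAI API.
  API_BASE: "https://api.openai.com/v1/chat/completions", # The OpenAI API endpoint.
  API_KEY: "sk-", # The OpenAI API key
  API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
  API_MODEL: "gpt-4-vision-preview", # The model to fall back on when the primary engines fail
}
```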

### 📔 Step 3: Additional Settings for RAG (optional)
If you want to enhance UFO's ability with external knowledge, you can optionallly config it with external database for retrieval augmented generation (RAG).
If you want to enhance UFO's ability with external knowledge, you can optionally configure it with an external database for retrieval augmented generation (RAG) in the `ufo/config/config.yaml` file.

#### RAG from Offline Help Document
Before enabling this function, you need to create an offline indexer for your help documents. Please refer to the [README](./learner/README.md) to learn how to create an offline vector database for retrieval. You can enable this function by setting the following configuration:
```bash
## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: True # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
```
Adjust `RAG_OFFLINE_DOCS_RETRIEVED_TOPK` to control how many retrieved documents UFO consults.


#### RAG from Online Bing Search Engine
Enhance UFO's ability by utilizing the most up-to-date online search results! To use this function, you need to obtain a Bing search API key. Activate this feature by setting the following configuration:
```bash
## RAG Configuration for the Bing search
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: True # Whether to use the online search for the RAG.
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents
```
Adjust `RAG_ONLINE_SEARCH_TOPK` (the number of search results fetched) and `RAG_ONLINE_RETRIEVED_TOPK` (the number of those results retrieved as documents) to balance quality against speed.


#### RAG from Self-Demonstration
Save task completion trajectories into UFO's memory for future reference. This can improve its future success rates based on its previous experiences!

After completing a task, you'll see the following message:
```
Would you like to save the current conversation flow for future reference by the agent?
[Y] for yes, any other key for no.
```
Press `Y` to save it into its memory and enable memory retrieval via the following configuration:
```bash
## RAG Configuration for experience
RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the retrieved experience records
```
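Putting it all together, a configuration enabling all three knowledge sources at once could look like the sketch below (illustrative values; it assumes you have built the offline indexer and obtained a Bing search API key):

```bash
## Illustrative combined RAG setup
RAG_OFFLINE_DOCS: True # Offline help-document retrieval
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: True # Online Bing search retrieval
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents
RAG_EXPERIENCE: True # Retrieval from saved task trajectories
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the retrieved experience records
```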



@@ -137,7 +173,7 @@ Please enter your request to be completed🛸:
- The GPT-V accepts screenshots of your desktop and application GUI as input. Please ensure that no sensitive or confidential information is visible or captured during the execution process. For further information, refer to [DISCLAIMER.md](./DISCLAIMER.md).


### Step 4 🎥: Execution Logs
### Step 5 🎥: Execution Logs

You can find the screenshots taken and request & response logs in the following folder:
```
…
```
6 changes: 3 additions & 3 deletions ufo/config/config.py
@@ -25,11 +25,11 @@ def load_config(config_path="ufo/config/"):
# Update configs with YAML data
if yaml_data:
configs.update(yaml_data)
with open(path + "config_llm.yaml", "r") as file:
yaml_llm_data = yaml.safe_load(file)
with open(path + "config_dev.yaml", "r") as file:
yaml_dev_data = yaml.safe_load(file)
# Update configs with YAML data
if yaml_dev_data:
configs.update(yaml_llm_data)
configs.update(yaml_dev_data)
except FileNotFoundError:
print_with_color(
f"Warning: Config file not found at {config_path}. Using only environment variables.", "yellow")
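This commit splits the configuration into `config.yaml` (LLM settings) and `config_dev.yaml` (development settings), merged into a single flat dict at load time. Below is a minimal sketch of that merge order (a simplified, hypothetical standalone version; the real `load_config` also merges environment variables and prints a colored warning when a file is missing):

```python
import yaml

def load_config_sketch(path: str = "ufo/config/") -> dict:
    """Merge config.yaml, then config_dev.yaml; later keys win."""
    configs: dict = {}
    for name in ("config.yaml", "config_dev.yaml"):
        try:
            with open(path + name, "r") as file:
                data = yaml.safe_load(file)
            if data:  # skip empty files
                configs.update(data)
        except FileNotFoundError:
            pass  # fall back to whatever has been loaded so far
    return configs
```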
130 changes: 86 additions & 44 deletions ufo/config/config.yaml.template
@@ -1,59 +1,101 @@
version: 0.1
APP_AGENT: {
VISUAL_MODE: True, # Whether to use the visual mode

BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
API_TYPE: "openai" , # The API type, "openai" for the OpenAI API, "aoai" for the AOAI API, 'azure_ad' for the ad authority of the AOAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input


### Comment above and uncomment these if using "aoai".
# API_TYPE: "aoai" , # The API type, "openai" for the OpenAI API, "aoai" for the Azure OpenAI.
# API_BASE: "YOUR_ENDPOINT", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API. As for the aoai, it should be https://{your-resource-name}.openai.azure.com
# API_KEY: "YOUR_KEY", # The aoai API key
# API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
# API_MODEL: "YOUR_MODEL", # The only OpenAI model by now that accepts visual input
# API_DEPLOYMENT_ID: "gpt-4-visual-preview", # The deployment id for the AOAI API

### For Azure_AD
# AAD_TENANT_ID: "YOUR_TENANT_ID", # Set the value to your tenant id for the llm model
# AAD_API_SCOPE: "YOUR_SCOPE", # Set the value to your scope for the llm model
# AAD_API_SCOPE_BASE: "YOUR_SCOPE_BASE" # Set the value to your scope base for the llm model, whose format is API://YOUR_SCOPE_BASE, and the only need is the YOUR_SCOPE_BASE
}

ACTION_AGENT: {
VISUAL_MODE: True, # Whether to use the visual mode

API_TYPE: "openai" , # The API type, "openai" for the OpenAI API, "aoai" for the AOAI API, 'azure_ad' for the ad authority of the AOAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input


### Comment above and uncomment these if using "aoai".
# API_TYPE: "aoai" , # The API type, "openai" for the OpenAI API, "aoai" for the Azure OpenAI.
# API_BASE: "YOUR_ENDPOINT", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API. As for the aoai, it should be https://{your-resource-name}.openai.azure.com
# API_KEY: "YOUR_KEY", # The aoai API key
# API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
# API_MODEL: "YOUR_MODEL", # The only OpenAI model by now that accepts visual input
# API_DEPLOYMENT_ID: "gpt-4-visual-preview", # The deployment id for the AOAI API

### For Azure_AD
# AAD_TENANT_ID: "YOUR_TENANT_ID", # Set the value to your tenant id for the llm model
# AAD_API_SCOPE: "YOUR_SCOPE", # Set the value to your scope for the llm model
# AAD_API_SCOPE_BASE: "YOUR_SCOPE_BASE" # Set the value to your scope base for the llm model, whose format is API://YOUR_SCOPE_BASE, and the only need is the YOUR_SCOPE_BASE
}

BACKUP_AGENT: {
VISUAL_MODE: True, # Whether to use the visual mode

API_TYPE: "openai" , # The API type, "openai" for the OpenAI API, "aoai" for the AOAI API, 'azure_ad' for the ad authority of the AOAI API.
API_BASE: "https://api.openai.com/v1/chat/completions", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API.
API_KEY: "sk-", # The OpenAI API key, begin with sk-
API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
API_MODEL: "gpt-4-vision-preview", # The only OpenAI model by now that accepts visual input


### Comment above and uncomment these if using "aoai".
# API_TYPE: "aoai" , # The API type, "openai" for the OpenAI API, "aoai" for the Azure OpenAI.
# API_BASE: "YOUR_ENDPOINT", # The the OpenAI API endpoint, "https://api.openai.com/v1/chat/completions" for the OpenAI API. As for the aoai, it should be https://{your-resource-name}.openai.azure.com
# API_KEY: "YOUR_KEY", # The aoai API key
# API_VERSION: "2024-02-15-preview", # "2024-02-15-preview" by default
# API_MODEL: "YOUR_MODEL", # The only OpenAI model by now that accepts visual input
# API_DEPLOYMENT_ID: "gpt-4-visual-preview", # The deployment id for the AOAI API

### For Azure_AD
# AAD_TENANT_ID: "YOUR_TENANT_ID", # Set the value to your tenant id for the llm model
# AAD_API_SCOPE: "YOUR_SCOPE", # Set the value to your scope for the llm model
# AAD_API_SCOPE_BASE: "YOUR_SCOPE_BASE" # Set the value to your scope base for the llm model, whose format is API://YOUR_SCOPE_BASE, and the only need is the YOUR_SCOPE_BASE
}

CONTROL_BACKEND: "uia" # The backend for control action
MAX_STEP: 30 # The max step limit for completing the user request
SLEEP_TIME: 5 # The sleep time between each step to wait for the window to be ready
SAFE_GUARD: True # Whether to use the safeguard to prevent the model from performing sensitive operations.
CONTROL_TYPE_LIST: ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton"] # The list of control types that are allowed to be selected
HISTORY_KEYS: ["Step", "Thought", "ControlText", "Action", "Comment", "Results"] # The keys of the action history for the next step.
ANNOTATION_COLORS: {
"Button": "#FFF68F",
"Edit": "#A5F0B5",
"TabItem": "#A5E7F0",
"Document": "#FFD18A",
"ListItem": "#D9C3FE",
"MenuItem": "#E7FEC3",
"ScrollBar": "#FEC3F8",
"TreeItem": "#D6D6D6",
"Hyperlink": "#91FFEB",
"ComboBox": "#D8B6D4"
}

PRINT_LOG: False # Whether to print the log
CONCAT_SCREENSHOT: True # Whether to concat the screenshot for the control item
LOG_LEVEL: "DEBUG" # The log level
INCLUDE_LAST_SCREENSHOT: True # Whether to include the last screenshot in the observation
REQUEST_TIMEOUT: 250 # The call timeout for the GPT-V model

APP_SELECTION_PROMPT: "ufo/prompts/base/{mode}/app_selection.yaml" # The prompt for the app selection
ACTION_SELECTION_PROMPT: "ufo/prompts/base/{mode}/action_selection.yaml" # The prompt for the action selection

APP_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/app_example.yaml" # The prompt for the app selection
ACTION_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/action_example.yaml" # The prompt for the action selection

## For experience learning
EXPERIENCE_PROMPT: "ufo/prompts/experience/{mode}/experience_summary.yaml"
EXPERIENCE_SAVED_PATH: "vectordb/experience/"


API_PROMPT: "ufo/prompts/base/{mode}/api.yaml" # The prompt for the API
INPUT_TEXT_API: "type_keys" # The input text API
INPUT_TEXT_ENTER: True # Whether to press Enter after typing the text

### For GPT parameters
MAX_TOKENS: 2000 # The max token limit for the response completion
MAX_RETRY: 3 # The max retry limit for the response completion
TEMPERATURE: 0.0 # The temperature of the model: the lower the value, the more consistent the output of the model
TOP_P: 0.0 # The top_p of the model: the lower the value, the more conservative the output of the model
TIMEOUT: 60 # The call timeout in seconds


### For RAG

## RAG Configuration for the offline docs
RAG_OFFLINE_DOCS: False # Whether to use the offline RAG.
RAG_OFFLINE_DOCS_RETRIEVED_TOPK: 1 # The topk for the offline retrieved documents


## RAG Configuration for the Bing search
BING_API_KEY: "YOUR_BING_SEARCH_API_KEY" # The Bing search API key
RAG_ONLINE_SEARCH: False # Whether to use the online search for the RAG.
RAG_ONLINE_SEARCH_TOPK: 5 # The topk for the online search
RAG_ONLINE_RETRIEVED_TOPK: 1 # The topk for the online retrieved documents


## RAG Configuration for experience
RAG_EXPERIENCE: True # Whether to use the offline RAG.
RAG_EXPERIENCE: True # Whether to use the RAG from its self-experience.
RAG_EXPERIENCE_RETRIEVED_TOPK: 5 # The topk for the retrieved experience records





38 changes: 38 additions & 0 deletions ufo/config/config_dev.yaml
@@ -0,0 +1,38 @@
CONTROL_BACKEND: "uia" # The backend for control action
MAX_STEP: 30 # The max step limit for completing the user request
SLEEP_TIME: 5 # The sleep time between each step to wait for the window to be ready
SAFE_GUARD: True # Whether to use the safeguard to prevent the model from performing sensitive operations.
CONTROL_TYPE_LIST: ["Button", "Edit", "TabItem", "Document", "ListItem", "MenuItem", "ScrollBar", "TreeItem", "Hyperlink", "ComboBox", "RadioButton"] # The list of control types that are allowed to be selected
HISTORY_KEYS: ["Step", "Thought", "ControlText", "Action", "Comment", "Results"] # The keys of the action history for the next step.
ANNOTATION_COLORS: {
"Button": "#FFF68F",
"Edit": "#A5F0B5",
"TabItem": "#A5E7F0",
"Document": "#FFD18A",
"ListItem": "#D9C3FE",
"MenuItem": "#E7FEC3",
"ScrollBar": "#FEC3F8",
"TreeItem": "#D6D6D6",
"Hyperlink": "#91FFEB",
"ComboBox": "#D8B6D4"
}

PRINT_LOG: False # Whether to print the log
CONCAT_SCREENSHOT: True # Whether to concat the screenshot for the control item
LOG_LEVEL: "DEBUG" # The log level
INCLUDE_LAST_SCREENSHOT: True # Whether to include the last screenshot in the observation
REQUEST_TIMEOUT: 250 # The call timeout for the GPT-V model

APP_SELECTION_PROMPT: "ufo/prompts/base/{mode}/app_selection.yaml" # The prompt for the app selection
ACTION_SELECTION_PROMPT: "ufo/prompts/base/{mode}/action_selection.yaml" # The prompt for the action selection

APP_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/app_example.yaml" # The prompt for the app selection
ACTION_SELECTION_EXAMPLE_PROMPT: "ufo/prompts/examples/{mode}/action_example.yaml" # The prompt for the action selection

## For experience learning
EXPERIENCE_PROMPT: "ufo/prompts/experience/{mode}/experience_summary.yaml"
EXPERIENCE_SAVED_PATH: "vectordb/experience/"

API_PROMPT: "ufo/prompts/base/{mode}/api.yaml" # The prompt for the API
INPUT_TEXT_API: "type_keys" # The input text API
INPUT_TEXT_ENTER: True # whether to press enter after typing the text
