v2 pre-release; merge demo

brucemeek · Feb 13, 2025 · fd9db1a
2 parents dcca70c + f612ddb
commit fd9db1a
Showing 67 changed files with 6,901 additions and 2,192 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,8 +1,12 @@
-weights/icon_caption_blip2
-weights/icon_caption_florence
-weights/icon_detect/
-weights/icon_detect_v1_5/
-weights/icon_detect_v1_5_2/
-.gradio
-__pycache__/
-debug.ipynb
+weights/icon_caption_blip2
+weights/icon_caption_florence
+weights/icon_detect/
+weights/icon_detect_v1_5/
+weights/icon_detect_v1_5_2/
+.gradio
+__pycache__/
+debug.ipynb
+util/__pycache__/
+index.html?linkid=2289031
+wget-log
+weights/icon_caption_florence_v2/
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -7,11 +7,13 @@
 [![arXiv](https://img.shields.io/badge/Paper-green)](https://arxiv.org/abs/2408.00203)
 [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
-📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)] [[Models](https://huggingface.co/microsoft/OmniParser)] [huggingface space](https://huggingface.co/spaces/microsoft/OmniParser)
+📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/)] [[Models V2](https://huggingface.co/microsoft/OmniParser-v2.0)] [[Models](https://huggingface.co/microsoft/OmniParser)] [[huggingface space](https://huggingface.co/spaces/microsoft/OmniParser)]
 
 **OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. 
 
 ## News
+- [2025/2] We release V2 [checkpoints](https://huggingface.co/microsoft/OmniParser-v2.0) 
+- [2025/2] We introduce OmniTool: Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following large language models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. 
 - [2025/1] V2 is coming. We achieve new state of the art results 39.5% on the new grounding benchmark [Screen Spot Pro](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding/tree/main) with OmniParser v2 (will be released soon)! Read more details [here](https://github.com/microsoft/OmniParser/tree/master/docs/Evaluation.md).
 - [2024/11] We release an updated version, OmniParser V1.5 which features 1) more fine grained/small icon detection, 2) prediction of whether each screen element is interactable or not. Examples in the demo.ipynb. 
 - [2024/10] OmniParser was the #1 trending model on huggingface model hub (starting 10/29/2024). 
@@ -27,6 +29,13 @@ conda activate omni
 pip install -r requirements.txt
 ```
 
+Ensure you have the V2 weights downloaded in weights folder (ensure caption weights folder is called icon_caption_florence). If not download them with:
+```
+   rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence 
+   for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
+   mv weights/icon_caption weights/icon_caption_florence
+```
+<!-- ## [deprecated]
 Then download the model ckpts files in: https://huggingface.co/microsoft/OmniParser, and put them under weights/, default folder structure is: weights/icon_detect, weights/icon_caption_florence, weights/icon_caption_blip2. 
 
 For v1: 
@@ -36,18 +45,15 @@ python weights/convert_safetensor_to_pt.py
 
 For v1.5: 
 download 'model_v1_5.pt' from https://huggingface.co/microsoft/OmniParser/tree/main/icon_detect_v1_5, make a new dir: weights/icon_detect_v1_5, and put it inside the folder. No weight conversion is needed. 
-```
+``` -->
 
 ## Examples:
 We put together a few simple examples in the demo.ipynb. 
 
 ## Gradio Demo
 To run gradio demo, simply run:
 ```python
-# For v1
-python gradio_demo.py --icon_detect_model weights/icon_detect/best.pt --icon_caption_model florence2
-# For v1.5
-python gradio_demo.py --icon_detect_model weights/icon_detect_v1_5/model_v1_5.pt --icon_caption_model florence2
+python gradio_demo.py
 ```
 
 ## Model Weights License
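
The V2 weight download step added to the README in this diff (the `huggingface-cli` loop followed by `mv`) can also be sketched in Python. This is a minimal sketch, not part of the commit: it assumes the `huggingface_hub` package is installed, and the helper names `v2_weight_files` and `download_v2_weights` are hypothetical. The repository id and file names are copied from the shell snippet.

```python
import shutil
from pathlib import Path

# Repository id taken from the README snippet in this diff.
REPO_ID = "microsoft/OmniParser-v2.0"


def v2_weight_files():
    """Expand the same file list as the shell brace expansion in the README."""
    detect = [f"icon_detect/{n}" for n in ("train_args.yaml", "model.pt", "model.yaml")]
    caption = [
        f"icon_caption/{n}"
        for n in ("config.json", "generation_config.json", "model.safetensors")
    ]
    return detect + caption


def download_v2_weights(weights_dir="weights"):
    """Hypothetical helper mirroring the rm / download / mv steps of the README."""
    # Imported lazily so the file list can be inspected without the dependency.
    from huggingface_hub import hf_hub_download

    root = Path(weights_dir)
    # Equivalent of: rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
    for stale in ("icon_detect", "icon_caption", "icon_caption_florence"):
        shutil.rmtree(root / stale, ignore_errors=True)
    # Equivalent of the huggingface-cli download loop with --local-dir weights
    for name in v2_weight_files():
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=root)
    # Equivalent of: mv weights/icon_caption weights/icon_caption_florence
    (root / "icon_caption").rename(root / "icon_caption_florence")
```

Calling `download_v2_weights()` from the repository root should leave `weights/icon_detect/model.pt` and `weights/icon_caption_florence/`, which are the paths `gradio_demo.py` loads after this commit.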

diff --git a/SECURITY.md b/SECURITY.md
@@ -1,41 +1,41 @@
-<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
-
-## Security
-
-Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
-
-If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
-
-## Reporting Security Issues
-
-**Please do not report security vulnerabilities through public GitHub issues.**
-
-Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
-
-If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com).  If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
-
-You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). 
-
-Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
-
-  * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
-  * Full paths of source file(s) related to the manifestation of the issue
-  * The location of the affected source code (tag/branch/commit or direct URL)
-  * Any special configuration required to reproduce the issue
-  * Step-by-step instructions to reproduce the issue
-  * Proof-of-concept or exploit code (if possible)
-  * Impact of the issue, including how an attacker might exploit the issue
-
-This information will help us triage your report more quickly.
-
-If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
-
-## Preferred Languages
-
-We prefer all communications to be in English.
-
-## Policy
-
-Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
-
-<!-- END MICROSOFT SECURITY.MD BLOCK -->
+<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
+
+## Security
+
+Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
+
+If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
+
+## Reporting Security Issues
+
+**Please do not report security vulnerabilities through public GitHub issues.**
+
+Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
+
+If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com).  If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
+
+You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). 
+
+Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
+
+  * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
+  * Full paths of source file(s) related to the manifestation of the issue
+  * The location of the affected source code (tag/branch/commit or direct URL)
+  * Any special configuration required to reproduce the issue
+  * Step-by-step instructions to reproduce the issue
+  * Proof-of-concept or exploit code (if possible)
+  * Impact of the issue, including how an attacker might exploit the issue
+
+This information will help us triage your report more quickly.
+
+If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
+
+## Preferred Languages
+
+We prefer all communications to be in English.
+
+## Policy
+
+Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
+
+<!-- END MICROSOFT SECURITY.MD BLOCK -->
diff --git a/demo.ipynb b/demo.ipynb
diff --git a/gradio_demo.py b/gradio_demo.py
@@ -8,12 +8,13 @@
 
 
 import base64, os
-from utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
+from util.utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
 import torch
 from PIL import Image
-import argparse
-
 
+yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
+caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="weights/icon_caption_florence")
+# caption_model_processor = get_caption_model_processor(model_name="blip2", model_name_or_path="weights/icon_caption_blip2")
 
 MARKDOWN = """
 # OmniParser for Pure Vision Based General GUI Agent 🔥
@@ -36,9 +37,9 @@ def process(
     box_threshold,
     iou_threshold,
     use_paddleocr,
-    imgsz,
-    icon_process_batch_size,
+    imgsz
 ) -> Optional[Image.Image]:
+
     image_save_path = 'imgs/saved_image_demo.png'
     image_input.save(image_save_path)
     image = Image.open(image_save_path)
@@ -54,27 +55,13 @@ def process(
     ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_save_path, display_img = False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold':0.9}, use_paddleocr=use_paddleocr)
     text, ocr_bbox = ocr_bbox_rslt
     # print('prompt:', prompt)
-    dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_save_path, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=True, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold, imgsz=imgsz, batch_size=icon_process_batch_size)  
+    dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_save_path, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=True, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold, imgsz=imgsz,)  
     image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
     print('finish processing')
-    # parsed_content_list = '\n'.join(parsed_content_list)
-    parsed_content_list = '\n'.join([f'type: {x['type']}, content: {x["content"]}, interactivity: {x["interactivity"]}' for x in parsed_content_list])
+    parsed_content_list = '\n'.join([f'icon {i}: ' + str(v) for i,v in enumerate(parsed_content_list)])
+    # parsed_content_list = str(parsed_content_list)
     return image, str(parsed_content_list)
 
-
-parser = argparse.ArgumentParser(description='Process model paths and names.')
-parser.add_argument('--icon_detect_model', type=str, required=True, default='weights/icon_detect/best.pt', help='Path to the YOLO model weights')
-parser.add_argument('--icon_caption_model', type=str, required=True, default='florence2',  help='Name of the caption model')
-
-args = parser.parse_args()
-icon_detect_model, icon_caption_model = args.icon_detect_model, args.icon_caption_model
-
-yolo_model = get_yolo_model(model_path=icon_detect_model)
-if icon_caption_model == 'florence2':
-    caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="weights/icon_caption_florence")
-elif icon_caption_model == 'blip2':
-    caption_model_processor = get_caption_model_processor(model_name="blip2", model_name_or_path="weights/icon_caption_blip2")
-
 with gr.Blocks() as demo:
     gr.Markdown(MARKDOWN)
     with gr.Row():
@@ -88,11 +75,9 @@ def process(
             iou_threshold_component = gr.Slider(
                 label='IOU Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.1)
             use_paddleocr_component = gr.Checkbox(
-                label='Use PaddleOCR', value=False)
+                label='Use PaddleOCR', value=True)
             imgsz_component = gr.Slider(
-                label='Icon Detect Image Size', minimum=640, maximum=3200, step=32, value=1920)
-            icon_process_batch_size_component = gr.Slider(
-                label='Icon Process Batch Size', minimum=1, maximum=256, step=1, value=64)
+                label='Icon Detect Image Size', minimum=640, maximum=1920, step=32, value=640)
             submit_button_component = gr.Button(
                 value='Submit', variant='primary')
         with gr.Column():
@@ -106,16 +91,10 @@ def process(
             box_threshold_component,
             iou_threshold_component,
             use_paddleocr_component,
-            imgsz_component,
-            icon_process_batch_size_component
+            imgsz_component
         ],
         outputs=[image_output_component, text_output_component]
     )
 
 # demo.launch(debug=False, show_error=True, share=True)
 demo.launch(share=True, server_port=7861, server_name='0.0.0.0')
-
-
-
-# python gradio_demo.py --icon_detect_model weights/icon_detect/best.pt --icon_caption_model florence2
-# python gradio_demo.py --icon_detect_model weights/icon_detect_v1_5/model_v1_5.pt --icon_caption_model florence2
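
The `gradio_demo.py` hunk above changes how `parsed_content_list` is rendered for the text output: each element is now printed whole with an `icon {i}: ` prefix instead of picking out individual fields. The following is a small illustration of that formatting, not code from the commit. The function name `format_parsed_content` and the sample entries are made up; the keys `type`, `content`, and `interactivity` are the ones referenced by the removed line.

```python
def format_parsed_content(parsed_content_list):
    """Render one line per parsed element, as the new join expression does."""
    return "\n".join(f"icon {i}: {v}" for i, v in enumerate(parsed_content_list))


# Hypothetical sample data; real entries come from get_som_labeled_img.
sample = [
    {"type": "text", "content": "Settings", "interactivity": False},
    {"type": "icon", "content": "Search", "interactivity": True},
]

print(format_parsed_content(sample))
```

One practical side effect of the new expression is that it avoids the removed line's `x['type']` inside a single-quoted f-string, a construct that only parses on Python 3.12 and later.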
diff --git a/imgs/demo_image.jpg b/imgs/demo_image.jpg
diff --git a/imgs/demo_image_som.jpg b/imgs/demo_image_som.jpg
diff --git a/imgs/gradioicon.png b/imgs/gradioicon.png
diff --git a/imgs/header_bar.png b/imgs/header_bar.png
diff --git a/imgs/header_bar_thin.png b/imgs/header_bar_thin.png
diff --git a/imgs/mobile.png b/imgs/mobile.png
diff --git a/imgs/omniboxicon.png b/imgs/omniboxicon.png
diff --git a/imgs/omniparsericon.png b/imgs/omniparsericon.png
diff --git a/imgs/saved_image_demo.png b/imgs/saved_image_demo.png
diff --git a/imgs/som_overlaid_omni.png b/imgs/som_overlaid_omni.png
diff --git a/omniparser.py b/omniparser.py
diff --git a/omnitool/gradio/.gitignore b/omnitool/gradio/.gitignore
@@ -0,0 +1 @@
+tmp/
diff --git a/weights/train_args.yaml → omnitool/gradio/__init__.py b/weights/train_args.yaml → omnitool/gradio/__init__.py