UGround

This is the official code repository for the project Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents.

Updates

  • 2025/01/07: UGround-V1-72B-Preview is out. Updated evaluation results in Main Results.

  • 2025/01/05: Qwen2-VL-based UGround-V1 achieves SOTA results on ScreenSpot-Pro, a new and comprehensive GUI grounding benchmark, substantially outperforming prior models (18.9 -> 31.1). Check the results and our tweet.

  • 2025/01/03: Qwen2-VL-based UGround-V1 has been released (2B & 7B). Check their performance in Main Results.

  • 2024/10/07: Preprint is arXived. Demo is live. Code coming soon.

  • 2024/08/06: Website is live. The initial manuscript and results are available.

Release Plans:

  • Model Weights
    • Initial V1 (the one used in the paper)
    • Qwen2-VL-based V1
      • 2B
      • 7B
      • 72B
    • V1.1
      • 2B
      • 7B
      • 72B
  • Code
    • Inference Code of UGround
    • Offline Experiments
      • Screenspot (along with referring expressions generated by GPT-4/4o)
      • Multimodal-Mind2Web
      • OmniAct
      • Android Control
    • Online Experiments
      • Mind2Web-Live-SeeAct-V
      • AndroidWorld-SeeAct-V
  • Data-V1
    • Data Examples
    • Data Construction Scripts
    • Guidance of Open-source Data
  • Data-V1.1
    • Data Mixture
  • Online Demo (HF Spaces)

Main Results

GUI Visual Grounding: ScreenSpot (Standard Setting)


| Model | Arch | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
| AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
| UGround-V1-72B-Preview | Qwen2-VL | UGround-V1 | 94.5 | 82.1 | 95.9 | 82.9 | 93.0 | 85.9 | 89.2 |

GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model | Arch | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |

Inference of Qwen2-VL-Based UGround

vLLM server

vllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16

or

python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16 

You can find more instructions about training and inference in Qwen2-VL's Official Repo.

Here we use float16 instead of bfloat16 for more stable decoding (see details in vLLM's docs).
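
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal sketch of the client setup assumed by the snippet in the next section; the base URL and API key are assumptions and must match the flags you passed to the server.

from openai import AsyncOpenAI

# Sketch: point an OpenAI-compatible client at the local vLLM server.
# The base_url and api_key below are assumptions; adjust them to your setup
# (the api_key must match the --api-key passed to `vllm serve`).
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)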

Visual Grounding Prompt

def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]
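
# (Sketch, not part of the original repo) One way to prepare the inputs above:
# encode a screenshot file as base64 and write a natural-language description
# of the target element. The file path and description below are hypothetical.
import base64

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

base64_image = encode_image("screenshot.png")
description = "the search button in the top toolbar"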


messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,  # the served model name, e.g., "osunlp/UGround-V1-7B"
    messages=messages,
    temperature=0,  # REMEMBER to set temperature to ZERO!
)

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
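
To turn the response into pixel coordinates, parse the predicted "(x, y)" string and rescale it by the screenshot size. The helper below is a sketch of our own; the regex parsing and variable names are not part of the official code.

import re

def to_pixels(response_text: str, width: int, height: int) -> tuple[int, int]:
    # Extract the two numbers from a response like "(512, 300)"
    x, y = map(float, re.findall(r"-?\d+(?:\.\d+)?", response_text)[:2])
    # Model coordinates are in [0, 1000); rescale to the actual screenshot size
    return int(x / 1000 * width), int(y / 1000 * height)

# Example usage (hypothetical screenshot size):
# x_px, y_px = to_pixels(completion.choices[0].message.content, width=1920, height=1080)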


Citation Information

If you find this work useful, please consider starring our repo and citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }