Unifying Fine-grained Perception into MLLMs without Task Decoders. 16 tokens enable precise segmentation.


This repo is the official implementation of the paper 🛸 UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng†, Liwei Wang†

📣 News

  • [25-3-12] We release a separate repo of UFO-InternVL2-8B and add REC inference to the InternVL repo.
  • [25-3-4] 🚀 Training and inference code is released.
  • [25-3-3] 👀 UFO is released on arXiv.

Overview

👀 Todo

  • Release the arXiv version.
  • Release code and models of multi-task training on UFO-ViT.
  • Release code and models of fine-grained instruction tuning on UFO-InternVL2-8B and UFO-LLaVA-1.5-7B.
  • Release full code and models of multi-task training on UFO-InternVL2-8B.

🤔 Introduction

Previous efforts to introduce fine-grained perception tasks into MLLMs rely heavily on task-specific decoders or suboptimal formats (e.g., polygons), impeding unified visual modeling. To overcome this, we propose UFO:

  • 😮 We reformulate segmentation as embedding retrieval: the mask token embedding computes similarity with image features via dot product, and high-similarity positions are retrieved to form the mask (see the sketch after this list).

  • 🚀 We are the first to explore the image representation capabilities of MLLMs. We argue that since MLLMs excel at understanding, the mask information is already present in the image features; we only need to retrieve it.

  • 🤗 Fully aligned with the open-ended language interface: UFO unifies detection and segmentation through the open-ended language interface without any additional decoders, enabling seamless integration with MLLMs.

  • 🔥 Competitive performance: UFO surpasses GiT, a text-based generalist model, by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K. It also matches or exceeds decoder-based methods on various grounding tasks, eliminating the need for task-specific decoders.
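
To make the retrieval formulation concrete, below is a minimal sketch of the idea, not the repository's actual code: the tensor shapes, the single mask token, and the bilinear upsampling are simplifying assumptions (the 16 mask tokens mentioned in the tagline serve to further sharpen the mask).

```python
# A minimal sketch of segmentation as embedding retrieval; shapes and the
# single-token setup are assumptions, not the repo's implementation.
import torch
import torch.nn.functional as F

def retrieve_mask(mask_token: torch.Tensor,    # (D,) mask token embedding
                  image_feats: torch.Tensor,   # (H, W, D) MLLM image features
                  out_size=(448, 448),
                  threshold: float = 0.5) -> torch.Tensor:
    h, w, d = image_feats.shape
    # Dot-product similarity between the mask token and every spatial position.
    sim = image_feats.reshape(-1, d) @ mask_token      # (H*W,)
    sim = sim.view(1, 1, h, w)
    # Upsample the coarse similarity map to the output resolution.
    sim = F.interpolate(sim, size=out_size, mode="bilinear", align_corners=False)
    # Retrieve high-similarity positions to form the binary mask.
    return sim.sigmoid().squeeze() > threshold

# Toy usage with random tensors.
mask = retrieve_mask(torch.randn(256), torch.randn(32, 32, 256))
print(mask.shape)  # torch.Size([448, 448])
```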

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | config |
|---|---|---|---|---|---|
| UFO-ViT-B (detection) | 131M | mAP | 47.8 | ckpt | config |
| UFO-ViT-B (instance seg) | 131M | mAP | 42.6 | ckpt | config |
| UFO-ViT-B (semantic seg) | 131M | mIoU | 49.5 | ckpt | config |
| UFO-ViT-B (caption) | 131M | BLEU-4 | 34.2 | ckpt | config |
| UFO-ViT-B (grounding) | 131M | [email protected] | 83.6 | ckpt | config |

Multi-Task Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | config |
|---|---|---|---|---|---|---|---|---|
| UFO-ViT-B (multi-task) | 131M | 48.3 | 43.5 | 50.2 | 35.3 | 85.8 | ckpt | config |
| UFO-ViT-L (multi-task) | 387M | 52.9 | 47.3 | 54.0 | 35.9 | 88.5 | ckpt | config |
| UFO-ViT-H (multi-task) | 756M | 54.1 | 48.1 | 55.7 | 37.6 | 89.2 | ckpt | config |

Task Synergy in Multi-Task Training

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---|---|---|---|---|---|---|
| UFO-B (single-task) | 131M | 47.8 | 42.6 | 49.5 | 34.2 | 83.6 |
| UFO-B (multi-task) | 131M | 48.3 | 43.5 | 50.2 | 35.3 | 85.8 |
| Improvement | | +0.5 | +0.9 | +0.7 | +1.1 | +2.2 |

MLLM Performance on Multi-Task Benchmark

UFO-InternVL2-8B:

| Resolution | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | config |
|---|---|---|---|---|---|---|---|
| 448x448 | 44.0 | 37.4 | 53.9 | 39.6 | 90.4 | ckpt | config |
| 896x896 | 50.9 | 43.6 | 54.6 | - | - | ckpt | config |
| 1344x1344 | 51.9 | 45.2 | - | - | - | ckpt | config |

Visual Grounding

RefCOCO Validation Set

| Model | REC | RES | ckpt | config |
|---|---|---|---|---|
| UFO-LLaVA-1.5-7B | 89.9 | 76.2 | ckpt | config |
| UFO-LLaVA-1.5-7B (ft) | 90.8 | 77.2 | ckpt | config |
| UFO-InternVL2-8B | 90.7 | 77.3 | ckpt | config |
| UFO-InternVL2-8B (ft) | 91.4 | 78.0 | ckpt | config |

Reasoning Segmentation

| Model | Overall | Short Query | Long Query | ckpt | config |
|---|---|---|---|---|---|
| UFO-LLaVA-1.5-7B | 53.8 | 40.1 | 58.2 | ckpt | config |
| UFO-LLaVA-1.5-7B (ft) | 58.0 | 46.3 | 61.7 | ckpt | config |
| UFO-InternVL2-8B | 55.4 | 41.9 | 59.8 | ckpt | config |
| UFO-InternVL2-8B (ft) | 61.2 | 49.6 | 64.9 | ckpt | config |

๐Ÿ› ๏ธ Quick Start

Installation

```bash
conda create -n UFO python=3.11
conda activate UFO

pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 -f https://download.pytorch.org/whl/torch_stable.html
pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.1.0"
pip install "transformers==4.37.2"

git clone git@github.com:nnnth/UFO.git
cd UFO

pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt
```
  • (Optional) Install Java manually for image caption evaluation. Without Java, you can still train image captioning normally, but caption evaluation will fail.
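
After installing, you can optionally sanity-check the pinned dependency versions; the snippet below only reads standard version attributes:

```python
# Optional sanity check that the pinned dependency versions are installed.
import torch, torchvision, mmcv, mmengine, transformers

print(torch.__version__)         # expected: 2.1.0+cu118
print(torchvision.__version__)   # expected: 0.16.0+cu118
print(mmcv.__version__)          # expected: 2.1.0
print(mmengine.__version__)      # expected: 0.8.3
print(transformers.__version__)  # expected: 4.37.2
print(torch.cuda.is_available()) # True on a working CUDA setup
```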

Dataset Preparation

Multi-Tasking Dataset

We follow GiT to prepare the multi-task datasets. Please refer here for more details.

Instruction Tuning Dataset

We use 24 datasets for instruction tuning on MLLMs. For more details, please refer here.

Download Pretraining Weight

We use LLaVA-1.5-7B and InternVL2-8B as pretrained MLLMs. For multi-task training on UFO-ViT, we also use the BERT tokenizer and BERT embeddings. Please download and organize them as follows:

```text
UFO
|──ckpt
|──|──llava-1.5-7b-hf
|──|──InternVL2-8B
|──|──bert-base-uncased
|──|──bert_embed_womask.pt
|──|──bert_embed.pt
|──|──bert_embed_large.pt
|──|──bert_embed_huge.pt
```
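
If you prefer scripting the downloads, here is a hedged sketch using huggingface_hub; the Hub repo IDs below are assumptions about where these weights usually live, and the bert_embed*.pt files come from this project's own release rather than those repos.

```python
# Hypothetical download helper; verify the repo IDs on the Hugging Face Hub
# before use -- they are assumptions, not taken from this README.
from huggingface_hub import snapshot_download

for repo_id, local_dir in [
    ("llava-hf/llava-1.5-7b-hf", "ckpt/llava-1.5-7b-hf"),
    ("OpenGVLab/InternVL2-8B", "ckpt/InternVL2-8B"),
    ("google-bert/bert-base-uncased", "ckpt/bert-base-uncased"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Note: the bert_embed*.pt files are not in these repos; fetch them from the
# project's released checkpoints (see kanashi6/UFO below).
```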

For InternVL2-8B, we add a custom function for LoRA training. Please replace the original file as described in the linked issue.

Demo

Please download checkpoints from kanashi6/UFO, then save them under the root directory:

```text
UFO
|──ufo-vit-b-single-det.pth
|──ufo-vit-b-single-insseg.pth
|──...
```

Run the demo on detection (COCO):

```bash
python demo.py --img_path demo/demo.jpg --config configs/UFO-ViT/single_detection_base.py \
  --ckpt_path ./ufo-vit-b-single-det.pth --out_dir ./vis/ --task detection
```

Run the demo on RES (referring expression segmentation):

```bash
python demo.py --img_path demo/demo.jpg --config configs/UFO-InternVL2-8B/internvl2_8b_res_ft_2w.py \
  --ckpt_path ./ufo-internvl2-8b-res.pth --out_dir ./vis/ --task res --text bench
```

Scripts

For training and evaluation commands, please refer here.

๐Ÿ‘ Acknowledgement

  • MMDetection: the codebase we built upon. Thanks for providing such a convenient framework.
  • GiT: we use the multi-task benchmark established by GiT.
  • InternVL: we borrow the MLLM code from the InternVL repo.

📘 Citation

If you find our work helpful, please consider citing it as follows.

```bibtex
@article{tang2025ufo,
    title={UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface},
    author={Tang, Hao and Xie, Chenwei and Wang, Haiyang and Bao, Xiaoyi and Weng, Tingyu and Li, Pandeng and Zheng, Yun and Wang, Liwei},
    journal={arXiv preprint arXiv:2503.01342},
    year={2025}
}
```

✨ Star History

[Star History Chart]
