Skip to content

The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"

Notifications You must be signed in to change notification settings

DAMO-NLP-SG/VideoRefer

Repository files navigation

Static Badge arXiv preprint Dataset Model Benchmark

video Homepage

VideoRefer can understand any object you're interested within a video.

📰 News

  • [2025.1.1] We Release the code of VideoRefer and the VideoRefer-Bench.

🎥 Video

videorefer-demo.mp4
  • HD video can be viewed on YouTube.

🔍 About VideoRefer Suite

VideoRefer Suite is designed to enhance the fine-grained spatial-temporal understanding capabilities of Video Large Language Models (Video LLMs). It consists of three primary components:

  • Model (VideoRefer)

VideoRefer is an effective Video LLM, which enables fine-grained perceiving, reasoning, and retrieval for user-defined regions at any specified timestamps—supporting both single-frame and multi-frame region inputs.

  • Dataset (VideoRefer-700K)

VideoRefer-700K is a large-scale, high-quality object-level video instruction dataset. Curated using a sophisticated multi-agent data engine to fill the gap for high-quality object-level video instruction data.

  • Benchmark (VideoRefer-Bench)

VideoRefer-Bench is a comprehensive benchmark to evaluate the object-level video understanding capabilities of a model, which consists of two sub-benchmarks: VideoRefer-Bench-D and VideoRefer-Bench-Q.

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.8
  • Pytorch >= 2.2.0
  • CUDA Version >= 11.8
  • transformers == 4.40.0 (for reproducing paper results)
  • tokenizers == 0.19.1

Install required packages:

git clone https://github.com/DAMO-NLP-SG/VideoRefer
cd VideoRefer
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation

🌟 Getting started

Please refer to the examples in infer.ipynb for detailed instructions on how to use our model for single video inference, which supports both single-frame and multi-frame modes.

For better usage, the demo integrates with SAM2, to get started, please install SAM2 first:

git clone https://github.com/facebookresearch/sam2.git && cd sam2

SAM2_BUILD_CUDA=0 pip install -e ".[notebooks]"

Then, download sam2.1_hiera_large.pt to checkpoints.

🗝️ Training & Evaluation

Training

The training data and data structure can be found in Dataset preparation.

The training pipeline of our model is structured into four distinct stages.

  • Stage1: Image-Text Alignment Pre-training

  • Stage2: Region-Text Alignment Pre-training

    • Prepare datasets used for stage2.
    • Run bash scripts/train/stage2.sh.
  • Stage2.5: High-Quality Knowledge Learning

    • Prepare datasets used for stage2.5.
    • Run bash scripts/train/stage2.5.sh.
  • Stage3: Visual Instruction Tuning

    • Prepare datasets used for stage3.
    • Run bash scripts/train/stage3.sh.

Evaluation

For model evaluation, please refer to eval.

🌏 Model Zoo

Model Name Visual Encoder Language Decoder # Training Frames
VideoRefer-7B siglip-so400m-patch14-384 Qwen2-7B-Instruct 16
VideoRefer-7B-stage2 siglip-so400m-patch14-384 Qwen2-7B-Instruct 16
VideoRefer-7B-stage2.5 siglip-so400m-patch14-384 Qwen2-7B-Instruct 16

🕹️ VideoRefer-Bench

VideoRefer-Bench assesses the models in two key areas: Description Generation, corresponding to VideoRefer-BenchD, and Multiple-choice Question-Answer, corresponding to VideoRefer-BenchQ.

videorefer-bench.mp4
  • The annotations of the benchmark can be found in 🤗benchmark.

  • The usage of VideoRefer-Bench is detailed in doc.

📑 Citation

If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:

@article{yuan2025videorefersuite,
  title = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author = {Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing},
  journal={arXiv},
  year={2025},
  url = {http://arxiv.org/abs/2501.00599}
}
💡 Some other multimodal-LLM projects from our team may interest you ✨.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, Lidong Bing
github github arXiv

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
github github arXiv

Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu
github github arXiv

👍 Acknowledgement

The codebase of VideoRefer is adapted from VideoLLaMA 2. The visual encoder and language decoder we used in VideoRefer are Siglip and Qwen2, respectively.

About

The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •