Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

yellow-binary-tree/MMDuet

MMDuet

Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Introduction

Watch video on Youtube

The video is also available on Bilibili.

MMDuet is a VideoLLM implemented in the video-text duet interaction format, which treats the video stream as a role in the conversation, akin to the user and the assistant. Under this interaction format, the video plays continuously and is fed to the model frame by frame. Both the user and the model can insert text messages right after any frame while the video plays. When a text message ends, the video continues to play, much like two performers taking turns in a duet.

This not only ensures a timely response for video comprehension, but also improves the performance on many time-sensitive video-text multimodal tasks, such as temporal video grounding, highlight detection, and dense video captioning.
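The turn-taking described above can be sketched as a simple loop. This is only an illustration of the interaction format, not the repository's implementation: the `Turn` class, the `duet_conversation` function, and the `toy_policy` stand-in for the model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Turn:
    role: str      # "video", "user", or "assistant"
    time: float    # timestamp (seconds) of the frame this turn follows
    content: str   # frame placeholder or text message


def duet_conversation(
    num_frames: int,
    fps: float,
    user_messages: Dict[int, str],
    respond: Callable[[List[Turn]], Optional[str]],
) -> List[Turn]:
    """Interleave video frames with text turns, one frame at a time."""
    history: List[Turn] = []
    for idx in range(num_frames):
        t = idx / fps
        # The video takes its turn by contributing the next frame.
        history.append(Turn("video", t, f"<frame {idx}>"))
        # The user may insert a message right after any frame.
        if idx in user_messages:
            history.append(Turn("user", t, user_messages[idx]))
        # After every frame the model either speaks or stays silent.
        reply = respond(history)
        if reply is not None:
            history.append(Turn("assistant", t, reply))
        # In both cases the video then continues with the next frame.
    return history


def toy_policy(history: List[Turn]) -> Optional[str]:
    """Hypothetical stand-in for the model: reply only right after a user message."""
    if history[-1].role == "user":
        return "answer to: " + history[-1].content
    return None


turns = duet_conversation(
    4, fps=2.0, user_messages={1: "What is happening?"}, respond=toy_policy
)
for turn in turns:
    print(f"{turn.time:4.1f}s {turn.role:9s} {turn.content}")
```

In the real model, the decision to speak after a frame is made by the VideoLLM itself rather than by a hand-written rule like `toy_policy`.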

Installation

  1. Clone the repository, create a conda environment, and install the dependencies with pip
git clone https://github.com/yellow-binary-tree/MMDuet
cd MMDuet

conda create -n mmduet python=3.10
conda activate mmduet
pip install --upgrade pip
pip install -r requirements.txt
  2. Install llava following the instructions in https://github.com/LLaVA-VL/LLaVA-NeXT
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
  3. Install flash-attention following the instructions in https://github.com/Dao-AILab/flash-attention. If you have difficulty installing it, add --attn_implementation sdpa to every training or inference command to use the SDPA attention implementation instead.

  4. Download MMDuet checkpoints from HuggingFace: https://huggingface.co/wangyueqian/MMDuet and put the files under folder ./outputs/mmduet.
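If you are unsure whether flash-attention installed correctly, a small check like the one below can decide whether the `--attn_implementation sdpa` fallback is needed. This is a minimal sketch, not part of the repository: `has_flash_attn` and `attention_args` are hypothetical helper names, and the check only tests that the `flash_attn` package can be found, not that it works on your GPU.

```python
import importlib.util
from typing import List


def has_flash_attn() -> bool:
    """Return True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def attention_args(flash_attn_available: bool) -> List[str]:
    """Extra command-line arguments to append when flash-attention is missing."""
    if flash_attn_available:
        return []
    return ["--attn_implementation", "sdpa"]


# Example: build the demo command with the fallback flag added only if needed.
command = ["python", "-m", "demo.app", "--lora_pretrained", "outputs/mmduet"]
command += attention_args(has_flash_attn())
print(" ".join(command))
```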

Demo

To launch a Gradio demo: python -m demo.app --lora_pretrained outputs/mmduet

Inference

Download model and data

Inference and evaluation

Scripts for running inference on all benchmarks are listed in ./scripts/inference/.

WARNING: Each script file contains many steps for inference and evaluation. DO NOT directly run these script files. Instead, read the contents of these files carefully and run them step by step.

  • YouCook2 dense video captioning: ./scripts/inference/youcook2.sh
  • Shot2Story-MAGQA-39k multi-answer grounded video question answering (MAGQA): ./scripts/inference/magqa.sh
    • Note: To save compute, we do not calculate the similarity score between the predicted answer and the gold answer if the predicted time is not in the gold timespan. We simply set this score to 1 in the score matrix of evaluator_output. These placeholder scores are not used in computing the final metric (in-span score) and do not affect it.
  • Charades-STA temporal video grounding: ./scripts/inference/charades.sh
  • QVHighlights highlight detection: ./scripts/inference/qvh.sh
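The MAGQA note above, about skipping similarity computation for out-of-span predictions, can be illustrated with a toy example. This is a sketch under stated assumptions, not the repository's evaluator: the word-overlap `toy_similarity`, the function names, and the plain mean over in-span pairs are all hypothetical, and the actual in-span score may be aggregated differently.

```python
from typing import Callable, List, Tuple

PLACEHOLDER = 1.0  # value written for pairs that are never actually scored

Pred = Tuple[float, str]                 # (predicted time, predicted answer)
Gold = Tuple[Tuple[float, float], str]   # ((start, end), gold answer)


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical similarity: Jaccard overlap of the word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def build_score_matrix(
    preds: List[Pred], golds: List[Gold], similarity: Callable[[str, str], float]
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Score only the pairs whose predicted time lies in the gold timespan."""
    matrix, in_span = [], []
    for pred_time, pred_text in preds:
        row, mask = [], []
        for (start, end), gold_text in golds:
            inside = start <= pred_time <= end
            # Out-of-span pairs get the placeholder, so no similarity call is made.
            row.append(similarity(pred_text, gold_text) if inside else PLACEHOLDER)
            mask.append(inside)
        matrix.append(row)
        in_span.append(mask)
    return matrix, in_span


def in_span_mean(matrix: List[List[float]], in_span: List[List[bool]]) -> float:
    """Assumed aggregation: mean over in-span pairs; placeholders are masked out."""
    values = [s for row, m in zip(matrix, in_span) for s, keep in zip(row, m) if keep]
    return sum(values) / len(values) if values else 0.0


preds = [(2.0, "cut the onion"), (9.0, "boil water")]
golds = [((0.0, 5.0), "cut the onion"), ((8.0, 12.0), "boil the water")]
matrix, in_span = build_score_matrix(preds, golds, toy_similarity)
print(matrix)
print(in_span_mean(matrix, in_span))
```

Because the mask removes every placeholder entry before aggregation, replacing the placeholder with any other value leaves the result unchanged.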

Training

Run ./scripts/train.sh.

When the training code runs for the first time, the dataset code traverses all videos in the training dataset, records the frame rate, duration, and number of frames of each video, and stores this information in datasets/${dataset_name}/videos_metadata.json. This can take quite a long time. Because videos downloaded from different sources may differ slightly, we do not include this metadata in our data release, so that it is computed from your own copies and the videos are loaded correctly.
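The compute-once-then-reuse pattern described above can be sketched as follows. This is not the repository's dataset code: `load_or_build_metadata`, the `probe` callable, and the field names (`fps`, `duration`, `n_frames`) are assumptions made for illustration, and the real videos_metadata.json schema may differ. The demo uses empty placeholder files and a fake probe instead of decoding real videos.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_or_build_metadata(
    video_dir: str, metadata_path: str, probe: Callable[[str], Dict]
) -> Dict[str, Dict]:
    """Return cached metadata if present; otherwise probe every video once and cache it."""
    if os.path.exists(metadata_path):
        with open(metadata_path) as f:
            return json.load(f)
    metadata = {}
    for name in sorted(os.listdir(video_dir)):
        path = os.path.join(video_dir, name)
        if os.path.isfile(path):
            metadata[name] = probe(path)  # the slow step on a real dataset
    with open(metadata_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata


calls = []


def fake_probe(path: str) -> Dict:
    """Hypothetical stand-in for a real video reader; returns fixed values."""
    calls.append(path)
    return {"fps": 30.0, "duration": 10.0, "n_frames": 300}


with tempfile.TemporaryDirectory() as root:
    video_dir = os.path.join(root, "videos")
    os.makedirs(video_dir)
    for name in ("a.mp4", "b.mp4"):
        open(os.path.join(video_dir, name), "w").close()  # empty placeholder files
    metadata_path = os.path.join(root, "videos_metadata.json")

    first = load_or_build_metadata(video_dir, metadata_path, fake_probe)   # probes both files
    second = load_or_build_metadata(video_dir, metadata_path, fake_probe)  # reads the cache

print(len(calls), sorted(first))
```

Deleting the cached JSON file forces the metadata to be rebuilt, which is useful after replacing or re-downloading videos.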

Acknowledgment

The following projects have been of great help to this work:

  • VideoLLM-online for providing the codebase we built upon,
  • LLaVA-NeXT for providing awesome multi-modal foundation models,
  • Shot2Story for providing high-quality clip-level video captions.

Citation

If you find this work useful in your research, please consider citing:

@misc{wang2024mmduet,
      title={VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format}, 
      author={Yueqian Wang and Xiaojun Meng and Yuxuan Wang and Jianxin Liang and Jiansheng Wei and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2411.17991},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17991}, 
}
