🌟 This is the official repository for Grounded-VideoLLM, a video large language model (Video-LLM) adept at fine-grained temporal grounding. Grounded-VideoLLM not only excels in grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
💡 We sharpen our model by incorporating:
- An additional temporal stream to encode the relationships between frames.
- Discrete temporal tokens enriched with specific time knowledge to represent timestamps.
- A multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance the temporal reasoning capability, we also curate a grounded VideoQA dataset with an automatic annotation pipeline.
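As a rough illustration of the discrete temporal tokens mentioned above, the sketch below maps a timestamp to a token index relative to the video duration and back. The bin count of 300, the `<k>` token spelling, and both function names are assumptions made for this example, not details taken from this repository.

```python
def time_to_token(t_seconds: float, duration: float, num_bins: int = 300) -> str:
    """Quantize an absolute timestamp into a relative temporal token.

    num_bins and the "<k>" spelling are illustrative assumptions.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    ratio = min(max(t_seconds / duration, 0.0), 1.0)  # clamp to [0, 1]
    return f"<{round(ratio * num_bins)}>"


def token_to_time(token: str, duration: float, num_bins: int = 300) -> float:
    """Map a temporal token back to seconds (the bin position, not the exact input)."""
    index = int(token.strip("<>"))
    return index / num_bins * duration


if __name__ == "__main__":
    token = time_to_token(14.2, duration=142.0)
    print(token, token_to_time(token, duration=142.0))
```

Because the tokens are relative to the video length, the round trip is only accurate to within one bin, i.e. `duration / num_bins` seconds.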
- [2024.10.4] Release the inference scripts and pretrained checkpoints.
- [2024.10.4] Release the annotated grounded-VideoQA dataset.
- [2024.10.4] Release the Phi3.5-Vision-Instruct version.
- [2024.10.29] Release the LLaVA-Next-LLAMA3-8B version, with stronger performance in both grounding tasks and general benchmarks.
- Release the training scripts and training datasets. We will try to adapt more MLLMs as the base model for Grounded-VideoLLM in the future.
Model Name | LLM | Charades-STA (R@0.3/R@0.5/R@0.7/mIoU) | ActivityNet-Grounding (R@0.3/R@0.5/R@0.7/mIoU) | ActivityNet-Captions (SODA_c/METEOR) | NEXT-GQA (GQA/mIoP/mIoU) | MVBench | Video-MME (w/o subs) |
---|---|---|---|---|---|---|---|
Grounded-VideoLLM | Phi3.5-3.8B | 54.2/36.4/19.7/36.8 | 46.2/30.3/19.0/36.1 | 6.0/6.8 | 26.7/34.5/21.1 | 59.4 | 47.7 |
Grounded-VideoLLM (*) | Phi3.5-3.8B | 70.2/55.9/33.2/49.4 | 64.9/47.8/30.4/47.2 | 6.6/6.5 | 29.4/37.4/27.0 | 60.0 | 48.1 |
- (*) means we incorporate a subset of the Charades-STA and ActivityNet training sets into the third training stage. Please refer to our paper for more results.
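For readers unfamiliar with the grounding columns, the sketch below shows how recall at an IoU threshold and mIoU are commonly computed from predicted and ground-truth time spans. It assumes the usual definitions; the evaluation scripts behind the table may differ in details, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ious: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose IoU reaches the threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(ious: Sequence[float]) -> float:
    """Average IoU over all samples."""
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pairs = [((14.2, 25.1), (13.0, 26.0)), ((100.0, 110.0), (108.0, 113.0))]
    ious = [temporal_iou(p, g) for p, g in pairs]
    print(recall_at(ious, 0.5), mean_iou(ious))
```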
- Clone this repository and navigate to the folder
git clone https://github.com/WHB139426/Grounded-Video-LLM.git
cd Grounded-Video-LLM
- Install Package
conda create -n grounded-videollm python=3.10.14
conda activate grounded-videollm
pip install torch==2.1.2 torchaudio==2.1.2 torchvision==0.16.2 torchdata==0.8.0 # install torch before flash-attn
pip install -r requirements.txt
pip install numpy==1.26.4 # to make sure numpy<2.0
Some installation suggestions
- We recommend that you pip install flash-attn==2.3.3 and run the model with torch.bfloat16. If your device doesn't support these, you can skip them and change the argparse parameters attn_implementation and dtype in inference.py, which may result in subtle numerical differences.
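The fallback described above can be sketched as a small helper that decides which values to pass. The returned strings ("flash_attention_2", "eager", "bfloat16", "float16") and the function name are assumptions for illustration; check the argparse definitions in inference.py for the values it actually accepts.

```python
import importlib.util
from typing import Optional


def pick_inference_settings(
    bf16_supported: bool, flash_attn_installed: Optional[bool] = None
) -> dict:
    """Choose attention backend and dtype, falling back when unsupported.

    The option strings below are assumed, not read from inference.py.
    """
    if flash_attn_installed is None:
        # Detect the flash-attn package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    return {
        "attn_implementation": "flash_attention_2" if flash_attn_installed else "eager",
        "dtype": "bfloat16" if bf16_supported else "float16",
    }


if __name__ == "__main__":
    # Example: a device without bfloat16 support.
    print(pick_inference_settings(bf16_supported=False))
```

Whether the device supports bfloat16 is passed in explicitly so the sketch stays free of a torch dependency; in practice you might obtain it from `torch.cuda.is_bf16_supported()`.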
Set your own weight_path to store the pretrained weights. The folder should be organized as follows:
├── Grounded-Video-LLM
│ └── inference.py
│ └── models
│ └── mm_utils
│ └── training
│ └── scripts
│ └── ...
├── weight_path
│ └── Phi-3.5-mini-instruct
│ └── Phi-3.5-vision-instruct-seperated
│ └── Phi-3.5-vision-instruct
│ └── llama3-llava-next-8b
│ └── llama3-llava-next-8b-seperated
│ └── Meta-Llama-3-8B-Instruct
│ └── ckpt
│ └── internvideo
│ └──...
Download the pretrained weights [🤗HF] into your own weight_path.
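After downloading, a quick check like the one below can confirm that the expected sub-folders are in place. This is a minimal sketch: the folder names are copied from the tree above, but which of them each model version actually needs is an assumption, and `missing_weight_dirs` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed set of folders for the Phi3.5 version (names taken from the tree above).
PHI35_DIRS = [
    "Phi-3.5-mini-instruct",
    "Phi-3.5-vision-instruct-seperated",
    "Phi-3.5-vision-instruct",
    "ckpt",
    "internvideo",
]


def missing_weight_dirs(weight_path: str, expected: Iterable[str] = PHI35_DIRS) -> List[str]:
    """Return the expected sub-folders that do not exist under weight_path."""
    root = Path(weight_path)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_weight_dirs("weight_path")
    print("all folders present" if not missing else f"missing: {missing}")
```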
We give a brief example of running the inference code. We recommend GPUs with 24GB of memory for the Phi3.5 version and 32GB of memory for the LLaVA-Next-LLAMA3-8B version.
- Replace the parameter weight_path in scripts/inference_phi3_5.sh or scripts/inference_llama3.sh with your own weight_path that you set above.
- Run the command bash scripts/inference_phi3_5.sh or bash scripts/inference_llama3.sh to reproduce the example below:
_3klvlS4W7A.mp4
[USER] Give you a textual query: "The female host wearing purple clothes is reporting news in the studio". When does the described content occur in the video? Please return the start and end timestamps.
[Grounded-VideoLLM] From 14.20 seconds to 25.09 seconds.
[USER] Give you a textual query: "A sign written with 'NO TRESPASSING LOITERING DRUGS'". When does the described content occur in the video? Please return the start and end timestamps.
[Grounded-VideoLLM] From 107.95 seconds to 113.16 seconds.
[USER] What is happening from 70 seconds to 80 seconds?
[Grounded-VideoLLM] A woman with glasses and a red shirt is talking to a reporter.
[USER] Why was the man in green clothes interviewed?
[Grounded-VideoLLM] The man in green clothes was interviewed to provide his perspective on the incident and the history of violence in the apartment complex.
[USER] Question: What does this TV news report about?\nOptions:\n(A) thievery\n(B) community violence incidents\n(C) fashion show\n(D) aging population
[Grounded-VideoLLM] Answer: (B) community violence incidents
- You can change the parameters prompt_grounding, prompt_videoqa, prompt_referring and video_path in inference.py's argparse to run your own case.
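When running your own grounding queries, it can help to build the prompt from a template and turn the answer back into numbers. The sketch below assumes the prompt wording and the "From X seconds to Y seconds" answer format seen in the example above; both helper names are hypothetical and the model may phrase answers differently.

```python
import re
from typing import Optional, Tuple

# Prompt wording copied from the example dialogue above.
GROUNDING_TEMPLATE = (
    'Give you a textual query: "{query}". When does the described content '
    "occur in the video? Please return the start and end timestamps."
)

# Matches answers such as "From 14.20 seconds to 25.09 seconds."
_SPAN = re.compile(
    r"From\s+(\d+(?:\.\d+)?)\s+seconds\s+to\s+(\d+(?:\.\d+)?)\s+seconds",
    re.IGNORECASE,
)


def build_grounding_prompt(query: str) -> str:
    """Fill the grounding template with a free-text query."""
    return GROUNDING_TEMPLATE.format(query=query)


def parse_span(answer: str) -> Optional[Tuple[float, float]]:
    """Extract (start, end) in seconds, or None if the answer has no span."""
    match = _SPAN.search(answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    print(build_grounding_prompt("A person opens the door"))
    print(parse_span("From 14.20 seconds to 25.09 seconds."))
```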
We provide the Grounded-VideoQA dataset that we annotated with GPT-4o-mini in [🤗HF]. You can download the videos following [ActivityNet] and [QVHighlights].
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our paper 📝.
@article{wang2024grounded,
title={Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models},
author={Wang, Haibo and Xu, Zhiyang and Cheng, Yu and Diao, Shizhe and Zhou, Yufan and Cao, Yixin and Wang, Qifan and Ge, Weifeng and Huang, Lifu},
journal={arXiv preprint arXiv:2410.03290},
year={2024}
}
We are grateful to the following awesome projects, which Grounded-VideoLLM builds on: Prismatic-VLMs, Phi-3.5-vision-instruct, InternVideo2, LLaVA-Next, TimeChat, VTimeLLM, Momentor.