MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
arXiv 2023.
MovieChat can handle videos with >10K frames on a 24GB graphics card. MovieChat has a 10000× advantage over other methods in terms of the average increase in GPU memory cost per frame (21.3KB/f versus ~200MB/f).
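As a rough sanity check of the figures above, the two per-frame costs can be compared directly. This is only a back-of-the-envelope sketch: it assumes binary units (1 MB = 1024 KB), which the text does not specify, and it treats the ~200MB/f figure as exact.

```python
# Back-of-the-envelope check of the per-frame GPU memory figures quoted above.
KB_PER_MB = 1024  # assumption: binary units; the source does not say

moviechat_kb_per_frame = 21.3   # from the text
other_mb_per_frame = 200.0      # from the text, stated as approximate

# How many times larger the per-frame growth of other methods is.
ratio = other_mb_per_frame * KB_PER_MB / moviechat_kb_per_frame

# Extra memory MovieChat would accumulate over a 10K-frame video at this rate.
growth_mb = 10_000 * moviechat_kb_per_frame / KB_PER_MB

print(f"per-frame ratio: about {ratio:.0f}x")
print(f"growth over 10K frames: about {growth_mb:.0f} MB")
```

Under these assumptions the ratio lands on the order of 10^4, and the growth over a 10K-frame video stays far below the 24GB budget.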
- [2023.11.27] We update the paper with implementation details, technical evaluations, and dataset information.
- [2023.11.23] We update the latest source code of MovieChat.
- [2023.8.1] We release the paper.
- [2023.7.31] We release eval code and instructions for short video QA on MSVD-QA, MSRVTT-QA and ActivityNet-QA.
- [2023.7.29] We release a Gradio demo of MovieChat.
- [2023.7.22] We release the source code of MovieChat.
To better evaluate the performance of MovieChat, we collect a new benchmark for long video understanding tasks, MovieChat-1K, which contains 1K high-quality video clips sourced from various movies and TV series with 14K manual annotations.
To the best of our knowledge, a long video understanding dataset has not yet been established. Our work represents an initial step in creating one and making it publicly available. We create MovieChat-1K, containing 1K long videos with 1K corresponding dense captions and 13K visual question-answer pairs. For each video, we manually provide 1 dense caption for the whole video, 3 question-answer pairs for global mode, and 10 question-answer pairs with timestamps for breakpoint mode.
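The per-video annotation layout described above can be pictured with a small data structure. This is a minimal sketch only: the class and field names (`QAPair`, `VideoAnnotation`, `timestamp`, and so on) are hypothetical and do not reflect the released file format, and the unit of the timestamp is left open because the text does not state it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QAPair:
    question: str
    answer: str
    # Hypothetical field: only breakpoint-mode questions carry a timestamp.
    timestamp: Optional[int] = None


@dataclass
class VideoAnnotation:
    video_id: str
    caption: str  # one dense caption for the whole video
    global_qa: List[QAPair] = field(default_factory=list)
    breakpoint_qa: List[QAPair] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True if the record matches the 1 + 3 + 10 layout described in the text."""
        return (
            bool(self.caption)
            and len(self.global_qa) == 3
            and len(self.breakpoint_qa) == 10
            and all(q.timestamp is not None for q in self.breakpoint_qa)
        )


def annotation_totals(num_videos: int = 1000) -> Dict[str, int]:
    """Totals implied by 1 caption, 3 global and 10 breakpoint QA pairs per video."""
    qa_pairs = num_videos * (3 + 10)
    return {"captions": num_videos, "qa_pairs": qa_pairs, "total": num_videos + qa_pairs}
```

With 1K videos this gives 1K captions plus 13K question-answer pairs, which matches the 14K manual annotations quoted for the benchmark.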
We collect videos from 15 popular categories with varying distribution, including documentary films, detective films, animated films, and so on. Each video comprises multiple alternating scenes, contributing to a diverse and dynamic visual narrative. Over 90% of the videos have a duration between 10K and 12K frames, while 14.6% of videos extend beyond 12K frames. Only 8.6% of videos are shorter than 10K frames.
Note that MovieChat-1K is specifically designed for long video comprehension tasks. The majority of questions are open-ended, with only a quarter classified as multiple-choice questions, marked by initiators such as "Do," "Does," "Is," or "Are." We also compute the word distribution of our question-answer pairs, which includes common objects (people, clothes, etc.), time (day, night, etc.), scenes (indoor, outdoor, etc.), and so on.
MovieChat-1K exhibits diverse lengths of question-answer pairs at the segmented clip level. Although the distribution of question-answer pairs varies between the global mode and breakpoint mode, the majority of questions are between 5 and 15 words long, while answers generally have fewer than 10 words.
To facilitate a more detailed understanding of long videos, we provide a dense caption for each video. MovieChat-1K exhibits diverse caption lengths at the segmented clip level. Approximately two-thirds of the clips have captions with 100-149 words, while one-fifth of the clip captions have fewer than 100 words. About 11% of clips have long captions with more than 150 words.
We also compute the word distribution of our generated captions. The resulting distribution is presented in Fig. B6 and includes common objects (man, woman, people, girl, etc.), attributes (detective, various, small, white, etc.), locations (inside, behind, south, next, etc.), scenes (room, house, building, office, etc.), actions/events (talk, enter, leave, take, etc.), and more.
In terms of actionness, MovieChat-1K captions contain nearly as many verbs as the WebVid10M dataset. To evaluate this, we use the NLTK toolkit to tag the captions and extract all unique verbs. We find a total of 109,485 verbs in the WebVid10M caption dataset, while the MovieChat-1K captions contain 102,988 unique instances of verbs. While these counts may not be entirely accurate due to our simple counting method, we believe they provide a rough indication of the actionness of the two datasets.
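One plausible reading of this verb count can be sketched as follows. The exact counting script is not given, so this is an assumption: captions are tokenized and POS-tagged with NLTK, and a token counts as a verb when its Penn Treebank tag starts with `VB`. The helper names are hypothetical, and `tag_caption` needs NLTK's tokenizer and tagger data to be installed.

```python
from typing import Callable, Iterable, List, Set, Tuple

TaggedTokens = List[Tuple[str, str]]


def unique_verbs(tagged: Iterable[Tuple[str, str]]) -> Set[str]:
    """Collect lowercased tokens whose Penn Treebank tag is a verb tag (VB, VBD, VBG, ...)."""
    return {word.lower() for word, tag in tagged if tag.startswith("VB")}


def tag_caption(caption: str) -> TaggedTokens:
    """Tokenize and POS-tag one caption with NLTK (requires its tokenizer and tagger data)."""
    import nltk  # imported lazily so the counting logic can be used without NLTK

    return nltk.pos_tag(nltk.word_tokenize(caption))


def count_unique_verbs(
    captions: Iterable[str],
    tagger: Callable[[str], TaggedTokens] = tag_caption,
) -> int:
    """Number of distinct verb tokens across all captions."""
    verbs: Set[str] = set()
    for caption in captions:
        verbs |= unique_verbs(tagger(caption))
    return len(verbs)
```

Passing the tagger as a parameter keeps the counting logic separate from NLTK, so a different tagger can be swapped in. Note that this version counts inflected forms ("talk", "talks") separately; lemmatizing first would give lower numbers.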
Due to copyright concerns and the size of the movies, we plan to release the features of the dataset. Please wait for a few weeks.
First, create a conda environment:
conda env create -f environment.yml
conda activate moviechat
Before using the repository, make sure you have obtained the following checkpoints:
- Get the original LLaMA weights in the Hugging Face format by following the instructions here.
- Download Vicuna delta weights [7B] (Note: we use v0 weights instead of v1.1 weights).
- Use the following command to add delta weights to the original LLaMA weights to obtain the Vicuna weights:
python apply_delta.py \
--base ckpt/LLaMA/7B_hf \
--target ckpt/Vicuna/7B \
--delta ckpt/Vicuna/vicuna-7b-delta-v0
- Download the MiniGPT-4 model (trained linear layer) from this link.
- Download pretrained weights to run MovieChat with Vicuna-7B as language decoder locally from this link.
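Before running `apply_delta.py`, it can help to confirm that its inputs are where the command expects them. The sketch below is not part of the repository; the helper name and labels are hypothetical, and the paths are simply the ones used in the command above.

```python
from pathlib import Path
from typing import Dict, List

# Inputs of the apply_delta.py command shown above (paths relative to the repo root).
EXPECTED_INPUTS: Dict[str, str] = {
    "original LLaMA weights (Hugging Face format)": "ckpt/LLaMA/7B_hf",
    "Vicuna v0 delta weights": "ckpt/Vicuna/vicuna-7b-delta-v0",
}


def missing_checkpoints(root: str = ".", expected: Dict[str, str] = EXPECTED_INPUTS) -> List[str]:
    """Return a description of every expected checkpoint path that does not exist under root."""
    return [
        f"{label}: {relative}"
        for label, relative in expected.items()
        if not (Path(root) / relative).exists()
    ]


if __name__ == "__main__":
    for problem in missing_checkpoints():
        print("missing:", problem)
```

An empty result means both inputs were found; the target directory `ckpt/Vicuna/7B` is created by the command itself and is therefore not checked.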
First, set llama_model, llama_proj_model, and ckpt in eval_configs/MovieChat.yaml.
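A quick way to confirm that the three entries are filled in is to load the config and look for them. This is a sketch under assumptions: the nesting of the keys inside the YAML file is not described here, so the helper searches the whole structure, and the function names are hypothetical. Loading the file itself would need PyYAML (`yaml.safe_load`), which is shown only in a comment.

```python
from typing import Any, Iterator, List, Tuple

REQUIRED_KEYS = ("llama_model", "llama_proj_model", "ckpt")


def find_keys(node: Any, wanted: Tuple[str, ...], path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every wanted key found anywhere in a nested config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                yield ".".join(path + (str(key),)), value
            yield from find_keys(value, wanted, path + (str(key),))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_keys(item, wanted, path + (str(index),))


def unset_keys(config: Any) -> List[str]:
    """Return the required keys that are absent or have an empty value."""
    found = {dotted.split(".")[-1]: value for dotted, value in find_keys(config, REQUIRED_KEYS)}
    return [key for key in REQUIRED_KEYS if not found.get(key)]


# Usage (requires PyYAML):
#   import yaml
#   with open("eval_configs/MovieChat.yaml") as f:
#       print(unset_keys(yaml.safe_load(f)))
```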
Then run the script:
python inference.py \
--cfg-path eval_configs/MovieChat.yaml \
--gpu-id 0 \
--num-beams 1 \
--temperature 1.0 \
--text-query "What is he doing?" \
--video-path src/examples/Cooking_cake.mp4 \
--fragment-video-path src/video_fragment/output.mp4 \
--cur-min 1 \
--cur-sec 1 \
--middle-video 1
Note that to use the global mode (understanding and question answering over the whole video), you need to change --middle-video to 0.
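The two modes differ only in the value of `--middle-video`, so the command can be assembled programmatically. The sketch below is not part of the repository: the function names are hypothetical, and it assumes that `--cur-min` and `--cur-sec` are the minute and second of the breakpoint, as the flag names suggest. It uses only the flags shown in the command above.

```python
from typing import List, Tuple


def split_timestamp(total_seconds: int) -> Tuple[int, int]:
    """Convert a position in seconds into (minutes, seconds) for --cur-min / --cur-sec."""
    return divmod(total_seconds, 60)


def build_inference_command(
    video_path: str,
    text_query: str,
    breakpoint_mode: bool = True,
    cur_min: int = 1,
    cur_sec: int = 1,
    fragment_video_path: str = "src/video_fragment/output.mp4",
    cfg_path: str = "eval_configs/MovieChat.yaml",
    gpu_id: int = 0,
) -> List[str]:
    """Assemble the inference.py argument list; global mode sets --middle-video to 0."""
    return [
        "python", "inference.py",
        "--cfg-path", cfg_path,
        "--gpu-id", str(gpu_id),
        "--num-beams", "1",
        "--temperature", "1.0",
        "--text-query", text_query,
        "--video-path", video_path,
        "--fragment-video-path", fragment_video_path,
        "--cur-min", str(cur_min),
        "--cur-sec", str(cur_sec),
        "--middle-video", "1" if breakpoint_mode else "0",
    ]


# Usage, from the repository root:
#   import subprocess
#   minutes, seconds = split_timestamp(61)
#   cmd = build_inference_command("src/examples/Cooking_cake.mp4", "What is he doing?",
#                                 cur_min=minutes, cur_sec=seconds)
#   subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when the text query contains spaces or punctuation.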
We are grateful to the following awesome projects that MovieChat builds on:
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- Token Merging: Your ViT but Faster
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
- MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
- FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
- LLaMA: Open and Efficient Foundation Language Models
- VideoChat: Chat-Centric Video Understanding
- LLaVA: Large Language and Vision Assistant
Our MovieChat is just a research preview intended for non-commercial use only. You must NOT use our MovieChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that will potentially violate these guidelines.
If you find MovieChat useful for your research and applications, please cite using this BibTeX:
@article{song2023moviechat,
title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},
author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},
journal={arXiv preprint arXiv:2307.16449},
year={2023}
}