From Dense Token to Sparse Memory for Long Video Understanding

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang✉️
arXiv 2023.

MovieChat can handle videos with more than 10K frames on a 24GB graphics card. Its average GPU memory cost grows by about 21.3 KB per frame, compared with roughly 200 MB per frame for other methods, an advantage of about 10000×.
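To make the memory comparison concrete, the following back-of-the-envelope sketch multiplies the two per-frame figures quoted above by a frame count. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and ignores the fixed cost of the model weights. The function and variable names are illustrative and not part of the repository.

```python
# Back-of-the-envelope arithmetic for the per-frame memory figures quoted above.
# Assumption: decimal units (1 KB = 1e3 bytes, 1 MB = 1e6 bytes, 1 GB = 1e9 bytes).
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000

MOVIECHAT_BYTES_PER_FRAME = 21.3 * KB  # 21.3 KB/frame, from the text
OTHER_BYTES_PER_FRAME = 200 * MB       # ~200 MB/frame, from the text


def extra_memory_gb(num_frames: int, bytes_per_frame: float) -> float:
    """Memory added by the frames alone, in GB (model weights not included)."""
    return num_frames * bytes_per_frame / GB


ratio = OTHER_BYTES_PER_FRAME / MOVIECHAT_BYTES_PER_FRAME
moviechat_gb = extra_memory_gb(10_000, MOVIECHAT_BYTES_PER_FRAME)
other_gb = extra_memory_gb(10_000, OTHER_BYTES_PER_FRAME)

print(f"per-frame ratio: about {ratio:,.0f}x")
print(f"10,000 frames at 21.3 KB/frame: {moviechat_gb:.3f} GB")
print(f"10,000 frames at ~200 MB/frame: {other_gb:,.0f} GB")
```

Under these assumptions, 10,000 frames add roughly 0.2 GB at 21.3 KB per frame, whereas ~200 MB per frame would add about 2,000 GB, far beyond a 24GB card.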

🔥 News

[2023.8.1] 📃 We release the paper.
[2023.7.31] We release eval code and instructions for short video QA on MSVD-QA, MSRVTT-QA and ActivityNet-QA.
[2023.7.29] We release Gradio demo of MovieChat.
[2023.7.22] We release source code of MovieChat.

Overview

Examples

Question and answer about clips from Zootopia, a cartoon, which tells the story of a determined police officer rabbit named Judy who pairs up with a cunning fox to uncover a conspiracy about missing animals and develop an unexpected friendship.

Question and answer about clips from Goblin, which tells the story of Kim Shin, an immortal "goblin" who needs to find a human bride to end his endless life but instead meets Ji Eun-tak, a girl fated to die who claims to be the "goblin's bride," leading to a romantic tale unfolding bet.

Install

Environment Preparation

First, create a conda environment:

conda env create -f environment.yml
conda activate moviechat

Prerequisites

Before using the repository, make sure you have obtained the following checkpoints:

Pre-trained Language Decoder

Get the original LLaMA weights in the Hugging Face format by following the instructions here.
Download Vicuna delta weights 👉 [7B] (Note: we use v0 weights instead of v1.1 weights).
Use the following command to add delta weights to the original LLaMA weights to obtain the Vicuna weights:

python apply_delta.py \
    --base ckpt/LLaMA/7B_hf \
    --target ckpt/Vicuna/7B \
    --delta ckpt/Vicuna/vicuna-7b-delta-v0
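
Conceptually, a delta checkpoint stores the difference between the fine-tuned weights and the original weights, so applying it typically amounts to an element-wise addition per parameter. The toy sketch below illustrates that idea on plain Python lists. It is not the code in apply_delta.py, and the function name, parameter names, and values are made up for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def apply_delta_toy(base: Weights, delta: Weights) -> Weights:
    """Return target weights where target[name] = base[name] + delta[name]."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise addition recovers the fine-tuned values.
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Hypothetical two-parameter "model"; real checkpoints hold tensors, not lists.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5], "layer.bias": [0.5]}
target = apply_delta_toy(base, delta)
print(target)
```

This also shows why the base weights and the delta must match: a delta released against one LLaMA checkpoint only makes sense when added to that same checkpoint.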

Pre-trained Visual Encoder for MovieChat

Download the MiniGPT-4 model (trained linear layer) from this link.

Download Pretrained Weights

Download pretrained weights to run MovieChat with Vicuna-7B as language decoder locally from this link.

How to Run Demo Locally

First, set llama_model, llama_proj_model, and ckpt in eval_configs/MovieChat.yaml. Then run the script:

python inference.py \
    --cfg-path eval_configs/MovieChat.yaml \
    --gpu-id 0 \
    --num-beams 1 \
    --temperature 1.0 \
    --text-query "What is he doing?" \
    --video-path src/examples/Cooking_cake.mp4 \
    --fragment-video-path src/video_fragment/output.mp4 \
    --cur-min 1 \
    --cur-sec 1 \
    --middle-video 1

To use global mode (understanding and question answering over the whole video), set --middle-video to 0.
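
When switching between the two settings often, it can help to assemble the command programmatically. The sketch below builds the same argument list as the command above and only toggles --middle-video. The helper name build_inference_command and its parameters are hypothetical, and keeping --cur-min and --cur-sec in global mode is an assumption carried over from the example rather than a documented requirement.

```python
import shlex
from typing import List


def build_inference_command(global_mode: bool,
                            text_query: str = "What is he doing?") -> List[str]:
    """Build the inference.py argument list; only --middle-video changes."""
    # Per the note above: 0 selects global mode, 1 is the value in the example.
    middle_video = "0" if global_mode else "1"
    return [
        "python", "inference.py",
        "--cfg-path", "eval_configs/MovieChat.yaml",
        "--gpu-id", "0",
        "--num-beams", "1",
        "--temperature", "1.0",
        "--text-query", text_query,
        "--video-path", "src/examples/Cooking_cake.mp4",
        "--fragment-video-path", "src/video_fragment/output.mp4",
        "--cur-min", "1",
        "--cur-sec", "1",
        "--middle-video", middle_video,
    ]


cmd = build_inference_command(global_mode=True)
# Print a copy-pasteable shell command; the query is quoted automatically.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) from the repository root would launch the run without any manual shell quoting.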

Acknowledgement

We are grateful to the following awesome projects, which MovieChat builds on:

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Token Merging: Your ViT but Faster
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
EVA-CLIP: Improved Training Techniques for CLIP at Scale
LLaMA: Open and Efficient Foundation Language Models
VideoChat: Chat-Centric Video Understanding
LLaVA: Large Language and Vision Assistant

Terms of Use

Our MovieChat is just a research preview intended for non-commercial use only. You must NOT use our MovieChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that will potentially violate these guidelines.

Citation

If you find MovieChat useful for your research and applications, please cite using this BibTeX:

@article{song2023moviechat,
  title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},
  author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},
  journal={arXiv preprint arXiv:2307.16449},
  year={2023}
}
