- [2024/11/4] We have added the Gradio chat demo; see the instructions below.
- [2024/10/28] All code and weights are now available! Watch this repository for the latest updates.
- PPLLaVA is an effective and efficient video large language model that incorporates three components:
- (1) Fine-grained vision-prompt alignment.
- (2) Visual token compression guided by the user instruction, via convolution-style pooling (see the sketch below).
- (3) CLIP context extension.
- PPLLaVA establishes new state-of-the-art results on VideoMME, MVBench, VideoChatGPT Bench, and VideoQA Bench, using only 1024 visual tokens and delivering 8x higher throughput.
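To make the instruction-guided token compression concrete, here is a minimal sketch, assuming CLIP patch features arranged as (frames, height, width, dim): each patch is weighted by its similarity to the instruction embedding, then convolution-style (adaptive average) pooling reduces the grid to 1024 tokens. All names, shapes, and the exact weighting scheme are illustrative assumptions, not the repository's implementation.

```python
# A minimal sketch of prompt-guided, convolution-style visual token compression.
# Shapes, the weighting scheme, and all names are illustrative assumptions,
# not the exact PPLLaVA implementation.
import torch
import torch.nn.functional as F

def compress_visual_tokens(frame_tokens, prompt_embed, out_t=4, out_h=16, out_w=16):
    """Compress per-frame CLIP patch features into a small, fixed token budget.

    frame_tokens: (T, H, W, D) patch features for T frames (assumed layout).
    prompt_embed: (D,) embedding of the user instruction in the same space.
    Returns: (out_t * out_h * out_w, D) tokens, e.g. 4 * 16 * 16 = 1024.
    """
    T, H, W, D = frame_tokens.shape

    # Soft relevance of every patch to the prompt (cosine similarity -> softmax).
    patches = F.normalize(frame_tokens, dim=-1)
    prompt = F.normalize(prompt_embed, dim=-1)
    scores = torch.einsum("thwd,d->thw", patches, prompt)
    weights = scores.flatten().softmax(dim=0).view(T, H, W)

    # Emphasize prompt-relevant patches, then pool locally (convolution-style)
    # over time and space down to the target grid.
    weighted = frame_tokens * weights.unsqueeze(-1)   # (T, H, W, D)
    pooled = F.adaptive_avg_pool3d(
        weighted.permute(3, 0, 1, 2),                 # (D, T, H, W)
        output_size=(out_t, out_h, out_w),
    )                                                 # (D, out_t, out_h, out_w)
    return pooled.flatten(1).transpose(0, 1)          # (out_t*out_h*out_w, D)

# Toy usage: 32 frames of 24x24 patches with 1024-dim features.
video_feats = torch.randn(32, 24, 24, 1024)
instruction = torch.randn(1024)
tokens = compress_visual_tokens(video_feats, instruction)
print(tokens.shape)  # torch.Size([1024, 1024])
```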
Method | Image Pretrain | LLM | VideoMME | VCGBench | MVBench | ActivityNetQA |
---|---|---|---|---|---|---|
VideoLLaMA | BLIP-2 | Vicuna-7B | - | 1.96 | 34.1 | 12.4 |
LLaMA-Adapter | - | Vicuna-7B | - | 2.03 | 31.7 | 34.2 |
VideoChat | BLIP-2 | Vicuna-7B | - | 2.23 | 35.5 | 26.5 |
VideoChatGPT | LLaVA-1.0 | Vicuna-7B | - | 2.38 | 32.7 | 35.2 |
BT-Adapter | LLaVA-1.0 | Vicuna-7B | - | 2.69 | - | 45.7 |
LLaMA-VID | InstructBLIP | Vicuna-13B | - | 2.89 | - | 47.4 |
VideoChat2 | - | Vicuna-7B | - | 2.98 | 51.1 | 49.1 |
Chat-UniVi | LLaVA-1.5 | Vicuna-7B | 45.9 | 2.99 | - | 47.2 |
STLLM | InstructBLIP | Vicuna-7B | 42.3 | 3.15 | - | 50.9 |
PLLaVA | LLaVA-Next | Vicuna-7B | - | 3.12 | 46.6 | 56.3 |
VLM-RLAIF | LLaVA-1.5 | Vicuna-7B | - | 3.49 | - | 57.3 |
LLaVA-Next-Video | LLaVA-Next | Vicuna-7B | 45.0 | 3.66 | - | 60.2 |
PPLLaVA | LLaVA-Next | Vicuna-7B | 53.6 | 3.73 | 59.2 | 60.7 |
Please download the conversation weights from here and follow the installation instructions first. Then, run the Gradio demo:
CUDA_VISIBLE_DEVICES=0 python3 demo.py --ckpt-path /path/to/PPLLaVA_conversation_weight
- Video Dense Caption: PPLLaVA effectively balances the content, state, and motion of both the foreground and the background while maintaining detail and accuracy.
- Multi-turn dialogue and reasoning: PPLLaVA can engage in smooth Q&A interactions and provide reasonable inferences.
Clone our repository, then create and activate a Python environment with the following commands:
git clone https://github.com/farewellthree/PPLLaVA.git
cd PPLLaVA
conda create --name ppllava python=3.9
conda activate ppllava
pip install -r requirement.txt
Instructions for data preparation, training, and evaluation can be found in trainval.md.
If you find the code and paper useful for your research, please consider starring this repo and citing our paper:
@inproceedings{liu2025st,
title={ST-LLM: Large language models are effective temporal learners},
author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
booktitle={European Conference on Computer Vision},
pages={1--18},
year={2025},
organization={Springer}
}
@article{liu2024ppllava,
title={PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance},
author={Liu, Ruyang and Tang, Haoran and Liu, Haibo and Ge, Yixiao and Shan, Ying and Li, Chen and Yang, Jiankun},
journal={arXiv preprint arXiv:2411.02327},
year={2024}
}