Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"

PPLLaVA

News 📢

  • [2024/11/4] We have added a Gradio chat demo; see the instructions in the Demo section below.
  • [2024/10/28] All code and weights are now available! Watch this repository for the latest updates.

Introduction 💡

  • PPLLaVA is an effective and efficient video large language model. Our model consists of three components:
    • (1) Fine-grained vision-prompt alignment.
    • (2) Visual token compression guided by the user instruction, using convolution-style pooling (see the sketch after the table below).
    • (3) CLIP context extension.
  • PPLLaVA sets new state-of-the-art results on VideoMME, MVBench, VideoChatGPT Bench, and VideoQA Bench, while using only 1024 visual tokens and achieving 8x higher throughput.
Method           | Image Pretrain | LLM        | VideoMME | VCGBench | MVBench | ActivityNetQA
-----------------|----------------|------------|----------|----------|---------|--------------
VideoLLaMA       | BLIP-2         | Vicuna-7B  | -        | 1.96     | 34.1    | 12.4
LLaMA-Adapter    | -              | Vicuna-7B  | -        | 2.03     | 31.7    | 34.2
VideoChat        | BLIP-2         | Vicuna-7B  | -        | 2.23     | 35.5    | 26.5
VideoChatGPT     | LLaVA-1.0      | Vicuna-7B  | -        | 2.38     | 32.7    | 35.2
BT-Adapter       | LLaVA-1.0      | Vicuna-7B  | -        | 2.69     | -       | 45.7
LLaMA-VID        | InstructBLIP   | Vicuna-13B | -        | 2.89     | -       | 47.4
VideoChat2       | -              | Vicuna-7B  | -        | 2.98     | 51.1    | 49.1
Chat-UniVi       | LLaVA-1.5      | Vicuna-7B  | 45.9     | 2.99     | -       | 47.2
STLLM            | InstructBLIP   | Vicuna-7B  | 42.3     | 3.15     | -       | 50.9
PLLaVA           | LLaVA-Next     | Vicuna-7B  | -        | 3.12     | 46.6    | 56.3
VLM-RLAIF        | LLaVA-1.5      | Vicuna-7B  | -        | 3.49     | -       | 57.3
LLaVA-Next-Video | LLaVA-Next     | Vicuna-7B  | 45.0     | 3.66     | -       | 60.2
PPLLaVA          | LLaVA-Next     | Vicuna-7B  | 53.6     | 3.73     | 59.2    | 60.7
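
To make the components above concrete, here is a minimal PyTorch sketch of what prompt-guided, convolution-style token pooling (component 2) could look like. It is an illustration under our own assumptions, not the official implementation: the tensor shapes, the pooling sizes, and the function name prompt_guided_pool are all hypothetical.

# Minimal, hypothetical sketch of prompt-guided token compression (not the official code).
# Assumes CLIP-style visual tokens `vis` of shape (T, H, W, D) and a pooled prompt
# embedding `txt` of shape (D,); the 16x8x8 output (= 1024 tokens) is illustrative.
import torch
import torch.nn.functional as F

def prompt_guided_pool(vis, txt, out_t=16, out_h=8, out_w=8):
    T, H, W, D = vis.shape
    # Fine-grained vision-prompt alignment: cosine similarity between each visual
    # token and the prompt embedding acts as a relevance weight.
    sim = F.cosine_similarity(vis, txt.view(1, 1, 1, D), dim=-1)        # (T, H, W)
    weights = sim.flatten().softmax(dim=0).view(T, H, W, 1)
    # Convolution-style (adaptive average) pooling over the weighted tokens
    # compresses T*H*W tokens down to out_t*out_h*out_w.
    weighted = (vis * weights).permute(3, 0, 1, 2).unsqueeze(0)         # (1, D, T, H, W)
    pooled = F.adaptive_avg_pool3d(weighted, (out_t, out_h, out_w))
    norm = F.adaptive_avg_pool3d(weights.permute(3, 0, 1, 2).unsqueeze(0),
                                 (out_t, out_h, out_w))
    pooled = pooled / norm.clamp_min(1e-6)                              # weighted average per cell
    return pooled.squeeze(0).flatten(1).T                               # (out_t*out_h*out_w, D)

For example, 32 frames of 24x24 patch tokens (18,432 tokens; these figures are chosen only for illustration) would be reduced to the 1,024 visual tokens mentioned above.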

Demo 🤗

Please download the conversation weights from here and follow the instructions in Installation first. Then run the Gradio demo:

CUDA_VISIBLE_DEVICES=0 python3 demo.py --ckpt-path /path/to/PPLLaVA_conversation_weight

Examples 👀

  • Video Dense Caption: PPLLaVA can effectively balance the content, state, and motion of both the foreground and background, while maintaining detail and accuracy.
  • Multi-turn dialogue and reasoning: PPLLaVA can engage in smooth Q&A interactions and provide reasonable inferences.

Installation 🛠️

Clone our repository, then create and activate a Python environment with the following commands:

git clone https://github.com/farewellthree/PPLLaVA.git
cd PPLLaVA
conda create --name ppllava python=3.9
conda activate ppllava
pip install -r requirement.txt
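
After the install finishes, a quick sanity check can confirm that PyTorch sees the GPU before launching the demo. The snippet below is illustrative and not part of the repository; it only assumes that requirement.txt installs a CUDA-enabled PyTorch build.

# Hypothetical sanity check (not part of the repository): verify that the
# environment has a CUDA-capable PyTorch build before launching the demo.
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())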

Training & Validation 📊

Instructions for data preparation, training, and evaluation can be found in trainval.md.

Citation ✏️

If you find the code and paper useful for your research, please consider starring this repo and citing our paper:

@inproceedings{liu2025st,
  title={St-llm: Large language models are effective temporal learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  booktitle={European Conference on Computer Vision},
  pages={1--18},
  year={2025},
  organization={Springer}
}
@article{liu2024ppllava,
  title={PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance},
  author={Liu, Ruyang and Tang, Haoran and Liu, Haibo and Ge, Yixiao and Shan, Ying and Li, Chen and Yang, Jiankun},
  journal={arXiv preprint arXiv:2411.02327},
  year={2024}
}
