Comparisons with existing methods 💡

Updates 📌

[2024/7/3] We released the TokenPacker paper on arXiv.
[2024/7/3] We released the training and inference code.

What is TokenPacker 👀

TokenPacker is a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics into the generation of condensed visual tokens. With TokenPacker, we can compress the visual tokens by 75%∼89% while achieving comparable or even better performance across diverse benchmarks, with significantly higher efficiency.

High-Resolution Image Understanding with TokenPacker 🔬

To support efficient high-resolution image understanding, we further develop an effective image cropping method, TokenPacker-HD.

Install 🛠️

Clone this repository and navigate to the TokenPacker folder

git clone https://github.com/CircleRadon/TokenPacker.git
cd TokenPacker

Install packages

conda create -n tokenpacker python=3.10 -y
conda activate tokenpacker
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Training 🚀

LLaVA-TokenPacker

Dataset

To make a fair comparison, we use the same training data as in LLaVA-1.5, i.e., CC3M-595K for stage 1, and Mix665k for stage 2.

Training

Stage1: Image-Text Alignment Pre-training

bash scripts/v1_5/pretrain.sh

Stage2: Visual Instruction Tuning

bash scripts/v1_5/finetune.sh

Note: Use --down_rate to control the compression ratio. Supported values: [2,3,4].
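
To make the effect of --down_rate concrete, the following sketch works through the token arithmetic. It rests on two assumptions that this README does not state: that the LLaVA-1.5 vision encoder produces a 24 x 24 grid (576 visual tokens), and that --down_rate shrinks each side of that grid by the given factor. The function names are illustrative and are not part of the repository.

```python
# Assumption: the base encoder yields a 24 x 24 grid of visual tokens
# (336 px input / 14 px patches), and down_rate divides each grid side.
BASE_GRID = 24


def packed_token_count(down_rate: int, base_grid: int = BASE_GRID) -> int:
    """Visual tokens left after shrinking each grid side by down_rate."""
    if down_rate < 1 or base_grid % down_rate != 0:
        raise ValueError(
            f"down_rate {down_rate} does not evenly divide grid side {base_grid}"
        )
    side = base_grid // down_rate
    return side * side


def reduction_percent(down_rate: int, base_grid: int = BASE_GRID) -> float:
    """Percentage of visual tokens removed relative to the full grid."""
    total = base_grid * base_grid
    return 100.0 * (1 - packed_token_count(down_rate, base_grid) / total)


if __name__ == "__main__":
    for rate in (2, 3, 4):
        kept = packed_token_count(rate)
        print(f"down_rate={rate}: {kept} tokens, {reduction_percent(rate):.2f}% removed")
```

Under these assumptions, rates 2 and 3 leave 144 and 64 tokens, which matches the 75% to 89% compression range quoted in the overview, and rate 4 would leave 36 tokens.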

LLaVA-TokenPacker-HD

Dataset

To obtain competitive high-resolution performance, we use the 2.7M training data as organized by Mini-Gemini, i.e., 1.2M for stage 1 and 1.5M for stage 2.

Training

Stage1: Image-Text Alignment Pre-training

bash scripts/v1_5/pretrain_hd.sh

Stage2: Visual Instruction Tuning

bash scripts/v1_5/finetune_hd.sh

Note:

Use --down_rate to control the compression ratio. Supported values: [2,3,4].
Use --patch_num to control the maximum number of patches an image is divided into. Supported values: [9,16,25].
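
As a rough illustration of what a maximum patch count implies, the sketch below picks a crop grid for an image under a patch budget. This is a hypothetical heuristic written for this note, not the cropping method implemented by TokenPacker-HD: it assumes the grid is chosen to match the image aspect ratio as closely as possible, and it prefers more patches when two grids match equally well. The helper name `choose_grid` does not exist in the repository.

```python
import math


def choose_grid(width: int, height: int, patch_num: int = 9) -> tuple[int, int]:
    """Return (rows, cols) with rows * cols <= patch_num.

    Hypothetical heuristic: minimise the log-ratio distance between the
    grid's aspect ratio and the image's, breaking ties with more patches.
    """
    if width <= 0 or height <= 0 or patch_num < 1:
        raise ValueError("width, height and patch_num must be positive")

    target = width / height
    best = (1, 1)
    best_key = None
    for rows in range(1, patch_num + 1):
        # Only consider column counts that keep the grid within the budget.
        for cols in range(1, patch_num // rows + 1):
            ratio_error = abs(math.log((cols / rows) / target))
            key = (ratio_error, -(rows * cols))
            if best_key is None or key < best_key:
                best, best_key = (rows, cols), key
    return best


if __name__ == "__main__":
    for size in [(1000, 1000), (2000, 1000), (1000, 2000)]:
        print(size, choose_grid(*size, patch_num=9))
```

With a budget of 9, a square image maps to a 3 x 3 grid and a 2:1 landscape image to 2 rows by 4 columns in this sketch. Raising the budget to 16 or 25 allows finer grids at the cost of more visual tokens per image.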

Experiments

Visualization

We provide some visual examples.

High-resolution image understanding.

TODO List 📝

Release the training and inference codes.
Release all checkpoints.

Acknowledgement 💌

LLaVA-v1.5: the codebase we built upon.

BibTeX 🖊️

@misc{TokenPacker,
  title={TokenPacker: Efficient Visual Projector for Multimodal LLM},
  author={Wentong Li and Yuqian Yuan and Jian Liu and Dongqi Tang and Song Wang and Jianke Zhu and Lei Zhang},
  year={2024},
  eprint={2407.02392},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
