
Vision-Language Models Can Self-Improve Reasoning via Reflection


The code for the paper: Vision-Language Models Can Self-Improve Reasoning via Reflection. This repository contains the code to reproduce the self-training results in our paper.

[Framework overview figure]


🛠️ Installation & Environment

This codebase is built on VL-RLHF. Many thanks for their open-source work.

git clone https://github.com/njucckevin/MM-Self-Improve.git
cd MM-Self-Improve
pip install -e .

It is recommended to install FlashAttention for efficient training and inference:

pip install flash-attn==2.5.8 --no-build-isolation
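
As a quick sanity check (not part of the original instructions), you can verify the install with:

python -c "import flash_attn; print(flash_attn.__version__)"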

📝 Data & Model Preparation

This codebase currently provides code for VLM self-training on the TabMWP, ChartQA, and CLEVR-Math datasets. To reproduce the results, first download and unzip these three datasets (TabMWP, ChartQA, CLEVR-Math) and put them under the data/datasets directory. It should look like:

data
├── data_self_train
│   └── ...
└── datasets
    ├── tabmwp
    │   └── ...
    ├── chartqa
    │   └── ...
    └── clevr-math
        └── ...
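
As a sketch, assuming each download is a zip archive named after its dataset (the actual archive names may differ), the layout above can be produced with:

mkdir -p data/datasets
unzip tabmwp.zip -d data/datasets/tabmwp
unzip chartqa.zip -d data/datasets/chartqa
unzip clevr-math.zip -d data/datasets/clevr-math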

Then, download the official checkpoints of Qwen-VL-Chat and LLaVA-1.5 from Hugging Face 🤗.
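
For example, the checkpoints can be fetched with the huggingface_hub CLI. Qwen/Qwen-VL-Chat is the official Qwen-VL-Chat repo id; the LLaVA-1.5 repo id below is an assumption, so substitute the exact checkpoint you intend to use:

pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen-VL-Chat --local-dir your_qwenvl_ckpt_dir/Qwen-VL-Chat
huggingface-cli download llava-hf/llava-1.5-7b-hf --local-dir your_llava_ckpt_dir/llava-1.5-7b-hf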


🚀 Self-Training

Run the following command to launch self-training of Qwen-VL-Chat on the CLEVR-Math dataset (an analogous example for other model/dataset combinations is shown after the option list).

python self_train.py --model_name qwenvl --model_ckpt your_qwenvl_ckpt_dir/Qwen-VL-Chat --dataset_name clevr --dataset_dir ./data/datasets/clevr-math --gpu_ids 0,1,2,3,4,5,6,7
  • model_name: qwenvl or llava, i.e., whether to train Qwen-VL-Chat or LLaVA-1.5.
  • model_ckpt: the checkpoint of Qwen-VL-Chat or LLaVA-1.5 downloaded above.
  • dataset_name: tabmwp, clevr, or chartqa, the self-training dataset.
  • dataset_dir: the corresponding dataset directory.
  • gpu_ids: the IDs of the GPUs you wish to use.
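
For example, an analogous command to self-train LLaVA-1.5 on TabMWP (the checkpoint path is a placeholder):

python self_train.py --model_name llava --model_ckpt your_llava_ckpt_dir/llava-1.5 --dataset_name tabmwp --dataset_dir ./data/datasets/tabmwp --gpu_ids 0,1,2,3,4,5,6,7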

The script will run self-training iteratively and save a log file under ./log recording the training process, including dataset statistics and evaluation metrics.
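
To follow a run in real time (the log filename pattern here is an assumption; check ./log for the actual name):

tail -f ./log/*.log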


🚩 Qwen2-VL Results & Scaling of Test-Time Compute

To validate the generalizability of our framework, we applied it to Qwen2-VL, a recently released advanced MLLM. The results also demonstrate the ability of our framework to boost the reasoning performance of MLLMs by scaling test-time compute. See details in this repo.
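
For background, "scaling test-time compute" refers to sampling multiple reasoning paths per question and selecting among them. The sketch below shows generic best-of-N majority voting as one illustration of the idea; it is not this repository's implementation, and sample_fn is a hypothetical stand-in for a stochastic call to the VLM:

from collections import Counter

def majority_vote(sample_fn, question, n=16):
    # Generic best-of-N test-time scaling: draw n sampled answers
    # (temperature > 0) and return the most frequent one.
    answers = [sample_fn(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Usage (hypothetical): sample_fn wraps a VLM generation call and
# extracts the final answer string from the sampled reasoning path.
# best = majority_vote(lambda q: vlm_answer(q, temperature=0.7), question)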


Citation

If you find this work helpful, please consider starring 🌟 this repo and citing our paper.

@article{cheng2024vision,
  title={Vision-Language Models Can Self-Improve Reasoning via Reflection},
  author={Cheng, Kanzhi and Li, Yantao and Xu, Fangzhi and Zhang, Jianbing and Zhou, Hao and Liu, Yang},
  journal={arXiv preprint arXiv:2411.00855},
  year={2024}
}

Additionally, this project is built on the VL-RLHF framework.

@misc{vlrlhf,
  title = {VL-RLHF: A RLHF Infrastructure for Vision-Language Model},
  author = {Gongrui Zhang},
  howpublished = {\url{https://github.com/TideDra/VL-RLHF}},
  year = {2024}
}
