This is the official implementation of the paper "Modular Prompt Learning Improves Vision-Language Models".
We propose a modular design for deep prompting methods.
Overview of our approach: our proposed deep prompting method uses add, remove, and carry operations to control the context length of prompts.
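The snippet below is a minimal, illustrative PyTorch sketch of this idea, not the code in this repository: a hypothetical `ModularPromptLayer` drops some incoming context tokens ("remove"), passes the rest to the next layer ("carry"), and appends fresh learnable tokens ("add"), so the context length can change from layer to layer. All names and the exact token bookkeeping are assumptions made for illustration.

```python
# Illustrative sketch only; not the repository's actual implementation.
import torch
import torch.nn as nn


class ModularPromptLayer(nn.Module):
    """Hypothetical per-layer prompt module with add / remove / carry operations."""

    def __init__(self, dim: int, n_add: int, n_remove: int, n_carry: int):
        super().__init__()
        self.n_remove = n_remove  # how many incoming context tokens to drop
        self.n_carry = n_carry    # how many remaining tokens to pass to the next layer
        # "add": fresh learnable context tokens introduced at this layer
        self.added_ctx = nn.Parameter(torch.randn(n_add, dim) * 0.02)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (n_ctx_in, dim) context tokens arriving from the previous layer
        kept = ctx[: max(ctx.size(0) - self.n_remove, 0)]  # "remove"
        carried = kept[: self.n_carry]                      # "carry"
        # "add": the context length leaving this layer is at most n_carry + n_add
        return torch.cat([carried, self.added_ctx], dim=0)


layer = ModularPromptLayer(dim=512, n_add=4, n_remove=2, n_carry=2)
ctx_out = layer(torch.randn(4, 512))
print(ctx_out.shape)  # torch.Size([6, 512]): 2 carried + 4 newly added tokens
```

With `n_remove=0` and `n_carry` equal to the incoming context length, this sketch reduces to ordinary deep prompting in which all prompts are carried through every layer.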
- Data Preparation
Please follow the instructions in DATASETS.md to prepare the datasets.
Please follow the steps below to create a virtual environment and install dependencies.
- Setup virtual environment
# create a virtual environment
conda create -n mdprompt python=3.10
# activate virtual environment
conda activate mdprompt
# install dependencies
pip install -r requirements.txt
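After installation, a quick sanity check such as the following can confirm that the environment is usable (a sketch; it assumes the requirements include PyTorch, which the CoOp/MaPLe code base is built on):

```python
# Verify that PyTorch is installed and a GPU is visible (assumption: PyTorch
# is among the dependencies listed in requirements.txt).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```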
- Run experiments
(1) Base-to-Novel class generalization setting
# train and evaluate on base classes
# 1st argument is the dataset; possible datasets include caltech101, food101, dtd, ucf101, oxford_flowers, oxford_pets, fgvc_aircraft, stanford_cars, sun397, eurosat, imagenet
# 2nd argument is the seed
bash scripts/mpl/base2new_train.sh imagenet 1
# test on novel classes
bash scripts/mpl/base2new_test.sh imagenet 1
(2) Cross-Dataset Transfer
First, train MPL on ImageNet using few-shot learning:
# seed=1
bash scripts/mpl/xd_train.sh imagenet 1
# seed=2
bash scripts/mpl/xd_train.sh imagenet 2
# seed=3
bash scripts/mpl/xd_train.sh imagenet 3
Then evaluate MPL on the downstream datasets:
for SEED in 1 2 3
do
bash scripts/mpl/xd_test.sh caltech101 ${SEED}
bash scripts/mpl/xd_test.sh food101 ${SEED}
done
If you find our work helpful, please consider citing:
@inproceedings{huangmodular,
title={Modular Prompt Learning Improves Vision-Language Models},
author={Zhenhan Huang and Tejaswini Pedapati and Pin-Yu Chen and Jianxi Gao},
booktitle={IEEE International Conference on Acoustics, Speech, and Signal Processing},
year={2025}
}
This repository is based on the CoOp and MaPLe repositories. If you find our released code useful, please consider citing these works as well.