The code for our paper, Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning (WACV 2023). The methods implemented here provide a switch to discriminative image captioning: given off-the-shelf captioning models trained with reinforcement learning, they enable the models to describe characteristic details of input images with only lightweight fine-tuning.
The code is based on self-critical.pytorch. We thank the authors of that repository, the original neuraltalk2, and the awesome PyTorch team.
```bash
git clone https://github.com/ukyh/switch_disc_caption.git
cd switch_disc_caption
git submodule update --init --recursive
```
```bash
conda create --name switch_disc_cap python=3.6
conda activate switch_disc_cap
pip install -r requirements.txt
```
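As an optional sanity check of the environment, a small snippet like the following confirms that Python and PyTorch are visible; the exact versions are pinned by `requirements.txt`:

```python
# Optional environment sanity check; exact versions come from requirements.txt.
import sys
import torch

print("python:", sys.version.split()[0])
print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
```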
- Follow the instructions in `data/README.md` to download and preprocess the data.
- Follow the instructions in `coco-caption/README.md` to download the evaluation tools.
- Download pre-trained models from MODEL_ZOO.md. We used `Att2in+self_critical` (`att2in_scst`), `UpDown+self_critical` (`updown_scst`), and `Transformer+self_critical` (`trans_scst`) for the experiments in our paper. To run `expt_scripts`, the downloaded models have to be placed as follows (a small layout check is sketched after the tree):
  ```
  ./saved_models/
  ├── att2in_scst/
  │   ├── model-best.pth
  │   └── infos_a2i2_sc-best.pkl
  ├── updown_scst/
  │   ├── model-best.pth
  │   └── infos_tds_sc-best.pkl
  └── trans_scst/
      ├── model-best.pth
      └── infos_trans_scl-best.pkl
  ```
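A minimal sketch to check this layout before running the scripts; the directory and file names below are the ones listed above, so adjust them if you place the models elsewhere:

```python
# Checks that each pre-trained model directory contains the expected files.
import os

EXPECTED = {
    "att2in_scst": ["model-best.pth", "infos_a2i2_sc-best.pkl"],
    "updown_scst": ["model-best.pth", "infos_tds_sc-best.pkl"],
    "trans_scst": ["model-best.pth", "infos_trans_scl-best.pkl"],
}

for model_dir, files in EXPECTED.items():
    for name in files:
        path = os.path.join("saved_models", model_dir, name)
        print("OK  " if os.path.isfile(path) else "MISS", path)
```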
- (Optional: not necessary if you just want to try our fine-tuning.) If you want to train RL models in this repo, build the cache for computing the CIDEr score:

  ```bash
  python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train
  ```
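To confirm the cache was written, a rough sketch like the one below loads the resulting pickle. The file name is an assumption (the script may write several pickles under the `data/coco-train` prefix); adjust it to whatever `prepro_ngrams.py` actually produced:

```python
# Rough check that the CIDEr n-gram cache exists and can be unpickled.
# The file name below is an assumption; adjust it to the actual output.
import os
import pickle

cache_path = "data/coco-train-idxs.p"
if os.path.isfile(cache_path):
    with open(cache_path, "rb") as f:
        cache = pickle.load(f)
    print("loaded", cache_path, "->", type(cache))
else:
    print("not found:", cache_path)
```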
Run `sh expt_scripts/[SELECT_SCRIPT].sh`. It produces a fine-tuned model under `saved_models` and `.json` output files (on the MS COCO Karpathy val/test splits) under `eval_results`.
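To take a quick look at the generated captions, a small sketch like the following can be used. It assumes each entry in the output `.json` carries `image_id` and `caption` fields, which is the usual format in self-critical.pytorch-based code; check your own files to be sure. The script name is only an example:

```python
# Prints the first few generated captions from an output file, e.g.:
#   python peek_captions.py eval_results/FILE_NAME.json
import json
import sys

with open(sys.argv[1]) as f:
    entries = json.load(f)

print(len(entries), "captions")
for entry in entries[:5]:
    print(entry.get("image_id"), "->", entry.get("caption"))
```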
We have released the fine-tuned models and output files here.
Evaluation uses the output files under `eval_results`.
Use the following repositories/scripts to evaluate each metric.
NOTE: DO NOT use the files starting with `tmpeval_`, as the decoding methods of those outputs (beam size and BP decoding) are not specified correctly.
- `CIDEr`, `SPICE`, `CLIPScore`, `RefCLIPScore`: https://github.com/ukyh/clipscore_cocout.git
- `R@K`: https://github.com/ukyh/vsepp_cocout.git
- `TIGEr`: https://github.com/ukyh/tiger_cocout.git
- `improved BERTScore`: https://github.com/ukyh/bertspp_cocout.git
- `Unique-1/S`, `Length`, `Repetition`: `python stats_vocab.py eval_results/FILE_NAME.json` (a batch-run sketch follows this list)
- `OOR`: `python stats_oor.py eval_results/FILE_NAME.json`
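If you want to run the vocabulary statistics over every output file at once, a hypothetical helper like the one below simply loops over `eval_results/` and skips the `tmpeval_` files noted above:

```python
# Hypothetical batch runner for stats_vocab.py over all output files,
# skipping the tmpeval_ files whose decoding settings are not specified correctly.
import glob
import os
import subprocess

for path in sorted(glob.glob("eval_results/*.json")):
    if os.path.basename(path).startswith("tmpeval_"):
        continue
    subprocess.run(["python", "stats_vocab.py", path], check=True)
```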
If you find this repo useful, please consider citing (no obligation at all):
```bibtex
@inproceedings{honda2023switch,
  title={Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning},
  author={Honda, Ukyo and Watanabe, Taro and Matsumoto, Yuji},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2023}
}
```
```bibtex
@article{luo2018discriminability,
  title={Discriminability objective for training descriptive captions},
  author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
  journal={arXiv preprint arXiv:1803.04376},
  year={2018}
}
```