Clone the repository and create the `dft` conda environment using the `environment.yml` file:
conda env create -f environment.yml
conda activate dft
Then download the spaCy language data by executing the following command:
python -m spacy download en
Note: Python 3.8 is required to run our code.
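To confirm the environment is ready, you can run a quick sanity check. This is a minimal sketch only; it assumes the spaCy version pinned in `environment.yml` still resolves the `en` shortcut installed by the command above (newer spaCy releases use the full name `en_core_web_sm` instead).

```python
import sys
import spacy

# The README states Python 3.8 is required to run the code.
print('Python version:', sys.version.split()[0])

# Load the English data installed via `python -m spacy download en`.
# The 'en' shortcut is the older spaCy (2.x) name for the English pipeline.
nlp = spacy.load('en')
print('Loaded spaCy pipeline:', nlp.meta.get('lang'), nlp.meta.get('name'))
```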
To run the code, annotations and visual features for the COCO dataset are needed. Please download the annotations file annotations.zip (Extraction code: ska0) and extract it.
To reproduce our results, please generate the corresponding feature files (COCO2014_RN50x4_GLOBAL.hdf5 and COCO2014_VinVL.hdf5) using the code in the tools folder. In these files, the features of each image are stored under the `<image_id>_features` key, where `<image_id>` is the id of the COCO image without leading zeros (e.g. the `<image_id>` for COCO_val2014_000000037209.jpg is 37209). VinVL region features have dimension (N, 2048), where N is the number of region features; CLIP grid features have dimension (M, 2560), where M is the number of grid features.
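As a quick way to verify that the generated files follow this layout, here is a minimal sketch using h5py. The file names match those above, while the file locations and the example image id are placeholders to adjust to your own setup.

```python
import h5py

# Example image id: COCO_val2014_000000037209.jpg -> 37209 (leading zeros stripped).
image_id = 37209

# VinVL region features, expected shape (N, 2048).
with h5py.File('COCO2014_VinVL.hdf5', 'r') as f:
    vinvl = f[f'{image_id}_features'][()]
    print('VinVL region features:', vinvl.shape)

# CLIP RN50x4 grid features, expected shape (M, 2560).
with h5py.File('COCO2014_RN50x4_GLOBAL.hdf5', 'r') as f:
    clip = f[f'{image_id}_features'][()]
    print('CLIP grid features:', clip.shape)
```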
To reproduce the results reported in our paper, download the pretrained model file dft.pth (Extraction code: ska0) and place it in the code folder.
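If you want to inspect the downloaded checkpoint before running training or evaluation, the following sketch may help. It assumes dft.pth is a standard PyTorch checkpoint; the exact keys stored inside depend on how train.py saves its checkpoints.

```python
import torch

# Load the pretrained checkpoint on CPU and list its top-level contents.
checkpoint = torch.load('dft.pth', map_location='cpu')
if isinstance(checkpoint, dict):
    print('Checkpoint keys:', list(checkpoint.keys()))
else:
    print('Checkpoint object of type:', type(checkpoint))
```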
Run `python train.py` using the following arguments:
Argument | Possible values |
---|---|
--output | Output path |
--exp_name | Experiment name |
--batch_size | Batch size (default: 20) |
--workers | Number of workers (default: 8) |
--warmup | Warmup value for learning rate scheduling (default: 10000) |
--N_enc | Number of encoder layers |
--N_dec | Number of decoder layers |
--resume_last | If used, the training will be resumed from the last checkpoint |
--resume_best | If used, the training will be resumed from the best checkpoint |
--use_rl | Whether to turn on reinforcement learning |
--clip_path | CLIP grid feature path |
--vinvl_path | VinVL region feature path |
--features_path | Path to detection features file |
--annotation_folder | Path to folder with COCO annotations |
For example, to train our model with the parameters used in our experiments, use
python train.py --exp_name dft --batch_size 20 --clip_path /path/to/clip_grid_features --vinvl_path /path/to/vinvl_region_features --annotation_folder /path/to/annotations
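To continue an interrupted run, the same command can be combined with the resume flags from the table above, for example:
python train.py --exp_name dft --batch_size 20 --clip_path /path/to/clip_grid_features --vinvl_path /path/to/vinvl_region_features --annotation_folder /path/to/annotations --resume_last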
[1] Cornia M, Stefanini M, Baraldi L, et al. Meshed-Memory Transformer for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[2] Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML), 2021.
[3] Zhang P, Li X, Hu X, et al. VinVL: Revisiting Visual Representations in Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Thanks to Cornia et al. for their open-source code (meshed-memory-transformer), on which our implementation is based.
Thanks to Zhang et al. for the powerful region features (VinVL).