Clone the repository and create the dft conda environment using the environment.yml file:
conda env create -f environment.yml
conda activate dft
Then download spacy data by executing the following command:
python -m spacy download en
Note: Python 3.8 is required to run our code.
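As a quick sanity check after activating the environment, the interpreter version can be verified before running anything else. This is a minimal sketch; the helper name `is_supported` is hypothetical and not part of the repository.

```python
import sys

# Required major.minor version, taken from the note above.
REQUIRED = (3, 8)


def is_supported(version_info=sys.version_info):
    """Return True when the major.minor version matches REQUIRED."""
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    if not is_supported():
        found = "{}.{}".format(sys.version_info[0], sys.version_info[1])
        print("Warning: expected Python 3.8, found " + found)
```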
To run the code, annotations and visual features for the COCO dataset are needed. Please download the annotations file (Extraction code: ska0) and extract it.
To reproduce our results, please generate the corresponding feature files (COCO2014_RN50x4_GLOBAL.hdf5, COCO2014_VinVL.hdf5) using the code in the tools folder. The features of each image are stored under the <image_id>_features key, where <image_id> is the id of the COCO image without leading zeros (e.g. the <image_id> for COCO_val2014_000000037209.jpg is 37209). VinVL region features have shape (N, 2048), where N is the number of region features; CLIP grid features have shape (M, 2560), where M is the number of grid features.
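The key convention described above can be sketched as follows. The helper names `feature_key` and `load_features` are hypothetical (they are not taken from the tools folder), and the sketch assumes the feature files are read with the third-party h5py package.

```python
import os
import re


def feature_key(filename):
    """Map a COCO 2014 file name to its '<image_id>_features' key."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    # The image id is the trailing run of digits in the file name.
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError("no trailing image id in " + repr(filename))
    # int() drops the leading zeros: '000000037209' becomes 37209.
    return "{}_features".format(int(match.group(1)))


def load_features(hdf5_path, filename):
    """Read one image's features from a feature file (assumes h5py is installed)."""
    import h5py  # third-party; imported lazily so feature_key works without it

    with h5py.File(hdf5_path, "r") as f:
        # Expected shapes: (N, 2048) for VinVL regions, (M, 2560) for CLIP grids.
        return f[feature_key(filename)][()]
```

For instance, `feature_key("COCO_val2014_000000037209.jpg")` gives `"37209_features"`, matching the example in the text.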
To reproduce the results reported in our paper, download the pretrained model file dft.pth (Extraction code: ska0) and place it in the code folder.
Run python using the following arguments:
Argument | Possible values |
--output | Output path |
--exp_name | Experiment name |
--batch_size | Batch size (default: 20) |
--workers | Number of workers (default: 8) |
--warmup | Warmup value for learning rate scheduling (default: 10000) |
--N_enc | Number of encoder layers |
--N_dec | Number of decoder layers |
--resume_last | If used, the training will be resumed from the last checkpoint. |
--resume_best | If used, the training will be resumed from the best checkpoint. |
--use_rl | Whether to turn on reinforcement learning |
--clip_path | CLIP grid feature path |
--vinvl_path | VinVL region feature path |
--features_path | Path to detection features file |
--annotation_folder | Path to folder with COCO annotations |
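For orientation, the arguments listed above could be declared with argparse roughly as shown here. This is an illustrative sketch rather than the repository's actual parser: the argument types, the use of store_true for the resume and reinforcement learning switches, and the function name `build_parser` are assumptions. Only the three defaults stated in the table are filled in.

```python
import argparse


def build_parser():
    """Declare the documented command-line arguments (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="DFT training arguments")
    parser.add_argument("--output", type=str, help="Output path")
    parser.add_argument("--exp_name", type=str, help="Experiment name")
    # Defaults below are the ones stated in the argument table.
    parser.add_argument("--batch_size", type=int, default=20)
    parser.add_argument("--workers", type=int, default=8)
    parser.add_argument("--warmup", type=int, default=10000)
    # The table gives no defaults for the layer counts, so none are set here.
    parser.add_argument("--N_enc", type=int, help="Number of encoder layers")
    parser.add_argument("--N_dec", type=int, help="Number of decoder layers")
    # Assumption: these three are boolean switches.
    parser.add_argument("--resume_last", action="store_true")
    parser.add_argument("--resume_best", action="store_true")
    parser.add_argument("--use_rl", action="store_true")
    parser.add_argument("--clip_path", type=str, help="CLIP grid feature path")
    parser.add_argument("--vinvl_path", type=str, help="VinVL region feature path")
    parser.add_argument("--features_path", type=str)
    parser.add_argument("--annotation_folder", type=str)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```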
For example, to train our model with the parameters used in our experiments, use
python --exp_name dft --batch_size 20 --clip_path /path/to/clip_grid_features --vinvl_path /path/to/vinvl_region_features --annotation_folder /path/to/annotations
[1] Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[2] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
[3] Zhang P, Li X, Hu X, et al. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
We thank Cornia et al. for their open-source code (meshed-memory-transformer), on which our implementation is based.
Thanks to Zhang et al. for the powerful region features (VinVL).