Source code for Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding.
This repository is built on the paper A Visual Attention Grounding Neural Model for Multimodal Machine Translation and its open-source PyTorch implementations (by zmykevin and Eurus-Holmes). We thank the authors for these efforts.
-
Preprocessed Data of Multi30K and AmbiguousCOCO
Download from [baidu] (Password: ovc0).
NOTE:
- The object-level visual features are very large (≈12 GB).
- We therefore suggest downloading the original Multi30K/AmbiguousCOCO datasets and extracting the visual object features from a pre-trained Faster R-CNN yourself, either with the bottom-up-attention project or with our notebook bta_vision_extract.ipynb, which additionally filters out objects predicted with low object-category probabilities (a minimal filtering sketch follows this list).
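The probability-based filtering mentioned above can be sketched as follows. This is not the exact code in bta_vision_extract.ipynb; the array names, the threshold value, and the assumption that column 0 holds the background class are all illustrative.

```python
import numpy as np

THRESH = 0.2  # assumed probability cutoff; tune to your setting

def filter_objects(features: np.ndarray, cls_probs: np.ndarray, thresh: float = THRESH):
    """features: (num_objects, feat_dim) RoI features from Faster R-CNN.
    cls_probs: (num_objects, num_classes) softmax class probabilities."""
    # Confidence of the best non-background class for each detected object
    # (column 0 is assumed to be the background class).
    best_prob = cls_probs[:, 1:].max(axis=1)
    keep = best_prob >= thresh
    return features[keep], best_prob[keep]

# Example: 36 detected objects with 2048-d features and 1601 predicted classes.
feats = np.random.rand(36, 2048).astype(np.float32)
probs = np.random.dirichlet(np.ones(1601), size=36).astype(np.float32)
kept_feats, kept_probs = filter_objects(feats, probs)
print(kept_feats.shape, kept_probs.shape)
```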
-
Data for Similarity Searching in scripts/raw_data
Download from [baidu] (Password: ovc0).
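For readers unfamiliar with the similarity-searching step, the sketch below shows one generic way to retrieve the most similar sentence from a text corpus with TF-IDF vectors and cosine similarity. It is only an illustration of the idea; the representation and retrieval actually used on the data in scripts/raw_data may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query; replace with sentences loaded from scripts/raw_data.
corpus = ["a man rides a bike", "two dogs play in the park", "a woman rides a horse"]
query = ["a man rides a horse"]

vectorizer = TfidfVectorizer()
corpus_vecs = vectorizer.fit_transform(corpus)
query_vec = vectorizer.transform(query)

# Rank corpus sentences by cosine similarity to the query and keep the best match.
scores = cosine_similarity(query_vec, corpus_vecs)[0]
best = scores.argmax()
print(best, corpus[best], scores[best])
```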
Run the scripts as follows:
# for training
. run_ovc_training.sh
# for evaluation after training
. run_ovc_evaluation.sh
If you use the source code in your work, please cite the corresponding paper. The BibTeX entry is listed below:
@inproceedings{wang2021efficient,
  title={Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding},
  author={Wang, Dexin and Xiong, Deyi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={35},
  number={4},
  pages={2720--2728},
  year={2021}
}