Paper: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.
Official PyTorch implementation and pre-trained models of VLMo.
- Dec, 2022: Code & model release.
- Sep, 2022: VLMo was accepted by NeurIPS 2022.
- May 30th, 2022: new version of VLMo paper on arXiv.
- Nov 24th, 2021: VLMo Large (single model) set a new state of the art on the VQA Challenge.
- Nov 2021: preprint released on arXiv.
We provide three VLMo weights pre-trained on COCO, VG, SBU and GCC. The models were pre-trained with 224x224 resolution.
- VLMo-base: #layer=12; hidden=768; FFN factor=4x; #head=12; patch=16x16; #VL_FFN=2 (#parameters: 175M)
- VLMo-base_plus: #layer=24; hidden=544; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 167M)
- VLMo-large: #layer=24; hidden=1024; FFN factor=4x; #head=16; patch=16x16; #VL_FFN=3 (#parameters: 562M)
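To make the listed hyperparameters easier to compare, the sketch below derives a few quantities from them: per-head dimension, FFN inner width, and the number of image patches at the 224x224 pre-training resolution. The `VLMoConfig` class and `CONFIGS` dictionary are illustrative names chosen here, not part of the VLMo code base; only the numbers come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VLMoConfig:
    """Hypothetical container for the hyperparameters listed above."""
    layers: int
    hidden: int
    ffn_factor: int
    heads: int
    patch: int
    vl_ffn_layers: int

    @property
    def head_dim(self) -> int:
        # Hidden size is split evenly across attention heads.
        return self.hidden // self.heads

    @property
    def ffn_dim(self) -> int:
        # Inner width of each feed-forward block.
        return self.hidden * self.ffn_factor

    def num_patches(self, image_size: int) -> int:
        # Non-overlapping square patches on a square image.
        side = image_size // self.patch
        return side * side


CONFIGS = {
    "VLMo-base": VLMoConfig(12, 768, 4, 12, 16, 2),
    "VLMo-base_plus": VLMoConfig(24, 544, 4, 16, 16, 3),
    "VLMo-large": VLMoConfig(24, 1024, 4, 16, 16, 3),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.head_dim, cfg.ffn_dim, cfg.num_patches(224))
```

With 16x16 patches, a 224x224 input yields 14 x 14 = 196 patches for all three models; the same method gives the patch count at any other resolution.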
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.8.0-cuda11.1-cudnn8-devel bash
First, clone the repo and install required packages:
git clone https://github.com/microsoft/unilm.git
cd unilm/vlmo
pip install -r requirements.txt
We process the pre-training and fine-tuning data into the same format as ViLT.
Replace <ARROW_ROOT> with your data directory in the following commands.
Download the pre-trained model weights from the BEiT repo.
# download from https://github.com/addf400/files/releases/download/v1.0/beit_base_patch16_224_pt22k_ft22kto1k.pth
export INIT_CKPT=/path/to/save/beit_base_checkpoint
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_textmlm_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>
Alternatively, you can download our pre-trained checkpoints for this stage:
export INIT_CKPT=/path/to/save/last_stage_ckpt
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_itc_base whole_word_masking=True step200k per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=$INIT_CKPT log_dir=<YOUR_OUTPUT_PATH>
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<VLMo_WEIGHT>" log_dir=<YOUR_OUTPUT_PATH>
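When launching many fine-tuning runs, it can help to assemble the command above programmatically. The following is a minimal sketch: `build_finetune_args` is a hypothetical helper, not part of the VLMo code base, and the paths, GPU count, and batch size in the usage example are placeholder values. It only reproduces the argument order shown in the command above.

```python
import shlex


def build_finetune_args(config_name, data_root, load_path, log_dir,
                        num_gpus, num_nodes, per_gpu_batchsize):
    """Return the fine-tuning command as an argument list (hypothetical helper)."""
    return [
        "python", "run.py", "with",
        f"data_root={data_root}",
        f"num_gpus={num_gpus}",
        f"num_nodes={num_nodes}",
        config_name,
        f"per_gpu_batchsize={per_gpu_batchsize}",
        f"load_path={load_path}",
        f"log_dir={log_dir}",
    ]


# Placeholder values for illustration only.
args = build_finetune_args(
    config_name="task_finetune_vqa_base_image480",
    data_root="/data/arrow",
    load_path="/ckpt/vlmo_base.pt",
    log_dir="/out/vqa_base",
    num_gpus=8,
    num_nodes=1,
    per_gpu_batchsize=4,
)
print(shlex.join(args))
```

The list form can be passed to `subprocess.run(args, check=True)` from inside the `unilm/vlmo` directory; `shlex.join` is used here only to print a copy-pasteable command.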
To reduce GPU memory cost, use DeepSpeed and activation checkpointing.
You can find "<CONFIG_NAME>" for each task as follows:
<CONFIG_NAME> | initialized checkpoint | finetuned weight | test-dev |
---|---|---|---|
task_finetune_vqa_base_image480 | VLMo-base | weight | 76.6 |
task_finetune_vqa_base_plus_image480 | VLMo-base_plus | weight | 78.5 |
task_finetune_vqa_large_image480 | VLMo-large | weight | 79.9 |
<CONFIG_NAME> | initialized checkpoint | finetuned weight | test-P |
---|---|---|---|
task_finetune_nlvr2_base_image384 | VLMo-base | weight | 83.3 |
task_finetune_nlvr2_base_plus_image384 | VLMo-base_plus | weight | 85.1 |
task_finetune_nlvr2_large_image384 | VLMo-large | weight | 86.9 |
<CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
---|---|---|---|---|
task_finetune_irtr_coco_base_image384 | VLMo-base | weight | 74.8 | 57.2 |
task_finetune_irtr_coco_base_plus_image384 | VLMo-base_plus | weight | 76.3 | 58.6 |
task_finetune_irtr_coco_large_image384 | VLMo-large | weight | 78.2 | 60.6 |
<CONFIG_NAME> | initialized checkpoint | finetuned weight | TR@1 | IR@1 |
---|---|---|---|---|
task_finetune_irtr_f30k_base_image384 | VLMo-base_coco_finetuned | weight | 92.3 | 79.3 |
task_finetune_irtr_f30k_base_plus_image384 | VLMo-base_plus | weight | 93.2 | 81.8 |
task_finetune_irtr_f30k_large_image384 | VLMo-large_coco_finetuned | weight | 95.3 | 84.5 |
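The config names in the tables above follow a regular pattern: `task_finetune_<task>_<size>_image<resolution>`, with VQA fine-tuned at 480 and the other tasks at 384. As a convenience, the sketch below rebuilds a name from its parts. The `config_name` function and the two lookup tables are hypothetical helpers written for this example; they cover only the twelve configs listed above.

```python
# Fine-tuning resolution per task, as read from the config names in the tables.
IMAGE_SIZE = {"vqa": 480, "nlvr2": 384, "irtr_coco": 384, "irtr_f30k": 384}
MODEL_SIZES = ("base", "base_plus", "large")


def config_name(task: str, size: str) -> str:
    """Rebuild a fine-tuning config name from task and model size (hypothetical helper)."""
    if task not in IMAGE_SIZE:
        raise ValueError(f"unknown task: {task!r}")
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    return f"task_finetune_{task}_{size}_image{IMAGE_SIZE[task]}"


print(config_name("vqa", "base_plus"))
print(config_name("irtr_f30k", "large"))
```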
To evaluate a fine-tuned model, append test_only=True and set load_path= to the fine-tuned VLMo weight, as follows:
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=1 "<CONFIG_NAME>" per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path="<Finetuned_VLMo_WEIGHT>" test_only=True
- For retrieval tasks, also set get_recall_metric=True in the command.
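The evaluation rules above can be sketched as a small helper that adds `test_only=True` and, for retrieval configs, `get_recall_metric=True`. `build_eval_args` is a hypothetical function, not part of the repository, and detecting retrieval tasks by the `_irtr_` substring is an assumption based on the config names listed in the retrieval tables. The path and numeric values in the usage lines are placeholders.

```python
def build_eval_args(config_name, data_root, weight_path,
                    num_gpus, per_gpu_batchsize):
    """Return the evaluation command as an argument list (hypothetical helper)."""
    args = [
        "python", "run.py", "with",
        f"data_root={data_root}",
        f"num_gpus={num_gpus}",
        "num_nodes=1",
        config_name,
        f"per_gpu_batchsize={per_gpu_batchsize}",
        f"load_path={weight_path}",
        "test_only=True",
    ]
    # Assumption: retrieval configs are the ones whose name contains "_irtr_".
    if "_irtr_" in config_name:
        args.append("get_recall_metric=True")
    return args


# Placeholder values for illustration only.
vqa_args = build_eval_args("task_finetune_vqa_base_image480",
                           "/data/arrow", "/ckpt/vqa_base.pt", 8, 4)
coco_args = build_eval_args("task_finetune_irtr_coco_base_image384",
                            "/data/arrow", "/ckpt/coco_base.pt", 8, 4)
print(" ".join(vqa_args))
print(" ".join(coco_args))
```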
This repository is built using the ViLT repository, BEiT repository, ALBEF and the timm library.
If you find this repository useful, please consider citing our work:
@inproceedings{vlmo,
title={{VLMo}: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts},
author={Hangbo Bao and Wenhui Wang and Li Dong and Qiang Liu and Owais Khan Mohammed and Kriti Aggarwal and Subhojit Som and Songhao Piao and Furu Wei},
booktitle={Advances in Neural Information Processing Systems},
year={2022},
url={https://openreview.net/forum?id=bydKs84JEyw}
}
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Microsoft Open Source Code of Conduct
For help or issues using VLMo models, please submit a GitHub issue.