This project implements the WSDM 2023 paper "AGREE: Aligning Cross-Modal Entities for Image-Text Retrieval Upon Vision-Language Pre-trained Models". Our code is based on PyTorch.

AGREE is a lightweight and practical approach to align cross-modal entities for image-text retrieval upon VLP models only at the fine-tuning and re-ranking stages. We employ external knowledge and tools to construct extra fine-grained image-text pairs, and then emphasize cross-modal entity alignment through contrastive learning and entity-level mask modeling in fine-tuning. Besides, two re-ranking strategies are proposed, including one specially designed for zero-shot scenarios.

We choose CLIP as the VLP model to demonstrate our fine-tuning and re-ranking stages in this repo.

Data preparation

We use the popular public benchmark dataset Flickr30k for evaluation. You can download it directly from the official website. We also provide pre-processed data for convenience.

  • Download the preprocessed training and validation Flickr30k data by:

    if [ ! -f ./tmp/datasets/flickr30k_images.tgz ]; then
        wget -P ./tmp/datasets
        tar zxvf ./tmp/datasets/flickr30k_images.tgz -C ./tmp/datasets
    fi

  • The pre-processed training data contains our pre-extracted textual entities and the grounded textual entities with their predicted probabilities. Each item in the "flickr30k_train_pred.jsonl" file has the following format:

    {"image": "flickr30k-images/10002456.jpg", "caption": "Four men on top of a tall structure.", "image_id": 1, "segs": ["four men", "top of a tall structure", "top", "four men on top of a tall structure.", "a tall structure"], "preds": [{"men": 0.6230936527252198}, {"object": 0.6065886701856341}, {"top": 0.6251628398895264}, {"a tall structure": 0.5054895877838135}]}
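As a minimal sketch of how such a record could be read, the snippet below parses the example item above and flattens its "preds" list (a list of single-key dictionaries) into (entity, probability) pairs. The helper name `grounded_entities` and the `min_prob` threshold are hypothetical and only for illustration; they are not part of this repo.

```python
import json

# One record copied from the example above (flickr30k_train_pred.jsonl format).
line = (
    '{"image": "flickr30k-images/10002456.jpg", '
    '"caption": "Four men on top of a tall structure.", "image_id": 1, '
    '"segs": ["four men", "top of a tall structure", "top", '
    '"four men on top of a tall structure.", "a tall structure"], '
    '"preds": [{"men": 0.6230936527252198}, {"object": 0.6065886701856341}, '
    '{"top": 0.6251628398895264}, {"a tall structure": 0.5054895877838135}]}'
)


def grounded_entities(record, min_prob=0.0):
    """Flatten "preds" into (entity, probability) pairs.

    min_prob is a hypothetical filter added for illustration only.
    """
    pairs = []
    for pred in record["preds"]:
        for entity, prob in pred.items():
            if prob >= min_prob:
                pairs.append((entity, prob))
    return pairs


record = json.loads(line)
entities = grounded_entities(record, min_prob=0.6)

print(record["caption"])
print(record["segs"])   # pre-extracted textual entities
print(entities)         # grounded entities that pass the threshold
```

With a full file, the same function could be applied to each line of "flickr30k_train_pred.jsonl" in turn.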

We also provide preprocessed and re-organized visual entity data from Visual Genome, including image arrays and their mapping to visual entities. The original Visual Genome data comes from URL.

  • Prepare the preprocessed data of visual entities by:

    if [ ! -f ./tmp/VG.tgz ]; then
        wget -P ./tmp
        tar zxvf ./tmp/VG.tgz -C ./tmp
    fi

We also provide images in which the grounded textual entities are masked. We use GLIP for visual grounding.

  • Prepare the masked data of grounded textual entities by:

    if [ ! -f ./tmp/datasets/flickr30k_images/flickr30k_visual_grounding.224.npz ]; then
        wget -P ./tmp/datasets/flickr30k_images/
    fi

Model Preparation

  • Download CLIP pretrained checkpoint (ViT-L/14 in this repo):

    if [ ! -f ./tmp/pretrained_models/ ]; then
        wget -P ./tmp/pretrained_models
    fi


  • Training

    python3 -u training/ \
        --save-frequency 2 \
        --report-to tensorboard \
        --train-data="${DATAPATH}/annotation/flickr30k_train_pred.jsonl"  \
        --val-data="${DATAPATH}/annotation/flickr30k_val.json"  \
        --img-key image \
        --caption-key caption \
        --dataset-type json \
        --is-mask \
        --is-prompt \
        --is-da-loss \
        --is-da-mask \
        --is-vg \
        --dist-url="tcp://" \
        --warmup 10000 \
        --batch-size=8 \
        --lr=1e-5 \
        --wd=0.1 \
        --epochs=50 \
        --workers=0 \
        --model ViT-L/14

  • Feature Extraction

    Replace the resume path and experiment name with those of your fine-tuned model.

    if [ ! -d "$SAVEFOLDER" ]; then
        mkdir -p "$SAVEFOLDER"
    fi
    echo "=========== Feature Extraction ==========="
    python3 -u eval/ \
        --extract-image-feats \
        --extract-text-feats \
        --image-data="${DATAPATH}/annotation/flickr30k_test_images.npz" \
        --text-data="${DATAPATH}/annotation/flickr30k_test_texts.jsonl" \
        --img-batch-size=32 \
        --text-batch-size=32 \
        --resume="${experiment_name}/checkpoints/" \
        --image-feat-output-path="${SAVEFOLDER}/flickr30k_test_images.img_feat.jsonl" \
        --text-feat-output-path="${SAVEFOLDER}/flickr30k_test_texts.txt_feat.jsonl" \
        --model ViT-L-14

    The output feature files are saved in the user-defined $SAVEFOLDER in the following format ("image_id" for image features and "query_id" for text features):

    {"image_id": "9", "feature": [0.038518026471138,....]}
    {"query_id": 8, "feature": [-0.014490452595055103,....]}
  • Predict

    Note that the prediction procedure should be executed after extracting features.

    python3 -u eval/ \
        --image-feats="${SAVEFOLDER}/flickr30k_test_images.img_feat.jsonl" \
        --text-feats="${SAVEFOLDER}/flickr30k_test_texts.txt_feat.jsonl" \
        --top-k=10 \
        --eval-batch-size=32 \
        --output-images="${SAVEFOLDER}/test_images_predictions.jsonl"

  • Re-ranking

    For the fine-tuned results, as reported in the paper, we use only the TBR (Text-Image Bi-directional Re-ranking) module for re-ranking.

    python3 -u eval/ "${DATAPATH}/annotation/flickr30k_test_images.jsonl" \
    "${SAVEFOLDER}/test_images_predictions.jsonl" \
    "${DATAPATH}/annotation/flickr30k_test_texts.jsonl" \
    "${SAVEFOLDER}/test_texts_predictions.jsonl" \
    "${SAVEFOLDER}/test_images_predictions_rerank.jsonl" \
    "${SAVEFOLDER}/test_texts_predictions_rerank.jsonl"

    The commands will output prediction files after re-ranking.

  • Evaluation

    python3 -u eval/ \
    "${DATAPATH}/annotation/flickr30k_test_images.jsonl" \
    "${SAVEFOLDER}/test_images_predictions_rerank.jsonl" \
    "${SAVEFOLDER}/image_output_rerank.json"

    The evaluation procedure reads the predictions and compares them with the ground truth. Evaluation results are saved in the user-defined files "text_output_rerank.json" and "image_output_rerank.json" in the following format:

    {"success": true, "score": 95.93333333333334, "scoreJson": {"score": 95.93333333333334, "mean_recall": 95.93333333333334, "r1": 89.9, "r5": 98.4, "r10": 99.5}}
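In the sample output above, "mean_recall" equals the average of "r1", "r5" and "r10": (89.9 + 98.4 + 99.5) / 3 ≈ 95.93. The sketch below assumes that the score is computed this way and that a query counts as a hit when any ground-truth item appears in its top k results. The function names and the dictionary layout of predictions and ground truth are hypothetical; this is an illustration, not the repo's evaluation script.

```python
def recall_at_k(predictions, ground_truth, k):
    """Percentage of queries with at least one correct item in the top k."""
    hits = 0
    for query_id, ranked_ids in predictions.items():
        if set(ranked_ids[:k]) & ground_truth[query_id]:
            hits += 1
    return 100.0 * hits / len(predictions)


def summarize(predictions, ground_truth):
    """Build a result dict shaped like the "scoreJson" field shown above."""
    r1, r5, r10 = (recall_at_k(predictions, ground_truth, k) for k in (1, 5, 10))
    mean_recall = (r1 + r5 + r10) / 3
    return {"score": mean_recall, "mean_recall": mean_recall,
            "r1": r1, "r5": r5, "r10": r10}


# Toy data with made-up ids: ranked candidates per query and the correct items.
predictions = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
}
ground_truth = {"q1": {"a"}, "q2": {"e"}}

result = summarize(predictions, ground_truth)
print(result)

# Consistency check against the sample output in this README.
sample_mean = (89.9 + 98.4 + 99.5) / 3
print(round(sample_mean, 4))
```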

Zero-shot Re-ranking

Here we provide examples of zero-shot re-ranking on pre-trained CLIP (ViT-L/14) with the AGREE re-ranking procedures, including feature extraction, prediction, re-ranking and evaluation.

  • Feature Extraction

    The features of the masked and prompted textual entities are also extracted for use in the re-ranking procedures.

    python3 -u eval/ \
        --extract-image-feats \
        --extract-text-feats \
        --extract-mask-feats \
        --extract-prompt-feats \
        --image-data="${DATAPATH}/annotation/flickr30k_test_images.npz" \
        --text-data="${DATAPATH}/annotation/flickr30k_test_texts.jsonl" \
        --prompt-data="${DATAPATH}/annotation/flickr30k_test_texts_segs.jsonl" \
        --img-batch-size=32 \
        --text-batch-size=32 \
        --resume="${MODELPATH}/" \
        --image-feat-output-path="${SAVEFOLDER}/flickr30k_test_imgs.224.img_feat.jsonl" \
        --text-feat-output-path="${SAVEFOLDER}/flickr30k_test_queries.txt_feat.jsonl" \
        --mask-feat-output-path="${SAVEFOLDER}/flickr30k_test_queries.mask_feat.jsonl" \
        --prompt-feat-output-path="${SAVEFOLDER}/flickr30k_test_queries.prompt_feat.jsonl" \
        --model ViT-L-14

    The textual entities are pre-extracted in the file "flickr30k_test_texts_segs.jsonl" for the EGR (Textual Entity-Guided Re-ranking) module.

  • Predict

    python3 -u eval/ \
        --image-feats="${SAVEFOLDER}/flickr30k_test_imgs.224.img_feat.jsonl" \
        --text-feats="${SAVEFOLDER}/flickr30k_test_queries.txt_feat.jsonl" \
        --mask-feats="${SAVEFOLDER}/flickr30k_test_queries.mask_feat.jsonl" \
        --prompt-feats="${SAVEFOLDER}/flickr30k_test_queries.prompt_feat.jsonl" \
        --top-k=20 \
        --eval-batch-size=32 \
        --output-images="${SAVEFOLDER}/test_images_predictions_mask+prompt.jsonl"

  • TBR Re-ranking and Evaluation

    echo "=========== TBR Re-ranking ==========="
    python3 -u eval/ "${DATAPATH}/annotation/flickr30k_test_images.jsonl" \
    "${SAVEFOLDER}/test_images_predictions_mask+prompt.jsonl" \
    "${DATAPATH}/annotation/flickr30k_test_texts.jsonl" \
    "${SAVEFOLDER}/test_texts_predictions_mask+prompt.jsonl" \
    "${SAVEFOLDER}/test_images_predictions_mask+prompt_rerank.jsonl" \
    "${SAVEFOLDER}/test_texts_predictions_mask+prompt_rerank.jsonl"
    echo "=========== Evaluation ==========="
    python3 -u eval/ "${DATAPATH}/annotation/flickr30k_test_texts.jsonl" \
    "${SAVEFOLDER}/test_texts_predictions_mask+prompt_rerank.jsonl" \
    "${SAVEFOLDER}/text_output_mask+prompt_rerank.json"
    python3 -u eval/ "${DATAPATH}/annotation/flickr30k_test_images.jsonl" \
    "${SAVEFOLDER}/test_images_predictions_mask+prompt_rerank.jsonl" \
    "${SAVEFOLDER}/image_output_mask+prompt_rerank.json"

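To illustrate what the Predict step conceptually does with the extracted feature files, here is a minimal sketch that loads records in the feature format shown earlier ("image_id" or "query_id" plus "feature") and ranks images for each text query. It assumes features are compared by dot product, which equals cosine similarity when the vectors are L2-normalized; that assumption, the helper names, and the returned dictionary layout are illustrative and do not describe the repo's prediction script or its output file format.

```python
import json


def load_feats(lines, id_key):
    """Map each record id to its feature vector."""
    feats = {}
    for line in lines:
        record = json.loads(line)
        feats[record[id_key]] = record["feature"]
    return feats


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_images(text_feats, image_feats, k):
    """For every query, return the ids of the k most similar images."""
    ranking = {}
    for query_id, query_vec in text_feats.items():
        scored = sorted(image_feats.items(),
                        key=lambda item: dot(query_vec, item[1]),
                        reverse=True)
        ranking[query_id] = [image_id for image_id, _ in scored[:k]]
    return ranking


# Toy 2-dimensional features (made up); real CLIP features are much longer.
image_lines = [
    '{"image_id": "9", "feature": [1.0, 0.0]}',
    '{"image_id": "10", "feature": [0.0, 1.0]}',
    '{"image_id": "11", "feature": [0.6, 0.6]}',
]
text_lines = ['{"query_id": 8, "feature": [0.9, 0.1]}']

image_feats = load_feats(image_lines, "image_id")
text_feats = load_feats(text_lines, "query_id")
ranking = top_k_images(text_feats, image_feats, k=2)
print(ranking)
```

Swapping the roles of the two feature sets gives the image-to-text direction in the same way.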

  • Learning Transferable Visual Models From Natural Language Supervision. [paper][website]
  • From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. [paper][website]
  • Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. [paper][website]
  • Grounded Language-Image Pre-training. [paper][website]


Our implementation of AGREE builds on OpenAI's CLIP and its open-source implementation OpenCLIP. The visual grounding parts are based on GLIP. We thank the original authors for open-sourcing their excellent work.