- This repository contains the code necessary to replicate the paper "Efficient, Lightweight Multi-Frame Vision Language Model for Visual Question Answering in Autonomous Driving", which was accepted to the Vision & Language for Autonomous Driving & Robotics Workshop at CVPR 2024.
- Authors: Akshay Gopalkrishnan, Ross Greer, and Mohan Trivedi
If you find our code and research paper useful, please cite our paper as follows:
@article{gopalkrishnan2024multi,
title={Multi-Frame, Lightweight \& Efficient Vision-Language Models for Question Answering in Autonomous Driving},
author={Gopalkrishnan, Akshay and Greer, Ross and Trivedi, Mohan},
journal={arXiv preprint arXiv:2403.19838},
year={2024}
}
- Clone this repository
- In the repository directory, run `mkdir multi_frame_results`
- To replicate our environment, use the `env.yml` file we have provided. The following commands should create a proper environment:
conda env create -f env.yml
conda activate EM-VLM4AD
- You can download the model weights for the T5-Base and T5-Large-Q versions of EM-VLM4AD at the following links. Put the folder for each of these models into the `multi_frame_results` folder. Your directory should look like the following:
└── rootFolder
  ├── multi_frame_results/
    ├── T5-Medium/
      ├── latest_model.pth
    ├── T5-Large/
      ├── latest_model.pth
First download the train/val/test split here in your root folder. This will include data from the DriveLM dataset as well as the train/val/test splits we use for our experiments. The folder structure should now be as follows:
└── rootFolder
  ├── data/
    ├── multi_frame/
      ├── multi_frame_train.json
      ├── multi_frame_val.json
      ├── multi_frame_test.json
      ├── multi_frame_test_coco.json
      ├── image_id.json
    ├── QA_dataset_nus/
      ├── v1_0_train_nus.json
    ├── nuscenes/
      ├── samples/
  ├── multi_frame_results/
    ├── T5-Medium/
    ├── T5-Large/
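
Before training, it can help to confirm that the folder layout matches the tree above. The following is a minimal sketch, not part of the repository: the function name `missing_paths` and the two path lists are our own, copied from the tree, and the script only checks that the entries exist (it does not validate their contents).

```python
from pathlib import Path

# Paths copied from the directory tree above; adjust if your layout differs.
EXPECTED_FILES = [
    "data/multi_frame/multi_frame_train.json",
    "data/multi_frame/multi_frame_val.json",
    "data/multi_frame/multi_frame_test.json",
    "data/multi_frame/multi_frame_test_coco.json",
    "data/multi_frame/image_id.json",
    "data/QA_dataset_nus/v1_0_train_nus.json",
]
EXPECTED_DIRS = [
    "data/nuscenes/samples",
    "multi_frame_results/T5-Medium",
    "multi_frame_results/T5-Large",
]


def missing_paths(root="."):
    """Return the expected files and folders that are absent under root."""
    root = Path(root)
    missing = [f for f in EXPECTED_FILES if not (root / f).is_file()]
    missing += [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing
```

Calling `missing_paths(".")` from the root folder returns an empty list when every entry in the tree is present, and otherwise lists what still needs to be downloaded or moved.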
- To run training, run `python train.py --batch-size [BATCH_SIZE] --epochs [EPOCHS] --lm {T5-Base, T5-Large}`. For more information on other hyperparameters, such as loading checkpoints or altering the learning rate, weight decay, or hidden size for gated pooling attention, run `python train.py --help`.
- For inference to generate BLEU-4, CIDEr, METEOR, and ROUGE_L metrics for trained models, run `python eval.py --batch-size [BATCH_SIZE] --lm {T5-Base, T5-Large} --checkpoint-file [CHECKPOINT_FOLDER]`. For more information on other hyperparameters for different model configurations, run `python eval.py --help`.
- We use the pycocoevalcap library to generate the caption metrics we evaluate on. This library requires Java to be installed on your computer. We also recommend commenting out this line from the pycocoevalcap library to avoid generating SPICE metrics, which take longer to compute and do not work for multi-frame situations.
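
Since the caption metrics depend on Java, a quick check before running evaluation can save a failed run. This is a minimal sketch using only the Python standard library; the helper name `java_version` is our own and is not part of the repository or of pycocoevalcap.

```python
import shutil
import subprocess


def java_version():
    """Return the first line of `java -version`, or None if Java is unusable."""
    java = shutil.which("java")
    if java is None:
        return None  # no java executable on PATH
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # a launcher stub exists but no runtime is installed
    # `java -version` usually writes to stderr, so check it first.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else ""
```

If `java_version()` returns `None`, install a Java runtime before running `eval.py`.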
If you want to run our code on Google Colab, we have provided three different notebooks in the `colab` folder that can be used for training each model type and inference:
- `train_T5_Base.ipynb`: Allows for training EM-VLM4AD with the T5-Medium LM backbone.
- `train_T5_Large.ipynb`: Allows for training EM-VLM4AD with the quantized T5-Large LM backbone.
- Training hyperparameters are in the Hyperparameters section of the training Colab notebooks. These let you resume training from a checkpoint and choose whether the LM should be frozen during training.
- `eval.ipynb`: Generates BLEU-4, METEOR, CIDEr, and ROUGE_L metrics for a trained model. Make sure to specify the model checkpoint being evaluated and which LM backbone is being used ('T5-Medium' or 'T5-Large') in the Hyperparameters section.
- We recommend making a folder `DriveLM` in your Google Drive and uploading the model checkpoints and zipped data to this folder. An example directory should look like this:
└── DriveLM
  ├── data.zip
  ├── multi_frame_results/
    ├── T5-Medium/
    ├── T5-Large/
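
The zipped data has to be unpacked before the notebooks can read it. The provided notebooks may already handle this step; the sketch below only illustrates one way to do it with the standard library. The helper name `extract_data` and both path arguments are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def extract_data(zip_path, dest):
    """Extract the zipped dataset into dest and return its top-level entries."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

In Colab, after mounting Google Drive, a call might look like `extract_data("/content/drive/MyDrive/DriveLM/data.zip", "/content")`. Both paths are assumed here and depend on where your Drive is mounted and where the notebook expects the data.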