Skip to content

schowdhury671/meerkat

Repository files navigation

Logo

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

1* Sanjoy Chowdhury, 2* Sayan Nag, 3* Subhrajyoti Dasgupta, 4Jun Chen,

4 Mohamed Elhoseiny, 1 Ruohan Gao, 1 Dinesh Manocha

1 University of Maryland, 2 University of Toronto, 3 Mila and Université de Montréal, 4 KAUST

Meerkat is an audio-visual LLM equipped with a fine-grained understanding of image 🖼️ and audio 🎵, both spatially 🪐 and temporally 🕒.

📰 Paper 🗃️ Dataset 🌐 Project Page 🧱 Code

Table of Contents 📚

Model Architecture 💡

Model Architecture

Installation 🛠️

To install Meerkat, follow these steps:

# Clone the repository
git clone https://github.com/schowdhury671/meerkat.git

# Change to the Macaw-LLM directory
cd meerkat

# Install required packages
pip install -r requirements.txt

# Install ffmpeg
yum install ffmpeg -y

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..

Usage 🚀

  1. Downloading dataset:

  2. Dataset preprocessing:

    • Extract frames and audio from videos
    • The JSONs and data files should be placed following the directory structure:
    data
    |--<dataset>
         |--<dataset>_train.json 
         |--<dataset_test>.json
         |--frames
               |--<image>.jpg
               |-- ...
         |--audios
               |--<audio>.wav
               |-- ...
    
    • Transform supervised data to dataset:
      python preprocess_data_supervised.py
      
  3. Training:

    • Execute the training script (you can specify the training parameters inside):
      ./train.sh
      
  4. Inference:

    • Execute the inference script (you can give any customized inputs inside):
      ./inference.sh
      

Results 📉

Results

Acknowledgements 🙏

We would like to express our gratitude to the Macaw-LLM and GOT repositories for their valuable contributions to Meerkat.

Citation

@inproceedings{chowdhury2024meerkat,
      title={Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time},
      author={Chowdhury, Sanjoy and Nag, Sayan and Dasgupta, Subhrajyoti and Chen, Jun and Elhoseiny, Mohamed and Gao, Ruohan and Manocha, Dinesh},
      journal={European Conference on Computer Vision (ECCV)},
      year={2024}
}