Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

¹* Sanjoy Chowdhury, ²* Sayan Nag, ³* Subhrajyoti Dasgupta, ⁴ Jun Chen, ⁴ Mohamed Elhoseiny, ¹ Ruohan Gao, ¹ Dinesh Manocha

¹ University of Maryland, ² University of Toronto, ³ Mila and Université de Montréal, ⁴ KAUST

Meerkat is an audio-visual LLM equipped with a fine-grained understanding of image 🖼️ and audio 🎵, both spatially 🪐 and temporally 🕒.

📰 Paper 🗃️ Dataset 🌐 Project Page 🧱 Code

Model Architecture 💡

Installation 🛠️

To install Meerkat, follow these steps:

# Clone the repository
git clone https://github.com/schowdhury671/meerkat.git

# Change to the meerkat directory
cd meerkat

# Install required packages
pip install -r requirements.txt

# Install ffmpeg (the command below is for yum-based distributions)
yum install ffmpeg -y

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
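
After installation, a quick sanity check can confirm that the pieces installed above are visible to Python. The following is a minimal sketch, not part of the Meerkat repository: the function name `missing_requirements` is hypothetical, and it only checks that the `ffmpeg` executable is on `PATH` and that the `apex` module can be located.

```python
import importlib.util
import shutil


def missing_requirements(tools=("ffmpeg",), modules=("apex",)):
    """Return labels for every tool or module that cannot be found.

    Hypothetical helper for illustration; not part of the Meerkat code base.
    """
    missing = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            missing.append(f"tool: {tool}")
    for module in modules:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(module) is None:
            missing.append(f"module: {module}")
    return missing
```

Calling `missing_requirements()` with no arguments returns an empty list when both `ffmpeg` and `apex` are available.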

Usage 🚀

Downloading dataset:
- Please download the dataset JSONs from here: Sharepoint Link

Dataset preprocessing:

Extract frames and audio from the videos.
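
The README does not say how the frames and audio should be extracted, so the following is only one possible sketch, driving `ffmpeg` from Python. The function names, the choice of a single frame per video, and the 16 kHz mono audio format are all assumptions for illustration, not settings confirmed by the Meerkat authors.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_commands(video, frames_dir, audios_dir, sample_rate=16000):
    """Build ffmpeg commands for one video (hypothetical helper).

    Assumptions: one frame per video, mono WAV audio at `sample_rate` Hz.
    """
    video = Path(video)
    frame_out = Path(frames_dir) / f"{video.stem}.jpg"
    audio_out = Path(audios_dir) / f"{video.stem}.wav"
    # -frames:v 1 stops after writing a single video frame.
    frame_cmd = ["ffmpeg", "-y", "-i", str(video), "-frames:v", "1", str(frame_out)]
    # -vn drops the video stream; -ac 1 and -ar set channel count and sample rate.
    audio_cmd = ["ffmpeg", "-y", "-i", str(video), "-vn",
                 "-ac", "1", "-ar", str(sample_rate), str(audio_out)]
    return frame_cmd, audio_cmd


def extract(video, frames_dir, audios_dir):
    """Create the output folders and run both ffmpeg commands."""
    Path(frames_dir).mkdir(parents=True, exist_ok=True)
    Path(audios_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_ffmpeg_commands(video, frames_dir, audios_dir):
        # check=True raises CalledProcessError if ffmpeg exits with an error.
        subprocess.run(cmd, check=True)
```

For example, `extract("clip.mp4", "data/<dataset>/frames", "data/<dataset>/audios")` would write `clip.jpg` and `clip.wav` into the two folders, with `<dataset>` replaced by the actual dataset name.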
Place the JSONs and data files in the following directory structure:

data
|--<dataset>
     |--<dataset>_train.json
     |--<dataset>_test.json
     |--frames
           |--<image>.jpg
           |-- ...
     |--audios
           |--<audio>.wav
           |-- ...
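
Before preprocessing, it can help to confirm that a dataset folder matches the layout above. The sketch below is a hypothetical helper, not part of the repository; it assumes the test file is named `<dataset>_test.json`, mirroring the train file, and it only checks that the two JSONs and the `frames` and `audios` folders exist.

```python
from pathlib import Path


def missing_layout_entries(data_root, dataset):
    """Return expected paths (relative to data_root) that do not exist.

    Hypothetical helper; the expected names follow the directory tree above,
    with the test file assumed to be <dataset>_test.json.
    """
    base = Path(data_root) / dataset
    expected = [
        base / f"{dataset}_train.json",
        base / f"{dataset}_test.json",
        base / "frames",
        base / "audios",
    ]
    return [str(p.relative_to(data_root)) for p in expected if not p.exists()]
```

An empty list from `missing_layout_entries("data", "<dataset>")` means all four entries were found.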

Transform supervised data to dataset:
```
python preprocess_data_supervised.py
```

Training:
- Execute the training script (you can specify the training parameters inside):
```
./train.sh
```
Inference:
- Execute the inference script (you can give any customized inputs inside):
```
./inference.sh
```

Results 📉

Acknowledgements 🙏

We would like to express our gratitude to the Macaw-LLM and GOT repositories for their valuable contributions to Meerkat.

Citation

@inproceedings{chowdhury2024meerkat,
      title={Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time},
      author={Chowdhury, Sanjoy and Nag, Sayan and Dasgupta, Subhrajyoti and Chen, Jun and Elhoseiny, Mohamed and Gao, Ruohan and Manocha, Dinesh},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2024}
}
