Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding [NAACL 2025]
Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui
We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). It selectively retains task-relevant information across the temporal dimension, ensuring that critical details are preserved while video and audio content is processed. TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM makes better use of the model's limited capacity and strengthens its temporal modeling. This plug-and-play module can be easily integrated into existing MFMs. With TWM, nine state-of-the-art models show significant performance improvements across question answering, video captioning, and video-text retrieval tasks.
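To make the query-guided selection idea concrete, here is a minimal PyTorch sketch: a pooled query embedding scores the temporal segments via attention, and only the highest-scoring segments are retained for the downstream MFM. This is not the repository's implementation (which lives in each model's `main_alvs.py`); the class name, projections, and top-k parameter are illustrative assumptions.

```python
# Minimal, hypothetical sketch of query-guided segment selection (illustration only).
import torch
import torch.nn as nn


class QueryGuidedSegmentSelector(nn.Module):
    """Score temporal segments against a query and keep only the top-k."""

    def __init__(self, dim: int, top_k: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.top_k = top_k
        self.scale = dim ** -0.5

    def forward(self, query_emb: torch.Tensor, segment_emb: torch.Tensor) -> torch.Tensor:
        # query_emb:   (B, D)    pooled embedding of the text query
        # segment_emb: (B, T, D) per-segment video/audio features
        q = self.q_proj(query_emb).unsqueeze(1)                    # (B, 1, D)
        k = self.k_proj(segment_emb)                               # (B, T, D)
        scores = (q @ k.transpose(1, 2)).squeeze(1) * self.scale   # (B, T) query-segment relevance
        k_keep = min(self.top_k, segment_emb.size(1))
        idx = scores.topk(k_keep, dim=-1).indices                  # indices of the most relevant segments
        idx = idx.unsqueeze(-1).expand(-1, -1, segment_emb.size(-1))
        return segment_emb.gather(1, idx)                          # (B, k_keep, D) retained segments


# Example: keep the 8 most query-relevant of 64 segments before feeding them to an MFM.
selector = QueryGuidedSegmentSelector(dim=256, top_k=8)
refined = selector(torch.randn(2, 256), torch.randn(2, 64, 256))
print(refined.shape)  # torch.Size([2, 8, 256])
```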
This repository includes implementations of the Temporal Working Memory (TWM) mechanism applied to nine state-of-the-art models. The steps to run the code are as follows:
- Download the repository: Clone this repository to your local environment.
- Data Preprocessing: Prepare data following the preprocessing steps in each original model repository.
- Training Temporal Working Memory (TWM): For each model, adjust the number of training epochs and the relevant model-specific hyperparameters in the `main_alvs.py` file within that model's directory. Follow the recommendations in each model's original paper for parameter settings, then train TWM.
- Inference: Set `epochs = 0` in each model's `main_alvs.py` file and run it to apply TWM (see the hedged sketch after this list for the general train/inference pattern).
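The exact contents of `main_alvs.py` differ per model and are the source of truth; the sketch below is only a hypothetical, self-contained illustration of the `epochs`-based train/inference switch described above, with a placeholder module, data, loss, and checkpoint path.

```python
# Hypothetical illustration of the epochs-based switch (not the repository's main_alvs.py).
import torch
import torch.nn as nn

epochs = 0                                  # > 0: train TWM; 0: skip training and run inference only

twm = nn.Linear(256, 256)                   # stand-in for the TWM module attached to a backbone MFM
optimizer = torch.optim.AdamW(twm.parameters(), lr=1e-4)
features = torch.randn(4, 16, 256)          # placeholder (batch, segments, dim) features

if epochs > 0:
    twm.train()
    for _ in range(epochs):
        loss = twm(features).pow(2).mean()  # placeholder objective for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    torch.save(twm.state_dict(), "twm_checkpoint.pt")
else:
    # Inference: reuse previously trained TWM weights (uncomment once a checkpoint exists),
    # then run the model's own evaluation pipeline on the refined features.
    # twm.load_state_dict(torch.load("twm_checkpoint.pt"))
    twm.eval()
    with torch.no_grad():
        refined = twm(features)
    print(refined.shape)  # torch.Size([4, 16, 256])
```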
- Vision Transformers are Parameter-Efficient Audio-Visual Learners (LAVisH)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks (DG-SCT)
- Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering (LAST-Att)
- GIT: A Generative Image-to-text Transformer for Vision and Language (GIT)
- Action Knowledge for Video Captioning with Graph Neural Networks (AKGNN)
- NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative (CEN)
- VindLU: A Recipe for Effective Video-and-Language Pretraining (VINDLU)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding (TESTA)
- Learning Video Context as Interleaved Multimodal Sequences (MovieSeq)
We thank the authors of the open-source works listed above for their outstanding contributions.
If our work has been helpful to your research, we would appreciate it if you could cite the following paper:
@inproceedings{diao2025twm,
title={Temporal Working Memory: Query-Guided Segment Refinement for
Enhanced Multimodal Understanding},
author={Diao, Xingjian and Zhang, Chunhui and Wu, Weiyi and Ouyang, Zhongyu and Qing, Peijun and Cheng, Ming and Vosoughi, Soroush and Gui, Jiang},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
year={2025}
}
If you have any questions, suggestions, or bug reports, please contact
[email protected]