This repository provides the PyTorch implementation of our work: ADU-Bench: Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models.
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to engage in back-and-forth audio dialogues with humans. This progression not only underscores the potential of LALMs but also broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark for evaluating the performance of LALMs on open-ended audio dialogue understanding is still missing. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs across 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where the same literal sentence can express different intentions, e.g., "Really!?" spoken with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 13 LALMs, our analysis reveals that there is still considerable room for improvement in the audio dialogue understanding abilities of existing LALMs. In particular, they struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones.
Load the ADU-Bench dataset directly from the official page on Hugging Face.
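For example, a minimal sketch of loading the data with the Hugging Face `datasets` library is shown below; the repository ID and split name are placeholders, so check the official ADU-Bench dataset page for the exact identifiers.

```python
# Minimal sketch: load ADU-Bench from the Hugging Face Hub via the `datasets` library.
# The repository ID and split name below are placeholders -- replace them with the
# identifiers listed on the official ADU-Bench dataset page.
from datasets import load_dataset

adu_bench = load_dataset("ADU-Bench/ADU-Bench", split="test")  # placeholder repo ID / split
print(adu_bench[0])  # each entry pairs an audio dialogue with its reference response
```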
To evaluate your model using ADU-Bench, follow these steps:
- Write the code to load your model in the `generate.py` file. We provide three examples for `blsp`, `SALMONN`, and `Qwen-Audio-Chat`. A minimal wrapper sketch is shown after this list.
- Generate your model's responses on the ADU-Bench dataset by running `bash scripts/generate.sh`.
- Once the responses are generated, calculate the evaluation score by executing `bash scripts/evaluation.sh`.
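As a starting point for the first step, here is a hypothetical template for a model wrapper in `generate.py`. The exact interface expected by the script depends on the repository code, and the class and method names below are placeholders to adapt to your own LALM (the provided `blsp`, `SALMONN`, and `Qwen-Audio-Chat` examples show the real interface).

```python
# Hypothetical template -- adapt the names and the loading/inference logic
# to whatever interface your copy of generate.py expects.

class MyLALM:
    """Placeholder wrapper: load your LALM once, then answer one audio file at a time."""

    def __init__(self, checkpoint_path: str):
        self.checkpoint_path = checkpoint_path
        # TODO: load your model weights, audio processor, and tokenizer here.

    def generate(self, audio_path: str, prompt: str = "") -> str:
        # TODO: run your LALM on the audio dialogue (plus an optional text prompt)
        # and return the text response that will be scored by scripts/evaluation.sh.
        raise NotImplementedError("Plug in your LALM's inference call here.")
```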
These steps will allow you to benchmark your model's performance on the ADU-Bench dataset.
@article{gao2024benchmarking,
title={Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models},
author={Gao, Kuofeng and Xia, Shu-Tao and Xu, Ke and Torr, Philip and Gu, Jindong},
journal={arXiv preprint arXiv:2412.05167},
year={2024}
}
This repository is mainly based on blsp, SALMONN, and Qwen-Audio. Thanks for their wonderful work!