🔥 AudioBench 🔥


⚡ A repository for evaluating AudioLLMs in various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡

Change log

  • Sep 2024: Added the MuChoMusic dataset for music evaluation (multiple-choice questions).
  • Aug 2024: Added support for several speech translation datasets and updated the evaluation script for the MCQ evaluations.
  • Aug 2024: The leaderboard is live. Check it out here.
  • July 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
  • July 2024: Added support for all 26 datasets listed in the AudioBench manuscript.

🔧 Installation

Installation with pip:

pip install -r requirements.txt

For model-as-judge evaluation, we serve the judge model as a service via vLLM on port 5000.
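
The host_model_judge_*.sh scripts in the Quick Start below handle this step for you. As a minimal sketch of what it amounts to (assuming vLLM's OpenAI-compatible server; the model ID and tensor-parallel size here are illustrative assumptions, not necessarily the exact values the scripts use):

# Sketch only: the host_model_judge_*.sh scripts are the supported path.
# The model ID and --tensor-parallel-size below are illustrative assumptions.
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-70b-instruct-awq \
    --port 5000 \
    --tensor-parallel-size 2

# Once the server is up, the OpenAI-compatible endpoint should list the served model:
curl http://localhost:5000/v1/models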

⏩ Quick Start

This example hosts a Llama-3-70B-Instruct model as the judge and runs the cascaded Whisper + Llama-3 model.

# Step 1:
# Serve the model-as-judge.
# The script auto-downloads the model and may require Hugging Face authentication.
# In the demo, we use 2 H100 80GB GPUs to host the model.
# With less VRAM, you may need to reduce the model size.
# bash host_model_judge_llama_3_70b_instruct.sh

# Another option (recommended) is the AWQ-quantized model, which can be hosted on 2x 40GB GPUs.
bash host_model_judge_llama_3_70b_instruct_awq.sh

# Step 2:
# This example uses 3 H100 80GB GPUs in total.
# AudioLLM inference runs on GPU 2, since GPUs 0 and 1 host the model-as-judge service.
# The settings below evaluate on only 50 samples.
MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
GPU=2
BATCH_SIZE=1
METRICS=llama3_70b_judge_binary
OVERWRITE=True
NUMBER_OF_SAMPLES=50

DATASET=cn_college_listen_mcq_test

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES

# Step 3:
# The results will look like:
# {
#    "llama3_70b_judge_binary": {
#        "judge_score": 90.0,
#        "success_rate": 1.0
#    }
# }
# This indicates that the cascade model achieves 90% accuracy on the English listening MCQ task.

This example shows how to get started. To evaluate on the full datasets, please refer to Examples.

# After the model weights are downloaded, run the evaluation script for all datasets
bash examples/eval_salmonn_7b.sh
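
If you prefer to drive the per-task scripts yourself, a rough sketch (assuming the examples/eval_*.sh scripts pick up the same environment variables shown in the dataset sections below) is to export the settings once and loop over the tasks you need:

export MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
export GPU=2
export BATCH_SIZE=1
export OVERWRITE=False
export NUMBER_OF_SAMPLES=-1   # -1 is used for full-dataset runs in the examples below

# Loop over whichever per-task scripts you need (all of these appear in the dataset sections below).
for TASK_SCRIPT in eval_sqa.sh eval_si.sh eval_st.sh; do
    bash examples/$TASK_SCRIPT
done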

📚 Supported Models and Datasets

Datasets

Speech Understanding

Audio Scene Understanding

Voice Understanding

ASR-English

Dataset Metrics Status
LibriSpeech-Clean Word-Error-Rate ✅
LibriSpeech-Other Word-Error-Rate ✅
CommonVoice-15-EN Word-Error-Rate ✅
Peoples-Speech Word-Error-Rate ✅
GigaSpeech Word-Error-Rate ✅
Earning21 Word-Error-Rate ✅
Earning22 Word-Error-Rate ✅
Tedlium3 Word-Error-Rate ✅
Tedlium3-Longform Word-Error-Rate ✅
export MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
export GPU=3
export BATCH_SIZE=1
export OVERWRITE=False
export NUMBER_OF_SAMPLES=-1
bash examples/eval_sqa.sh

SQA

Dataset Metrics Status
CN-College-Listen Model-as-Judge (binary) ✅
SLUE-P2-SQA5 Model-as-Judge ✅
DREAM-TTS Model-as-Judge (binary) ✅
Public-SG-SpeechQA Model-as-Judge ✅
Spoken-SQuAD Model-as-Judge ✅
bash examples/eval_sqa.sh

SI

Dataset Metrics Status
OpenHermes-Audio Model-as-Judge ✅
ALPACA-Audio Model-as-Judge ✅
bash examples/eval_si.sh

ST

Dataset Metrics Status
CoVost2-English-Indonesian BLEU ✅
CoVost2-English-Chinese BLEU ✅
CoVost2-English-Tamil BLEU ✅
CoVost2-Indonesian-English BLEU ✅
CoVost2-Chinese-English BLEU ✅
CoVost2-Tamil-English BLEU ✅
bash examples/eval_st.sh

ASR-Chinese

Dataset Metrics Status
AISHELL-ASR-ZH Word-Error-Rate ✅
bash examples/eval_asr_cn.sh

AC

Dataset Metrics Status
AudioCaps Model-as-Judge / METEOR ✅
WavCaps Model-as-Judge / METEOR ✅
bash examples/eval_ac.sh

ASQA

Dataset Metrics Status
Clotho-AQA Model-as-Judge ✅
AudioCaps-QA Model-as-Judge ✅
WavCaps-QA Model-as-Judge ✅
bash examples/eval_asqa.sh

AR

Dataset Metrics Status
VoxCeleb-Accent Model-as-Judge ✅
bash examples/eval_ar.sh

GR

Dataset Metrics Status
VoxCeleb-Gender Model-as-Judge (binary) ✅
IEMOCAP-Gender Model-as-Judge (binary) ✅
bash examples/eval_gr.sh

ER

Dataset Metrics Status
IEMOCAP-Emotion Model-as-Judge (binary) ✅
MELD-Sentiment Model-as-Judge (binary) ✅
MELD-Emotion Model-as-Judge (binary) ✅
bash examples/eval_er.sh

Music

Dataset Metrics Status
MuChoMusic Model-as-Judge (binary) ✅
bash examples/eval_music.sh

Models

Name Size Notes Status
Whisper-Large+Llama-3-8B-Instruct ~8B Cascade Models ✅
SALMONN ~7B End2End ✅
Qwen-Audio ~8B End2End TODO
WavLLM ~7B End2End TODO
Qwen2-Audio ~8B End2End TODO

More models are covered in this survey. To add a new model, please refer to Adding a New Model.

📖 Citation

If you find our work useful, please consider citing our paper!

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={arXiv preprint arXiv:2406.16020},
  year={2024}
}

Researchers, companies or groups that are using AudioBench: