Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

This repository contains the code and pretrained models for the following INTERSPEECH 2024 paper:

  • Title: Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection
  • Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Pretrained Model

The pretrained XLSR model can be found at this link.

We have uploaded the pretrained models from our experiments; you can download them from OneDrive.
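
A downloaded checkpoint can be loaded as a minimal sketch like the one below. This assumes the checkpoint is a standard PyTorch state dict and that the model class is named Model in this repository's model.py; the constructor arguments shown are hypothetical and should be taken from the actual code.

import torch
from model import Model  # assumed module and class name from this repository

# Load the downloaded checkpoint onto CPU first, then move to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Model(args=None, device=device)  # constructor arguments are hypothetical
state_dict = torch.load("path_to/model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.to(device)
model.eval()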

Setting up environment

Python version: 3.7.16

Install PyTorch

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Install other libraries:

pip install -r requirements.txt

Install fairseq:

git clone https://github.com/facebookresearch/fairseq.git fairseq_dir
cd fairseq_dir
git checkout a54021305d6b3c
pip install --editable ./
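
As a quick sanity check of the environment (a minimal sketch; it only verifies that the pinned packages import and that a GPU is visible), you can run:

import torch, torchaudio, fairseq

# Versions should match the pinned installs above (torch 1.8.1+cu111, torchaudio 0.8.1).
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("fairseq:", fairseq.__version__)
print("CUDA available:", torch.cuda.is_available())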

Training & Testing on fixed-length input

To train and produce the score for LA set evaluation, run:

python main.py --algo 5

To train and produce the score for DF set evaluation, run:

python main.py --algo 3
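
The produced score file is what the evaluation package in the next section consumes. As a rough sketch for inspecting it (assuming each line holds an utterance ID followed by a countermeasure score, the usual ASVspoof convention; the file name here is a placeholder):

# Peek at the first few lines of a produced score file; each line is assumed
# to contain an utterance ID followed by its countermeasure (CM) score.
with open("your_LA_score.txt") as f:
    for _, line in zip(range(5), f):
        utt_id, score = line.split()
        print(utt_id, float(score))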

Scoring

To get evaluation results of minimum t-DCF and EER (Equal Error Rate), follow these steps:

cd 2021/eval-package
python main.py --cm-score-file your_LA_score.txt --track LA --subset eval # For LA track evaluation
python main.py --cm-score-file your_DF_score.txt --track DF --subset eval # For DF track evaluation
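
For reference, here is a minimal sketch of how the EER (and the corresponding threshold used later for inference) is typically derived from bonafide and spoof CM scores. This is not the eval package's exact implementation and assumes higher scores indicate bonafide speech:

import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    # Sweep every observed score as a candidate threshold and return the point
    # where the miss rate (bonafide rejected) and false alarm rate (spoof
    # accepted) are closest, i.e. the Equal Error Rate, plus that threshold.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(frr - far)))
    return (frr[idx] + far[idx]) / 2.0, thresholds[idx]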

Inference

To run inference on a single wav file with the pretrained model, run:

python inference.py --ckpt_path=path_to/model.pth --threshold=-3.73 --wav_path=path_to/audio.flac

The threshold can be obtained when calculating the EER on the LA or DF evaluation set. In this example, the threshold comes from the DF set evaluation.
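
The decision rule for a single file then reduces to a threshold comparison. A minimal sketch, again assuming higher CM scores indicate bonafide speech:

def classify(cm_score, threshold=-3.73):
    # Scores at or above the EER threshold are labelled bonafide, below it spoof.
    return "bonafide" if cm_score >= threshold else "spoof"

print(classify(0.12))   # -> bonafide
print(classify(-7.5))   # -> spoof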

Citation

If you find our repository valuable for your work, please consider giving a star to this repo and citing our paper:

@inproceedings{truong24b_interspeech,
  title     = {Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection},
  author    = {Duc-Tuan Truong and Ruijie Tao and Tuan Nguyen and Hieu-Thi Luong and Kong Aik Lee and Eng Siong Chng},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {537--541},
  doi       = {10.21437/Interspeech.2024-659},
  issn      = {2958-1796},
}

Acknowledgement

Our work is built upon the conformer-based-classifier-for-anti-spoofing. We also follow some parts of the following codebases:

SSL_Anti-spoofing (for the training pipeline).

conformer (for the Conformer model architecture).

DHVT (for the Head Token design).

Thanks to these authors for sharing their work!
