Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

This repository contains the code and pretrained models for the following INTERSPEECH 2024 paper:

  • Title: Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection
  • Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Pretrained Model

The pretrained XLSR model can be found at this link.

We have uploaded the pretrained models from our experiments; you can download them from OneDrive.
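
A downloaded checkpoint can be loaded as a minimal sketch like the one below. This assumes the checkpoint is a standard PyTorch state dict and that the model class is named Model in this repository's model.py; the constructor arguments shown are hypothetical and should be taken from the actual code.

import torch
from model import Model  # assumed module and class name from this repository

# Load the downloaded checkpoint onto CPU first, then move to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Model(args=None, device=device)  # constructor arguments are hypothetical
state_dict = torch.load("path_to/model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.to(device)
model.eval()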

Setting up environment

Python version: 3.7.16

Install PyTorch

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Install other libraries:

pip install -r requirements.txt

Install fairseq:

git clone https://github.com/facebookresearch/fairseq.git fairseq_dir
cd fairseq_dir
git checkout a54021305d6b3c
pip install --editable ./
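
As a quick sanity check of the environment (a minimal sketch; it only verifies that the pinned packages import and that a GPU is visible), you can run:

import torch, torchaudio, fairseq

# Versions should match the pinned installs above (torch 1.8.1+cu111, torchaudio 0.8.1).
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("fairseq:", fairseq.__version__)
print("CUDA available:", torch.cuda.is_available())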

Training & Testing on fixed-length input

To train and produce the score for LA set evaluation, run:

python main.py --algo 5

To train and produce the score for DF set evaluation, run:

python main.py --algo 3
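
The produced score file is what the evaluation package in the next section consumes. As a rough sketch for inspecting it (assuming each line holds an utterance ID followed by a countermeasure score, the usual ASVspoof convention; the file name here is a placeholder):

# Peek at the first few lines of a produced score file; each line is assumed
# to contain an utterance ID followed by its countermeasure (CM) score.
with open("your_LA_score.txt") as f:
    for _, line in zip(range(5), f):
        utt_id, score = line.split()
        print(utt_id, float(score))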

Scoring

To get evaluation results of minimum t-DCF and EER (Equal Error Rate), follow these steps:

cd 2021/eval-package
python main.py --cm-score-file your_LA_score.txt --track LA --subset eval # For LA track evaluation
python main.py --cm-score-file your_DF_score.txt --track DF --subset eval # For DF track evaluation
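
For reference, here is a minimal sketch of how the EER (and the corresponding threshold used later for inference) is typically derived from bonafide and spoof CM scores. This is not the eval package's exact implementation and assumes higher scores indicate bonafide speech:

import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    # Sweep every observed score as a candidate threshold and return the point
    # where the miss rate (bonafide rejected) and false alarm rate (spoof
    # accepted) are closest, i.e. the Equal Error Rate, plus that threshold.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(frr - far)))
    return (frr[idx] + far[idx]) / 2.0, thresholds[idx]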

Inference

To run inference on a single wav file with the pretrained model, run:

python inference.py --ckpt_path=path_to/model.pth --threshold=-3.73 --wav_path=path_to/audio.flac

The threshold can be obtained when calculating the EER on the LA or DF evaluation set. In this example, the threshold comes from the DF set evaluation.
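
The decision rule for a single file then reduces to a threshold comparison. A minimal sketch, again assuming higher CM scores indicate bonafide speech:

def classify(cm_score, threshold=-3.73):
    # Scores at or above the EER threshold are labelled bonafide, below it spoof.
    return "bonafide" if cm_score >= threshold else "spoof"

print(classify(0.12))   # -> bonafide
print(classify(-7.5))   # -> spoof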

Citation

If you find our repository valuable for your work, please consider giving a star to this repo and citing our paper:

@inproceedings{truong24b_interspeech,
  title     = {Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection},
  author    = {Duc-Tuan Truong and Ruijie Tao and Tuan Nguyen and Hieu-Thi Luong and Kong Aik Lee and Eng Siong Chng},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {537--541},
  doi       = {10.21437/Interspeech.2024-659},
  issn      = {2958-1796},
}

Acknowledgement

Our work is built upon the conformer-based-classifier-for-anti-spoofing. We also follow some parts of the following codebases:

SSL_Anti-spoofing (for the training pipeline).

conformer (for the Conformer model architecture).

DHVT (for the Head Token design).

Thanks to these authors for sharing their work!
