ADAPTING SELF-SUPERVISED MODELS TO MULTI-TALKER SPEECH RECOGNITION USING SPEAKER EMBEDDINGS

HuangZiliAndy/SSL_for_multitalker

We provide the code and models for our ICASSP 2023 paper "Adapting self-supervised models to multi-talker speech recognition using speaker embeddings".

Requirements and Installation

  • Python version == 3.7
  • torch==1.10.0, torchaudio==0.10.0
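A minimal environment sketch matching the pins above, assuming conda is available (the environment name is arbitrary and the torch wheel should match your CUDA setup):

# Create and activate an isolated Python 3.7 environment
conda create -n ssl_multitalker python=3.7
conda activate ssl_multitalker

# Install the pinned torch/torchaudio versions
pip install torch==1.10.0 torchaudio==0.10.0
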
# Install fairseq
git clone -b multispk --single-branch https://github.com/HuangZiliAndy/fairseq.git
cd fairseq
pip install --editable ./

# Install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

pip install -r requirements.txt 
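A quick sanity check that the editable fairseq install and the pinned torch/torchaudio versions are importable (a minimal sketch):

python -c "import torch, torchaudio, fairseq; print(torch.__version__, torchaudio.__version__)"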

Data preparation

# Prepare LibriMix (https://github.com/JorisCos/LibriMix)
# We only need the 16k max condition in our experiments; train-360
# is not needed.
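
# A minimal sketch of the LibriMix generation step (the storage path below is
# a placeholder; see the LibriMix README for the full options):
git clone https://github.com/JorisCos/LibriMix
cd LibriMix
./generate_librimix.sh <librimix_storage_dir>
cd ..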

# Install Kaldi (https://github.com/kaldi-asr/kaldi)

# Link utils to current directory
ln -s <kaldi_dir>/egs/wsj/s5/utils .

# Run the following two scripts to prepare fairseq-style
# training data for LibriMix

# The difference between the two scripts is that the former uses
# forced-alignment results to create tight utterance boundaries
# (utterance-based evaluation), while the latter prepares the
# full-length mixtures (utterance group-based evaluation)
./myscripts/LibriMix/prepare_librimix.sh
./myscripts/LibriMix/prepare_librimix_full_len.sh

Extract speaker embeddings for the enrollment utterances. We use 15 seconds of speech from LibriVox (not included in LibriSpeech) as the enrollment utterances (LS 15 seconds enrollment). We also provide the extracted x-vector embeddings.
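The exact storage format of the embeddings depends on your extraction pipeline; as a purely hypothetical illustration (the scp path below is invented), Kaldi-style x-vectors can be inspected with kaldiio:

# Print the ID and shape of one enrollment x-vector (hypothetical path; requires `pip install kaldiio`)
python -c "import kaldiio; embs = kaldiio.load_scp('downloads/enroll_xvector.scp'); k = next(iter(embs)); print(k, embs[k].shape)"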

Training

Download the WavLM models and put them under the downloads directory
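A minimal sketch of the expected layout (the checkpoint filename matches the conversion command further below; WavLM Base+ can be obtained from the official WavLM release at https://github.com/microsoft/unilm/tree/master/wavlm):

mkdir -p downloads
# place the checkpoint at downloads/WavLM-Base+.pt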

We offer a few example scripts for training.

# Utterance-based evaluation (WavLM Base+ without speaker embedding)
./train_scripts/LS_wavLM.sh

# Utterance-based evaluation (WavLM Base+ with speaker embedding)
./train_scripts/LS_wavLM_spk.sh

# Utterance group-based evaluation (WavLM Base+ with speaker embedding)
./train_scripts/LS_full_len_wavLM_spk.sh

# Utterance group-based evaluation (WavLM Base+ with speaker embedding + Joint Speaker Modeling (JSM))
./train_scripts/LS_full_len_wavLM_spk_JSM.sh
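Each script can be launched directly; a usage sketch (the GPU selection is only an example for your environment, the training configuration itself lives inside the scripts):

CUDA_VISIBLE_DEVICES=0,1 ./train_scripts/LS_wavLM_spk.sh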

Evaluation

# Utterance-based evaluation (with and without speaker embedding)
./eval_scripts/LS.sh

# Utterance group-based evaluation (WavLM Base+ with speaker embedding)
./eval_scripts/LS_full_len.sh

# Utterance group-based evaluation (WavLM Base+ with speaker embedding + JSM)
./eval_scripts/LS_full_len_JSM.sh

Pretrained models

Utterance-based evaluation (WavLM Base+ without speaker embedding)

Utterance-based evaluation (WavLM Base+ with speaker embedding)

Utterance group-based evaluation (WavLM Base+ with speaker embedding)

Utterance group-based evaluation (WavLM Base+ with speaker embedding + JSM)

When running inference with a pretrained model, please first convert the checkpoint using

python myscripts/convert_model.py <model_dir>/checkpoint_last.pt downloads/WavLM-Base+.pt <model_dir>/checkpoint_last_tmp.pt
mv <model_dir>/checkpoint_last_tmp.pt <model_dir>/checkpoint_last.pt
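If several checkpoints need to be evaluated, the same conversion can be applied in a loop (a sketch; <model_dir> is the same placeholder as above):

for ckpt in <model_dir>/checkpoint_*.pt; do
  python myscripts/convert_model.py "$ckpt" downloads/WavLM-Base+.pt "${ckpt%.pt}_tmp.pt"
  mv "${ckpt%.pt}_tmp.pt" "$ckpt"
done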

Citation

Please cite as:

@inproceedings{huang2023adapting,
  title={Adapting self-supervised models to multi-talker speech recognition using speaker embeddings},
  author={Huang, Zili and Raj, Desh and Garc{\'\i}a, Paola and Khudanpur, Sanjeev},
  booktitle={IEEE ICASSP},
  year={2023},
}
