We provide the code and models for our ICASSP 2023 paper "Adapting self-supervised models to multi-talker speech recognition using speaker embeddings".
- Python version == 3.7
- torch==1.10.0, torchaudio==0.10.0
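For example, a minimal environment setup could look like the following (a sketch assuming conda is available; the environment name ssl_mtasr is arbitrary, and you may need a torch wheel matching your CUDA version):

conda create -n ssl_mtasr python=3.7
conda activate ssl_mtasr
pip install torch==1.10.0 torchaudio==0.10.0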
# Install fairseq
git clone -b multispk --single-branch https://github.com/HuangZiliAndy/fairseq.git
cd fairseq
pip install --editable ./
cd ..
# Install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" --global-option="--xentropy" ./
cd ..

# Install the remaining requirements
pip install -r requirements.txt
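# Optional: a quick check that fairseq and apex import cleanly
python -c "import fairseq, apex; print(fairseq.__version__)"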
# Prepare LibriMix (https://github.com/JorisCos/LibriMix)
# We only need the 16k max condition in our experiments;
# train-360 is not needed.
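# For example (a sketch based on the LibriMix repository's own instructions;
# <librimix_storage_dir> is a placeholder for the output location):
git clone https://github.com/JorisCos/LibriMix
cd LibriMix
./generate_librimix.sh <librimix_storage_dir>
cd ..
# You can edit generate_librimix.sh to generate only the 16k max condition
# and skip train-360 to save time and disk space.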
# Install Kaldi (https://github.com/kaldi-asr/kaldi)
# Link utils to current directory
ln -s <kaldi_dir>/egs/wsj/s5/utils .
# Use the following two scripts to prepare fairseq-style
# training data for LibriMix.
# The difference between them is that the former uses
# forced-alignment results to create tight utterance boundaries
# (utterance-based evaluation), while the latter prepares
# full-length data (utterance group-based evaluation).
./myscripts/LibriMix/prepare_librimix.sh
./myscripts/LibriMix/prepare_librimix_full_len.sh
Extract speaker embeddings for the enrollment utterances. We use 15 seconds of speech from LibriVox (not included in LibriSpeech) as the enrollment utterances (LS 15 seconds enrollment). We also provide the extracted x-vector embeddings.
Download the wavLM models and put them under the downloads directory.
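For example (a minimal sketch; the checkpoint file name follows the conversion command shown below):

mkdir -p downloads
# place the downloaded checkpoint here, e.g.
# mv /path/to/WavLM-Base+.pt downloads/WavLM-Base+.pt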
We offer a few example scripts for training.
# Utterance-based evaluation (wavLM Base+ without speaker embedding)
./train_scripts/LS_wavLM.sh
# Utterance-based evaluation (wavLM Base+ with speaker embedding)
./train_scripts/LS_wavLM_spk.sh
# Utterance group-based evaluation (wavLM Base+ with speaker embedding)
./train_scripts/LS_full_len_wavLM_spk.sh
# Utterance group-based evaluation (wavLM Base+ with speaker embedding + Joint Speaker Modeling (JSM))
./train_scripts/LS_full_len_wavLM_spk_JSM.sh
We also offer example scripts for evaluation.
# Utterance-based evaluation (wavLM Base+ with and without speaker embedding)
./eval_scripts/LS.sh
# Utterance group-based evaluation (wavLM Base+ with speaker embedding)
./eval_scripts/LS_full_len.sh
# Utterance group-based evaluation (wavLM Base+ with speaker embedding + JSM)
./eval_scripts/LS_full_len_JSM.sh
We also provide pretrained models for the following configurations:
- Utterance-based evaluation (wavLM Base+ without speaker embedding)
- Utterance-based evaluation (wavLM Base+ with speaker embedding)
- Utterance group-based evaluation (wavLM Base+ with speaker embedding)
- Utterance group-based evaluation (wavLM Base+ with speaker embedding + JSM)
When running inference with a pretrained model, please first convert it using
python myscripts/convert_model.py <model_dir>/checkpoint_last.pt downloads/WavLM-Base+.pt <model_dir>/checkpoint_last_tmp.pt
mv <model_dir>/checkpoint_last_tmp.pt <model_dir>/checkpoint_last.pt
Please cite as:
@inproceedings{huang2023adapting,
  title={Adapting self-supervised models to multi-talker speech recognition using speaker embeddings},
  author={Huang, Zili and Raj, Desh and Garc{\'\i}a, Paola and Khudanpur, Sanjeev},
  booktitle={IEEE ICASSP},
  year={2023},
}