WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing
Official PyTorch implementation and pretrained models of WavLM
- Oct 2021: released preprint on arXiv
Model | Pre-training Dataset | Fine-tuning Dataset | Download |
---|---|---|---|
WavLM Base | 960 hrs LibriSpeech | - | coming soon |
WavLM Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | coming soon |
WavLM Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | coming soon |
The authors are preparing simple, clear, and well-documented fine-tuning code for WavLM. The pre-trained models will also be released once the fine-tuning code is done. Stay tuned!
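Until the official checkpoints and fine-tuning code land, loading a model for feature extraction should look roughly like the sketch below. Note that the `WavLM`/`WavLMConfig` classes, the `extract_features` method, and the `cfg`/`model` checkpoint keys are assumptions about the upcoming release, not a confirmed API.

```python
import torch
from WavLM import WavLM, WavLMConfig  # assumed module layout of this repo

# Load a pre-trained checkpoint; the 'cfg' and 'model' keys are assumptions
# about how the released checkpoints will be packaged.
checkpoint = torch.load('WavLM-Large.pt', map_location='cpu')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# Extract frame-level representations from raw 16 kHz waveform input.
wav = torch.randn(1, 16000)  # one second of dummy audio
with torch.no_grad():
    features = model.extract_features(wav)[0]  # (batch, frames, hidden_dim)
```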
We also evaluate our models on typical speech processing benchmarks.
Evaluation on VoxCeleb (speaker verification, EER %)
Model | Pre-trained model frozen | Vox1-O | Vox1-E | Vox1-H |
---|---|---|---|---|
ECAPA-TDNN | - | 0.87 | 1.12 | 2.12 |
HuBERT large | Yes | 0.888 | 0.912 | 1.853 |
Wav2Vec2.0 (XLSR) | Yes | 0.915 | 0.945 | 1.895 |
UniSpeech-SAT large | Yes | 0.771 | 0.781 | 1.669 |
WavLM large | Yes | 0.638 | 0.687 | 1.457 |
HuBERT large | No | 0.585 | 0.654 | 1.342 |
Wav2Vec2.0 (XLSR) | No | 0.564 | 0.605 | 1.23 |
UniSpeech-SAT large | No | 0.564 | 0.561 | 1.23 |
WavLM large | No | 0.431 | 0.538 | 1.154 |
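The Vox1-O/E/H columns are equal error rates on the VoxCeleb1 original, extended, and hard trial lists. For reference, EER can be computed from per-trial scores with a simple threshold sweep; this is a generic NumPy sketch, not the scoring script behind the table above.

```python
import numpy as np

def compute_eer(scores, labels):
    """EER: the operating point where false acceptance rate (FAR)
    equals false rejection rate (FRR).

    scores: similarity per trial (higher = more likely same speaker)
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    """
    order = np.argsort(scores)[::-1]           # sweep thresholds high -> low
    labels = np.asarray(labels)[order]
    n_target = labels.sum()
    n_impostor = len(labels) - n_target
    far = np.cumsum(1 - labels) / n_impostor   # impostors accepted so far
    frr = 1 - np.cumsum(labels) / n_target     # targets still rejected
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

print(compute_eer([0.9, 0.8, 0.4, 0.1], [1, 0, 1, 0]))  # 0.5
```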
Evaluation on LibriCSS (WER %)
Model | 0S | 0L | OV10 | OV20 | OV30 | OV40 |
---|---|---|---|---|---|---|
Conformer (SOTA) | 4.5 | 4.4 | 6.2 | 8.5 | 11 | 12.6 |
HuBERT base | 4.7 | 4.6 | 6.1 | 7.9 | 10.6 | 12.3 |
UniSpeech-SAT base | 4.4 | 4.4 | 5.4 | 7.2 | 9.2 | 10.5 |
UniSpeech-SAT large | 4.3 | 4.2 | 5.0 | 6.3 | 8.2 | 8.8 |
WavLM base+ | 4.5 | 4.4 | 5.6 | 7.5 | 9.4 | 10.9 |
WavLM large | 4.2 | 4.1 | 4.8 | 5.8 | 7.4 | 8.5 |
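In the LibriCSS protocol, 0S and 0L are 0%-overlap sessions with short and long inter-utterance silence, and OVxx denotes xx% speaker overlap; all columns report word error rate. As a reference for the metric itself, WER is the word-level Levenshtein distance normalized by reference length. Below is a generic sketch, not the LibriCSS evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat down", "the cat sit"))  # 0.5 (1 sub + 1 del)
```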
Evaluation on CALLHOME (speaker diarization, DER %)
Model | spk_2 | spk_3 | spk_4 | spk_5 | spk_6 | spk_all |
---|---|---|---|---|---|---|
EEND-vector clustering | 7.96 | 11.93 | 16.38 | 21.21 | 23.1 | 12.49 |
EEND-EDA clustering (SOTA) | 7.11 | 11.88 | 14.37 | 25.95 | 21.95 | 11.84 |
HuBERT base | 7.93 | 12.07 | 15.21 | 19.59 | 23.32 | 12.63 |
HuBERT large | 7.39 | 11.97 | 15.76 | 19.82 | 22.10 | 12.40 |
UniSpeech-SAT large | 5.93 | 10.66 | 12.9 | 16.48 | 23.25 | 10.92 |
WavLM Base | 6.99 | 11.12 | 15.20 | 16.48 | 21.61 | 11.75 |
WavLM large | 6.46 | 10.69 | 11.84 | 12.89 | 20.70 | 10.35 |
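The spk_N columns break diarization error rate down by the number of speakers per recording, and spk_all is the overall score. One common way to compute DER is the pyannote.metrics package; the snippet below is a toy sketch with made-up segments, shown only to illustrate the metric's inputs.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference (ground-truth) and hypothesis diarizations.
reference = Annotation()
reference[Segment(0.0, 5.0)] = 'spk_A'
reference[Segment(5.0, 9.0)] = 'spk_B'

hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = 'spk_1'
hypothesis[Segment(4.5, 9.0)] = 'spk_2'

# DER = (missed speech + false alarm + speaker confusion) / total speech time.
metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.3f}")
```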
Evaluation on LibriSpeech
Please visit here for more of our pre-trained models.
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.
Microsoft Open Source Code of Conduct
If you find our work useful in your research, please cite the following paper:
@article{Chen2021WavLM,
  title={WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
  author={Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
  eprint={2110.13900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2021}
}
For help or issues using WavLM models, please submit a GitHub issue.
For other communications related to WavLM, please contact Yu Wu ([email protected]).