In our recent paper we propose the YourTTS model. YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve SOTA results in voice similarity with reasonable quality. This matters for synthesizing speakers whose voice or recording characteristics differ substantially from those seen during training.
Visit our website for audio samples.
All of our experiments were implemented in the Coqui TTS repository.
| Demo | URL |
| --- | --- |
| Zero-Shot TTS | link |
| Zero-Shot VC | link |
All released checkpoints are licensed under CC BY-NC-ND 4.0.
| Model | URL |
| --- | --- |
| Speaker Encoder | link |
| Exp 1. YourTTS-EN(VCTK) | link |
| Exp 1. YourTTS-EN(VCTK) + SCL | link |
| Exp 2. YourTTS-EN(VCTK)-PT | link |
| Exp 2. YourTTS-EN(VCTK)-PT + SCL | link |
| Exp 3. YourTTS-EN(VCTK)-PT-FR | link |
| Exp 3. YourTTS-EN(VCTK)-PT-FR SCL | link |
| Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL | link |
To use the YourTTS model released with 🐸 TTS for text-to-speech, run the following command:
tts --text "This is an example!" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --language_idx "en"
Here, "target_speaker_wav.wav" is an audio sample from the target speaker.
To use the YourTTS model released with 🐸 TTS for voice conversion, run the following command:
tts --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --reference_wav target_content_wav.wav --language_idx "en"
Here, "target_content_wav.wav" is the reference audio file whose speech is converted into the voice of the speaker in "target_speaker_wav.wav".
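When synthesizing many files, it can be convenient to assemble the two commands above from a script. The following is a minimal sketch, not part of the released code: the helper name `build_tts_command` is hypothetical, and it uses only the flags that appear in the commands above. It returns the argument list, which can then be passed to `subprocess.run`.

```python
import shlex

# Model name taken from the commands above.
MODEL_NAME = "tts_models/multilingual/multi-dataset/your_tts"


def build_tts_command(speaker_wav, language_idx="en", text=None, reference_wav=None):
    """Build the argument list for the `tts` CLI (hypothetical helper).

    Pass `text` for text-to-speech, or `reference_wav` for voice conversion.
    Exactly one of the two must be given.
    """
    if (text is None) == (reference_wav is None):
        raise ValueError("pass exactly one of `text` or `reference_wav`")

    cmd = ["tts"]
    if text is not None:
        cmd += ["--text", text]
    cmd += ["--model_name", MODEL_NAME, "--speaker_wav", speaker_wav]
    if reference_wav is not None:
        cmd += ["--reference_wav", reference_wav]
    cmd += ["--language_idx", language_idx]
    return cmd


if __name__ == "__main__":
    tts_cmd = build_tts_command("target_speaker_wav.wav", text="This is an example!")
    vc_cmd = build_tts_command(
        "target_speaker_wav.wav", reference_wav="target_content_wav.wav"
    )
    # Print shell-quoted commands; to execute one, use
    # subprocess.run(tts_cmd, check=True) with 🐸 TTS installed.
    print(shlex.join(tts_cmd))
    print(shlex.join(vc_cmd))
```

Building a list rather than a single string avoids shell-quoting problems when the text contains spaces or punctuation.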
To ensure replicability, we make the audio files used to generate the MOS available here. In addition, we provide the MOS for each audio file here.
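For readers who want to aggregate the per-audio scores themselves, the sketch below computes a mean opinion score with a 95% confidence interval. It is an illustration only: the function name `mos_with_ci` is hypothetical, the ratings in the usage example are made up, and the normal-approximation interval (z = 1.96) is an assumption that may differ from the procedure used for the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of opinion scores.

    Uses a normal approximation for the confidence interval:
    half_width = z * sample_std / sqrt(n). This is an assumed procedure,
    not necessarily the one used to produce the published MOS.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Made-up ratings on a 1-5 scale, for illustration only.
    example_ratings = [4, 5, 3, 4, 4]
    mean, half_width = mos_with_ci(example_ratings)
    print(f"MOS = {mean:.2f} ± {half_width:.2f}")
```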
To re-generate our MOS results, follow the instructions here. To predict the test sentences and generate the SECS, please use the Jupyter Notebooks available here.
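SECS (Speaker Encoder Cosine Similarity) is the cosine similarity between the speaker embeddings of two audio files; it ranges from -1 to 1, and larger values indicate more similar voices. The sketch below shows only that final comparison step. It assumes the embeddings have already been extracted with a speaker encoder and are available as plain lists of floats; the function name `secs` is hypothetical and is not taken from the notebooks.

```python
import math


def secs(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (hypothetical helper).

    Both embeddings are assumed to be equal-length sequences of floats
    produced by the same speaker encoder.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real speaker embeddings are much larger.
    reference = [0.2, 0.5, 0.1]
    synthesized = [0.4, 1.0, 0.2]  # same direction as `reference`
    print(secs(reference, synthesized))
```

Because cosine similarity ignores vector magnitude, two embeddings that point in the same direction score close to 1 regardless of their scale.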
LibriTTS (test clean): 1188, 1995, 260, 1284, 2300, 237, 908, 1580, 121 and 1089
VCTK: p261, p225, p294, p347, p238, p234, p248, p335, p245, p326 and p302
MLS Portuguese: 12710, 5677, 12249, 12287, 9351, 11995, 7925, 3050, 4367 and 13069
author = {{Casanova}, Edresson and {Weber}, Julian and {Shulby}, Christopher and {Junior}, Arnaldo Candido and {G{\"o}lge}, Eren and {Antonelli Ponti}, Moacir},
title = "{YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone}",
journal = {arXiv e-prints},
keywords = {Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing},
year = 2021,
month = dec,
eid = {arXiv:2112.02418},
pages = {arXiv:2112.02418},
archivePrefix = {arXiv},
eprint = {2112.02418},
primaryClass = {cs.SD},
adsurl = {},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}