In our recent paper we propose the YourTTS model. YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with only a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, the YourTTS model can be fine-tuned with less than 1 minute of speech and still achieve state-of-the-art voice similarity with reasonable quality. This is important for enabling synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.
Visit our website for audio samples.
All of our experiments were implemented in the Coqui TTS repo.
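As a quick way to try the released model, the snippet below sketches zero-shot multi-speaker TTS inference through the Coqui TTS Python API. The model name, language codes, and argument names follow current Coqui TTS releases and may differ across versions; the file names are placeholders.

```python
# Minimal sketch of zero-shot multi-speaker TTS with the Coqui TTS Python API.
# Model name and keyword arguments follow recent Coqui TTS releases and may vary.
from TTS.api import TTS

# Download and load the released multilingual YourTTS model.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Clone the voice from a short reference recording of an unseen speaker.
tts.tts_to_file(
    text="This is zero-shot voice cloning with YourTTS.",
    speaker_wav="reference_speaker.wav",  # a few seconds of the target speaker
    language="en",                        # "en", "pt-br", or "fr-fr" for this model
    file_path="output.wav",
)
```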
| Demo | URL |
|---|---|
| Zero-Shot TTS | link |
| Zero-Shot VC | link |
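For zero-shot voice conversion, recent Coqui TTS releases expose a voice-conversion helper on the same API object. The call below is a sketch under that assumption, with placeholder file names.

```python
# Sketch of zero-shot voice conversion with the same YourTTS model.
# voice_conversion_to_file is available in recent Coqui TTS releases.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Convert the content of source.wav into the voice of target_speaker.wav.
tts.voice_conversion_to_file(
    source_wav="source.wav",          # speech whose content is kept
    target_wav="target_speaker.wav",  # reference for the desired voice
    file_path="converted.wav",
)
```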
All the released checkpoints are licensed under CC BY-NC-ND 4.0.
| Model | URL |
|---|---|
| Speaker Encoder | link |
| Exp 1. YourTTS-EN(VCTK) | link |
| Exp 1. YourTTS-EN(VCTK) + SCL | link |
| Exp 2. YourTTS-EN(VCTK)-PT | link |
| Exp 2. YourTTS-EN(VCTK)-PT + SCL | link |
| Exp 3. YourTTS-EN(VCTK)-PT-FR | link |
| Exp 3. YourTTS-EN(VCTK)-PT-FR SCL | link |
| Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL | link |
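If you prefer to run one of the downloaded checkpoints above instead of the model shipped with Coqui TTS, the same API also accepts a local model/config pair. This is a sketch under that assumption; the paths are placeholders for the checkpoint and config files linked in the table.

```python
# Sketch of loading a downloaded checkpoint/config pair directly.
# Paths are placeholders for the files linked in the table above.
from TTS.api import TTS

tts = TTS(model_path="best_model.pth", config_path="config.json")

tts.tts_to_file(
    text="Synthesized with a locally downloaded YourTTS checkpoint.",
    speaker_wav="reference_speaker.wav",
    language="en",
    file_path="output_local.wav",
)
```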
To ensure reproducibility, we make the audio samples used to compute the MOS available here. In addition, we provide the MOS scores for each audio sample here.
To reproduce our MOS results, follow the instructions here. To synthesize the test sentences and compute the SECS, please use the Jupyter notebooks available here.
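As a rough illustration of what the SECS metric measures, the sketch below computes the cosine similarity between the speaker embeddings of a ground-truth and a synthesized utterance using the Resemblyzer speaker encoder. The file names are placeholders, and the notebooks linked above remain the reference implementation for reproducing the paper's numbers.

```python
# Illustrative SECS computation: cosine similarity between speaker embeddings
# of a reference and a synthesized utterance (Resemblyzer speaker encoder).
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

ref_embed = encoder.embed_utterance(preprocess_wav("ground_truth.wav"))
syn_embed = encoder.embed_utterance(preprocess_wav("synthesized.wav"))

# Cosine similarity in [-1, 1]; higher means the two voices are more similar.
secs = np.dot(ref_embed, syn_embed) / (np.linalg.norm(ref_embed) * np.linalg.norm(syn_embed))
print(f"SECS: {secs:.3f}")
```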