In our recent paper, we propose the YourTTS model. YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important for enabling synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.
Come try our latest and greatest fullband English-only model at https://coqui.ai/
Visit our website for audio samples.
All of our experiments were implemented in the Coqui TTS repo.
Demo | URL |
---|---|
Zero-Shot TTS | link |
Zero-Shot VC | link |
Zero-Shot VC - Experiment 1 (trained with just VCTK) | link |
All the released checkpoints are licensed under CC BY-NC-ND 4.0.
Model | URL |
---|---|
Speaker Encoder | link |
Exp 1. YourTTS-EN(VCTK) | link |
Exp 1. YourTTS-EN(VCTK) + SCL | link |
Exp 2. YourTTS-EN(VCTK)-PT | link |
Exp 2. YourTTS-EN(VCTK)-PT + SCL | link |
Exp 3. YourTTS-EN(VCTK)-PT-FR | link |
Exp 3. YourTTS-EN(VCTK)-PT-FR + SCL | link |
Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR + SCL | link |
To use the YourTTS model released in 🐸 TTS v0.7.0 for text-to-speech, use the following command:
tts --text "This is an example!" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --language_idx "en"
Considering the "target_speaker_wav.wav" an audio sample from the target speaker.
To use the YourTTS model released in 🐸 TTS for voice conversion, use the following command:
tts --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --reference_wav target_content_wav.wav --language_idx "en"
Considering the "target_content_wav.wav" as the reference wave file to convert into the voice of the "target_speaker_wav.wav" speaker.
To ensure reproducibility, we make the audio samples used to generate the MOS available here. In addition, we provide the MOS for each audio sample here.
To re-generate our MOS results, follow the instructions here. To predict the test sentences and generate the SECS, please use the Jupyter Notebooks available here.
LibriTTS (test clean): 1188, 1995, 260, 1284, 2300, 237, 908, 1580, 121 and 1089
VCTK: p261, p225, p294, p347, p238, p234, p248, p335, p245, p326 and p302
MLS Portuguese: 12710, 5677, 12249, 12287, 9351, 11995, 7925, 3050, 4367 and 1306
The paper's experiments were run using my Coqui TTS fork on the multilingual-torchaudio-SE branch.
To replicate the training, use this branch together with the config.json provided with each checkpoint:
python3 TTS/bin/train_tts.py --config_path config.json
If you want to use the latest version of Coqui TTS, you can get the config.json from the Coqui released model.
With config.json in hand, you first need to adjust some of its paths, for example "datasets", "output_path" and "d_vector_file", as sketched below.
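These path adjustments can be scripted with the standard json module. This is only an illustration: the placeholder paths are hypothetical, and the nested dataset keys may differ slightly depending on the config version.

```python
import json

# Load the config.json shipped with the checkpoint.
with open("config.json") as f:
    config = json.load(f)

# Point the paths at your local setup (placeholder paths, adjust as needed).
config["output_path"] = "/path/to/training/output"
config["d_vector_file"] = "/path/to/d_vector_file.json"  # speaker embeddings, see below
for dataset in config["datasets"]:
    dataset["path"] = "/path/to/dataset/root"

# Write the adjusted config back to disk.
with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```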
In "d_vector_file" you need to pass the speaker embeddings of the speakers. To extract the speaker's embeddings use the following command:
python3 TTS/bin/compute_embeddings.py model_se.pth.tar config_se.json config.json d_vector_file.json
"model_se.pth.tar" and "config_se.json" can be found in Coqui released model while config.json is the config you set the paths for.
@ARTICLE{2021arXiv211202418C,
author = {{Casanova}, Edresson and {Weber}, Julian and {Shulby}, Christopher and {Junior}, Arnaldo Candido and {G{\"o}lge}, Eren and {Antonelli Ponti}, Moacir},
title = "{YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone}",
journal = {arXiv e-prints},
keywords = {Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing},
year = 2021,
month = dec,
eid = {arXiv:2112.02418},
pages = {arXiv:2112.02418},
archivePrefix = {arXiv},
eprint = {2112.02418},
primaryClass = {cs.SD},
adsurl = {https://ui.adsabs.harvard.edu/abs/2021arXiv211202418C},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
@inproceedings{casanova2022yourtts,
title={Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone},
author={Casanova, Edresson and Weber, Julian and Shulby, Christopher D and Junior, Arnaldo Candido and G{\"o}lge, Eren and Ponti, Moacir A},
booktitle={International Conference on Machine Learning},
pages={2709--2720},
year={2022},
organization={PMLR}
}