A one-stop shop to track all open-access/open-source TTS models as they come out. Feel free to make a PR for any that aren't linked here.
This is meant as a resource to raise awareness of these models and to make it easier for researchers, developers, and enthusiasts to stay informed about the latest advances in the field.
Note
This repo only tracks TTS models with open-source or open-access codebases. More motivation for everyone to open source! 🤗

Name | GitHub 💻 | Weights ⚖ | License 🧾 | Fine-tune 👤 | Languages | Paper 📄 | Demo 🗣️ | Issues 📚 |
---|---|---|---|---|---|---|---|---|
XTTS | Repo | 🤗 Hub | CPML | Yes | Multilingual | Technical notes | 🤗 Space | |
TorToiSe TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Technical report | 🤗 Space | |
VITS/ MMS-TTS | Repo | 🤗 Hub / MMS | Apache 2.0 | Yes | English | Paper | 🤗 Space | |
Pheme | Repo | 🤗 Hub | CC-BY | Yes | English | Paper | 🤗 Space | |
OpenVoice | Repo | 🤗 Hub | CC-BY-NC 4.0 | No | ZH + EN | Paper | 🤗 Space | |
IMS-Toucan | Repo | GH release | Apache 2.0 | Yes | Multilingual | Paper | 🤗 Space | |
Matcha-TTS | Repo | GDrive | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
pflowTTS | Unofficial Repo | GDrive | MIT | Yes | English | Paper | Not Available | GPL-licensed phonemizer |
StyleTTS 2 | Repo | 🤗 Hub | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
VALL-E | Unofficial Repo | Not Available | MIT | Yes | NA | Paper | Not Available | |
HierSpeech++ | Repo | GDrive | CC-BY-NC-SA 4.0 | No | KR + EN | Paper | 🤗 Space | |
Bark | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
EmotiVoice | Repo | GDrive | Apache 2.0 | Yes | ZH + EN | Not Available | Not Available | Separate GUI agreement |
Amphion | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
xVASynth | Repo | GH commit | GPL-3.0 | Yes | Multilingual | Paper | Not Available | Copyrighted materials used for training |
OverFlow TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
Neural-HMM TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
Tacotron 2 | Unofficial Repo | GDrive | BSD-3 | Yes | English | Paper | Webpage | |
Glow-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
Silero | Repo | GH links | CC BY-NC-SA | No | EN + DE + ES + EA | Not Available | Not Available | Non Commercial |
MahaTTS | Repo | 🤗 Hub | Apache 2.0 | No | English, Hindi, Indian English, Bengali, Tamil, Telugu, Punjabi, Marathi, Gujarati, Assamese | Not Available | Recordings, Colab | |
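
Several of the models above publish weights on the 🤗 Hub. As a rough quick-start, here is a minimal sketch of loading the MMS-TTS English checkpoint with the `transformers` library (the checkpoint ID `facebook/mms-tts-eng` and the save-to-WAV step are illustrative; check each model's own card for its exact loading instructions):

```python
# Minimal sketch: synthesize speech with an MMS-TTS checkpoint from the 🤗 Hub.
# Assumes `pip install transformers torch scipy`; other models in the table have their own APIs.
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-eng"  # example English checkpoint
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello from the open TTS tracker!", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (batch, samples)

scipy.io.wavfile.write("output.wav", rate=model.config.sampling_rate,
                       data=waveform.squeeze().numpy())
```
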
Capability specifics

Name | Processor ⚡ | Phonetic alphabet 👄 | Insta-clone 👥 | Emotional control 🎭 | Prompting 📖 | Streaming support 🌊 | Speech control 🎚 | S2S support 🦜 |
---|---|---|---|---|---|---|---|---|
XTTS | ||||||||
TorToiSe TTS | ||||||||
VITS/ MMS-TTS | ||||||||
Pheme | ||||||||
OpenVoice | ||||||||
IMS-Toucan | ||||||||
Matcha-TTS | ||||||||
pflowTTS | ||||||||
StyleTTS 2 | ||||||||
VALL-E | ||||||||
HierSpeech++ | ||||||||
Bark | ||||||||
EmotiVoice | ||||||||
Amphion | ||||||||
xVASynth | CPU / CUDA | ARPAbet | | 4-type 🎭 😡😃😭😯 per-phoneme | | | speed / pitch / energy / 🎭 🎚 per-phoneme | 🦜 |
OverFlow TTS | ||||||||
Neural-HMM TTS | ||||||||
Tacotron 2 | ||||||||
Glow-TTS | ||||||||
Silero | ||||||||
MahaTTS | ||||||||
- Processor - CPU/CUDA/ROCm (single/multi)
- Phonetic alphabet - None/IPA/ARPAbet (phonetic transcription that allows controlling the pronunciation of certain words)
- Insta-clone - Yes/No (quick voice cloning from just a few audio samples)
- Emotional control - Yes/Strict/No (Strict meaning the model cannot blend between emotional states)
- Prompting - Yes/No (a side effect of narrator-based datasets and a way to affect the emotional state; see the ElevenLabs docs)
- Streaming support - Yes/No (whether audio can be played back while it is still being generated; see the sketch after this list)
- Speech control - speed/pitch/… (ability to change the pitch, duration, energy and/or emotion of generated speech)
- Speech-To-Speech support - Yes/No (streaming support implies real-time S2S)
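
To make the streaming entry above concrete, here is a rough, model-agnostic sketch of chunked playback. `generate_chunks` is a hypothetical placeholder for whatever streaming inference API a given model exposes, and the sample rate is just an example (assumes `pip install sounddevice numpy`):

```python
# Rough sketch of streaming playback: each audio chunk is played as soon as it arrives,
# instead of waiting for the full utterance to finish synthesizing.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24_000  # example rate; use whatever the model actually outputs

def generate_chunks(text: str):
    """Placeholder: yield float32 numpy chunks of synthesized audio for `text`."""
    for _ in range(10):
        yield np.zeros(SAMPLE_RATE // 10, dtype=np.float32)  # replace with real model output

with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    for chunk in generate_chunks("Streaming means playback starts before synthesis finishes."):
        stream.write(chunk.reshape(-1, 1))  # blocks until the chunk is queued to the device
```
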
Help make this list more complete. Create demos on the Hugging Face Hub and link them here :) Got any questions? Drop me a DM on Twitter @reach_vb.
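
If you want to contribute a demo, a minimal Gradio app (the kind that can be hosted as a 🤗 Space) can look roughly like the sketch below. `synthesize` is a hypothetical placeholder; wire it up to whichever model from the table you want to showcase:

```python
# Minimal sketch of a Gradio demo suitable for hosting as a 🤗 Space.
import numpy as np
import gradio as gr

SAMPLE_RATE = 24_000  # example rate; match the model's actual output rate

def synthesize(text: str):
    # Replace this silence with a real model call returning a float32 waveform.
    waveform = np.zeros(SAMPLE_RATE, dtype=np.float32)
    return SAMPLE_RATE, waveform  # Gradio's Audio output accepts (sample_rate, numpy_array)

demo = gr.Interface(fn=synthesize,
                    inputs=gr.Textbox(label="Text"),
                    outputs=gr.Audio(label="Speech"))

if __name__ == "__main__":
    demo.launch()
```
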