What’s the relation between the reference audio and the model in terms of quality? #645
knochenhans asked this question in Q&A (unanswered, 0 replies)
Hi, first off, I have no machine learning background, so the technical side is honestly over my head. I mostly use TTS systems (coming from Piper TTS) in personal coding projects like blog and audiobook creation tools.
That being said, I wonder how the reference audio and the actual model relate in terms of output quality. My initial impression was that the reference audio just provides a kind of "audio skin" for the model, but after playing around with multiple reference audio files taken from TV, podcasts, and commercial audiobooks, I noticed the output quality actually varies greatly rather than just differing in mood and personality. It's literally a night and day difference sometimes.
Is this mostly about how clean the reference recording is (background noise, compression, microphone distance, etc.), or is the output also influenced by how consistently the speaker intonates words and sentences?
I’m mainly asking to find out what to look out for when picking reference voices, maybe even from the same source. Are there any guidelines?
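For context, here's roughly how I've been sanity-checking candidate clips before trying them as references. This is just my own Python/librosa sketch, not anything taken from this project; the file name, function name, and the `top_db` threshold are placeholders. It only reports the sample rate, duration, peak level, and how much of the clip is actually speech:

```python
# Minimal sketch (my own assumptions, not an official recommendation):
# quick sanity checks on a candidate reference clip.
import librosa
import numpy as np

def inspect_reference(path, top_db=30):
    # Load at the file's native sample rate, downmixed to mono.
    y, sr = librosa.load(path, sr=None, mono=True)
    duration = len(y) / sr

    # Peak level: values at or very near 1.0 usually indicate clipping.
    peak = float(np.max(np.abs(y)))

    # Rough "speech density": fraction of samples left after removing
    # stretches quieter than top_db below the peak.
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced = sum(end - start for start, end in intervals) / len(y)

    print(f"{path}: {duration:.1f}s @ {sr} Hz, peak {peak:.2f}, "
          f"voiced ratio {voiced:.2f}")

inspect_reference("reference.wav")
```

The idea is just to rule out obviously clipped or mostly silent clips before comparing voices, not to measure anything about the speaker's delivery.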