What’s the relation between the reference audio and the model in terms of quality? #645
knochenhans asked this question in Q&A (unanswered, 0 replies)
Hi, first off, I have no machine learning background, so the technical side is honestly over my head. I mostly use TTS systems (coming from Piper TTS) in personal coding projects like blog and audiobook creation tools.
That being said, I wonder how the reference audio and the actual model relate in terms of output quality. My initial impression was that the reference audio just provides a kind of "audio skin" for the model, but after playing around with multiple reference audio files taken from TV, podcasts, and commercial audiobooks, I noticed the output quality actually varies greatly rather than just differing in mood and personality. It's literally a night and day difference sometimes.
Is this mostly about how clean the reference recording is (background noise, compression, microphone distance, etc.), or is the output also influenced by how consistently the speaker intonates words and sentences?
I’m mainly asking to find out what to look out for when picking reference voices, maybe even from the same source. Are there any guidelines?
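For context, here's roughly how I've been sanity-checking candidate clips before trying them as references. This is just my own Python/librosa sketch, not anything taken from this project; the file name, function name, and the `top_db` threshold are placeholders. It only reports the sample rate, duration, peak level, and how much of the clip is actually speech:

```python
# Minimal sketch (my own assumptions, not an official recommendation):
# quick sanity checks on a candidate reference clip.
import librosa
import numpy as np

def inspect_reference(path, top_db=30):
    # Load at the file's native sample rate, downmixed to mono.
    y, sr = librosa.load(path, sr=None, mono=True)
    duration = len(y) / sr

    # Peak level: values at or very near 1.0 usually indicate clipping.
    peak = float(np.max(np.abs(y)))

    # Rough "speech density": fraction of samples left after removing
    # stretches quieter than top_db below the peak.
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced = sum(end - start for start, end in intervals) / len(y)

    print(f"{path}: {duration:.1f}s @ {sr} Hz, peak {peak:.2f}, "
          f"voiced ratio {voiced:.2f}")

inspect_reference("reference.wav")
```

The idea is just to rule out obviously clipped or mostly silent clips before comparing voices, not to measure anything about the speaker's delivery.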