
the motivation for inserting blank IDs between the input IPA-ids? #94

Open
dbkest opened this issue Aug 25, 2024 · 1 comment

dbkest commented Aug 25, 2024

Hello, could you please help me understand the motivation for inserting blank IDs between the input IPA IDs? The implementation can be found in text_mel_datamodule.py, line 216:

def get_text(self, text, add_blank=True):
    # Normalise the text and convert it to a sequence of symbol IDs
    text_norm, cleaned_text = text_to_sequence(text, self.cleaners)
    if self.add_blank:  # True by default
        # Insert the blank ID (0) between, before, and after every symbol ID
        text_norm = intersperse(text_norm, 0)
    text_norm = torch.IntTensor(text_norm)
    return text_norm, cleaned_text
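
For reference, a minimal sketch of what intersperse does here (this follows the usual Glow-TTS-style helper; the actual implementation in the repo may differ slightly): the blank ID 0 is placed between and around every token.

def intersperse(lst, item):
    # Allocate 2*len(lst) + 1 slots filled with the blank item, then put
    # the original tokens at the odd indices: [item, x0, item, x1, ..., item]
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

For example, intersperse([5, 23, 17], 0) returns [0, 5, 0, 23, 0, 17, 0], so a sequence of n tokens becomes 2n + 1 IDs.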

thanks.

shivammehta25 (Owner) commented

Hello, that is a great question!

TL;DR:
The idea comes from using multiple states per phone in Hidden Markov Model (HMM) based speech synthesisers for better modelling. [Our previous works Neural-HMM and OverFlow also used this.] Since Monotonic Alignment Search (MAS), introduced in Glow-TTS, is a Viterbi approximation to the forward algorithm, the idea has its roots in the same literature: you can use multiple states to model the transition between different sounds.
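
To make the MAS connection concrete, here is a minimal NumPy sketch of the Viterbi-style dynamic programme MAS solves (simplified from the Glow-TTS formulation; the real implementation is a vectorised/Cython kernel): it finds the monotonic, no-skip alignment between text tokens and mel frames that maximises total log-likelihood, and every token, including each interspersed blank, receives at least one frame.

import numpy as np

def monotonic_alignment_search(log_probs):
    # log_probs[i, j]: log-likelihood of mel frame j under text token i
    T_text, T_mel = log_probs.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]  # keep emitting the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance one token (no skips)
            Q[i, j] = max(stay, move) + log_probs[i, j]
    # Backtrack the best monotonic path
    path = np.zeros((T_text, T_mel), dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, 0, -1):
        path[i, j] = 1
        if i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    path[i, 0] = 1  # for a valid alignment, i has reached 0 here
    return path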

More details:
In the days of Statistical Parametric Speech Synthesis (SPSS) (you can read more about it here, in section 2.2 right below equation 2.28), people used multiple states to model each phoneme. They found it beneficial to model certain dynamic features with more states, which was especially useful for sounds such as plosives (in English: p, t, k, b, d, g), where you have silence, a sudden burst of energy, and then silence again. These were hard to model for a left-to-right algorithm with no skips (like MAS) without multiple states representing them, since each state had its own emission parameters.
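
As an illustration of that multi-state idea, a classic left-to-right, no-skip topology for a single plosive might use three states (closure, burst, release), each with its own emission parameters. The transition probabilities below are invented purely for this sketch:

import numpy as np

# Hypothetical 3-state left-to-right, no-skip HMM topology for one plosive,
# e.g. /t/: closure (silence) -> burst (energy spike) -> release.
# Numbers are made up for illustration only.
A = np.array([
    # closure  burst  release
    [0.8,      0.2,   0.0],  # closure: self-loop or advance to the burst
    [0.0,      0.4,   0.6],  # burst: brief, so it advances quickly
    [0.0,      0.0,   1.0],  # release: self-loop (exit to the next phone omitted)
])

With only one state per plosive, a single emission distribution would have to cover both silence and the burst at once; the extra states let each sub-segment get its own distribution.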

Modern neural-network-based speech synthesisers are much more powerful approximators. So, the idea behind adding an extra state is to provide a placeholder for MAS to learn such dynamic variation and the transitions between sounds. Two states per phone seem to be a nice compromise: the model can learn these dynamic variations when needed, it can move almost directly to the next sound when it doesn't (some transitions don't need a gap between them), and it keeps fewer tensors on the GPU than the three states used in HMM-based synthesisers.

Hope this helps :)

shivammehta25 added the documentation (Improvements or additions to documentation) and question (Further information is requested) labels on Aug 25, 2024