
the motivation for inserting blank IDs between the input IPA-ids? #94

Open
dbkest opened this issue Aug 25, 2024 · 1 comment

dbkest commented Aug 25, 2024

Hello, could you please help me understand the motivation for inserting blank IDs between the input IPA IDs? The implementation can be found in text_mel_datamodule.py, line 216:

def get_text(self, text, add_blank=True):
    # Normalise the text and convert it to a sequence of symbol IDs
    text_norm, cleaned_text = text_to_sequence(text, self.cleaners)
    if self.add_blank:  # True by default
        # Insert the blank ID (0) between, before, and after every symbol ID
        text_norm = intersperse(text_norm, 0)
    text_norm = torch.IntTensor(text_norm)
    return text_norm, cleaned_text
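
For reference, a minimal sketch of what intersperse does here (this follows the usual Glow-TTS-style helper; the actual implementation in the repo may differ slightly): the blank ID 0 is placed between and around every token.

def intersperse(lst, item):
    # Allocate 2*len(lst) + 1 slots filled with the blank item, then put
    # the original tokens at the odd indices: [item, x0, item, x1, ..., item]
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

For example, intersperse([5, 23, 17], 0) returns [0, 5, 0, 23, 0, 17, 0], so a sequence of n tokens becomes 2n + 1 IDs.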

thanks.

shivammehta25 (Owner) commented

Hello, that is a great question!

TL;DR:
The idea comes from using multiple states per phone in Hidden Markov Model (HMM) based speech synthesisers for better modelling. [Our previous works Neural-HMM and OverFlow also used this.] Since Monotonic Alignment Search (MAS), introduced in Glow-TTS, is a Viterbi approximation to the forward algorithm, the idea has its roots in the same literature: you can use multiple states to model the transition between different sounds.
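
To make the MAS connection concrete, here is a minimal NumPy sketch of the Viterbi-style dynamic programme MAS solves (simplified from the Glow-TTS formulation; the real implementation is a vectorised/Cython kernel): it finds the monotonic, no-skip alignment between text tokens and mel frames that maximises total log-likelihood, and every token, including each interspersed blank, receives at least one frame.

import numpy as np

def monotonic_alignment_search(log_probs):
    # log_probs[i, j]: log-likelihood of mel frame j under text token i
    T_text, T_mel = log_probs.shape
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]  # keep emitting the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance one token (no skips)
            Q[i, j] = max(stay, move) + log_probs[i, j]
    # Backtrack the best monotonic path
    path = np.zeros((T_text, T_mel), dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, 0, -1):
        path[i, j] = 1
        if i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    path[i, 0] = 1  # for a valid alignment, i has reached 0 here
    return path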

More details:
In the days of Statistical Parametric Speech Synthesis (SPSS) (you can read more about it here, in section 2.2 right below equation 2.28), people used multiple states to model each phoneme. They found it beneficial to model certain dynamic features with more states, which was especially useful for sounds such as plosives (in English: p, t, k, b, d, g), where you have silence, a sudden burst of energy, and then silence again. These were hard to model for a left-to-right algorithm with no skips (like MAS) without multiple states representing them, since each state had its own emission parameters.
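
As an illustration of that multi-state idea, a classic left-to-right, no-skip topology for a single plosive might use three states (closure, burst, release), each with its own emission parameters. The transition probabilities below are invented purely for this sketch:

import numpy as np

# Hypothetical 3-state left-to-right, no-skip HMM topology for one plosive,
# e.g. /t/: closure (silence) -> burst (energy spike) -> release.
# Numbers are made up for illustration only.
A = np.array([
    # closure  burst  release
    [0.8,      0.2,   0.0],  # closure: self-loop or advance to the burst
    [0.0,      0.4,   0.6],  # burst: brief, so it advances quickly
    [0.0,      0.0,   1.0],  # release: self-loop (exit to the next phone omitted)
])

With only one state per plosive, a single emission distribution would have to cover both silence and the burst at once; the extra states let each sub-segment get its own distribution.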

Modern neural-network-based speech synthesisers are much more powerful approximators. So, the idea behind adding an extra state is to provide a placeholder for MAS to learn such dynamic variation and the transitions between sounds. Two states per phone seem to be a nice compromise: the model can learn these dynamic variations when needed, it can move almost directly to the next sound when it doesn't (some transitions don't need a gap between them), and it keeps fewer tensors on the GPU than the three states used in HMM-based synthesisers.

Hope this helps :)

shivammehta25 added the documentation (Improvements or additions to documentation) and question (Further information is requested) labels on Aug 25, 2024