Automatic Mean Opinion Score Estimation with Temporal Modulation Features on Gammatone Filterbank for Speech Assessment
Quoc-Huy Nguyen, Kai Li, Masashi Unoki
The mean opinion score (MOS) obtained from listening tests is a key component of speech quality evaluation. However, because subjective tests are too costly to conduct on a large scale, the MOS must be estimated objectively. Thus far, the features used in existing methods for automatic MOS prediction have not been based on human perception of speech. In this paper, we propose an automatic MOS estimation method that uses temporal modulation features on the gammatone filterbank to improve the correlation of the predicted MOS with human perception. We evaluated our method using utterance-level and system-level mean squared errors (MSEs) and Spearman rank correlation coefficients (SRCCs). Compared with the baseline method of the VoiceMOS challenge, the proposed method performed better on both utterance-level metrics and on system-level SRCC. It also showed a significant improvement for utterances with low MOS values.
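For readers unfamiliar with the two evaluation levels, the following is a minimal sketch of how utterance-level and system-level MSE and SRCC might be computed. It assumes that system-level scores are obtained by averaging the true and predicted MOS over all utterances of each system, which is a common convention but is not spelled out in this abstract. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srcc(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def system_means(system_ids, scores):
    """Average the scores of all utterances belonging to each system."""
    grouped = defaultdict(list)
    for sys_id, score in zip(system_ids, scores):
        grouped[sys_id].append(score)
    return {s: sum(v) / len(v) for s, v in grouped.items()}


def evaluate(system_ids, y_true, y_pred):
    """Return utterance-level and system-level MSE and SRCC."""
    true_sys = system_means(system_ids, y_true)
    pred_sys = system_means(system_ids, y_pred)
    systems = sorted(true_sys)
    t = [true_sys[s] for s in systems]
    p = [pred_sys[s] for s in systems]
    return {
        "utt_mse": mse(y_true, y_pred),
        "utt_srcc": srcc(y_true, y_pred),
        "sys_mse": mse(t, p),
        "sys_srcc": srcc(t, p),
    }


# Hypothetical toy data: three systems with two utterances each.
system_ids = ["A", "A", "B", "B", "C", "C"]
true_mos = [1.5, 2.0, 3.0, 3.5, 4.5, 4.0]
pred_mos = [2.0, 2.0, 3.0, 3.0, 4.0, 4.5]
scores = evaluate(system_ids, true_mos, pred_mos)
```

Under this assumption, system-level metrics are usually more forgiving than utterance-level ones, because per-utterance prediction errors partly cancel when averaged within a system.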