Neural Networks for Text-to-Speech Evaluation
Summary: arXiv:2604.08562v1
Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings.
Introduction
Evaluating TTS systems is essential for judging how usable they are in real-world applications. Human listening tests are reliable, but their cost, turnaround time, and susceptibility to assessor bias motivate automated alternatives. This research pursues such alternatives using neural network architectures.
Methodology
For relative assessment, we propose NeuralSBS, a model built on HuBERT that reaches 73.7% accuracy on the SOMOS dataset. For absolute assessment, we introduce enhancements to MOSNet, using custom sequence-length batching to improve performance. Additionally, we present WhisperBert, a multimodal stacking ensemble that combines Whisper audio features with BERT textual embeddings through weak learners.
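The source does not spell out how the sequence-length batching is implemented. The following is a minimal sketch of one common interpretation: sort utterances by length and slice the sorted order into batches, so each batch needs little padding. The function names and frame counts are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def length_bucketed_batches(lengths: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group utterance indices so each batch holds utterances of similar length."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort utterance indices by length (in frames), shortest first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Slice the sorted order into consecutive batches.
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_frames(lengths: Sequence[int], batches: List[List[int]]) -> int:
    """Count padding frames when each batch is padded to its longest utterance."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Hypothetical frame counts for six utterances (assumption, not paper data).
lengths = [120, 480, 130, 500, 125, 490]
naive = [[0, 1, 2], [3, 4, 5]]  # batches taken in dataset order
bucketed = length_bucketed_batches(lengths, batch_size=3)

print(padded_frames(lengths, naive))     # 1095 padding frames
print(padded_frames(lengths, bucketed))  # 45 padding frames
```

In training, the resulting batches would typically be shuffled each epoch so the model does not see utterances in strict length order.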
Results
Our best models for Mean Opinion Score evaluation achieve a Root Mean Square Error (RMSE) of approximately 0.40. Since lower RMSE is better, this improves on the human inter-rater RMSE baseline of 0.62. Our ablation studies further show how the choice of fusion strategy affects model performance.
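For reference, RMSE is the square root of the mean squared difference between predicted and reference scores. The sketch below shows the computation on hypothetical MOS values; the numbers are illustrative only and are not results from the paper.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between predicted and reference scores."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical MOS values on a 1-5 scale (assumption, not paper data).
predicted_mos = [4.0, 2.0]
reference_mos = [3.0, 3.0]
print(rmse(predicted_mos, reference_mos))  # 1.0
```

The same function applies to the inter-rater baseline if one rater's scores are treated as the prediction and the reference scores as the target, although the paper's exact protocol for that baseline is not described here.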
Ablation Studies
In our ablation studies, we found that naively fusing text via cross-attention can degrade performance. In these experiments, ensemble-based stacking outperformed direct latent fusion, a finding that should inform the design of future TTS evaluation frameworks.
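To make the contrast concrete: stacking keeps each modality's model separate and trains a small meta-learner only on their predictions, rather than mixing latent representations. The sketch below is a deliberately simplified stand-in for the paper's approach, not its actual method. It blends two hypothetical base predictors (one audio-based, one text-based) with a single weight fitted in closed form by least squares on held-out predictions. All names and values are assumptions.

```python
from typing import List, Sequence


def fit_blend_weight(pred_a: Sequence[float], pred_b: Sequence[float],
                     target: Sequence[float]) -> float:
    """Least-squares weight w for the blend w * pred_a + (1 - w) * pred_b."""
    numerator = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, target))
    denominator = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if denominator == 0.0:
        # Base predictions are identical, so every weight gives the same blend.
        return 0.5
    return numerator / denominator


def blend(pred_a: Sequence[float], pred_b: Sequence[float], w: float) -> List[float]:
    """Combine two sets of base-model predictions with weight w."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical held-out predictions and human MOS labels (not paper data).
audio_pred = [4.0, 2.0, 3.0]   # e.g. a model on audio features
text_pred = [2.0, 4.0, 3.0]    # e.g. a model on text embeddings
human_mos = [3.5, 2.5, 3.0]

w = fit_blend_weight(audio_pred, text_pred, human_mos)
print(w)                                # 0.75
print(blend(audio_pred, text_pred, w))  # [3.5, 2.5, 3.0]
```

A fuller stacking setup would fit the meta-learner on out-of-fold predictions from several weak learners, so that the base models never score utterances they were trained on.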
Negative Results
We also experimented with SpeechLM-based architectures and zero-shot LLM evaluators, such as Qwen2-Audio and Gemini 2.5 Flash Preview. These approaches yielded negative results, which reinforces the need for dedicated metric learning frameworks tailored specifically to TTS assessment.
Conclusion
This study is a step forward in the automated evaluation of TTS systems. Neural models can approximate expert judgments in both relative (SBS) and absolute (MOS) settings, with MOS prediction error below the human inter-rater baseline. This makes it possible to streamline evaluation while maintaining quality standards, which can benefit the development and deployment of TTS technologies.
Future Work
Looking ahead, further exploration of architectures and evaluation frameworks is needed. Continued refinement of ensemble methods and the integration of additional features should lead to more robust and reliable TTS evaluation.
