How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
arXiv:2511.03295v3
The automatic evaluation of speech translation (ST) systems has traditionally relied on comparing translation hypotheses with one or more reference translations. While this method is somewhat effective, it carries the inherent limitation of reference-based evaluation, which overlooks valuable information present in the source input. Recent advancements in machine translation (MT) have demonstrated that neural metrics that incorporate the source text achieve a stronger correlation with human judgments. However, extending this concept to speech translation is challenging due to the audio nature of the source, where reliable transcripts or alignments between the source and references are often unavailable.
Research Overview
In this article, we present the first systematic study of source-aware metrics applied to speech translation, focusing on real-world operating conditions in which source transcripts are often unavailable. We explore two complementary strategies for generating textual proxies of the input audio:
- ASR Transcripts: Automatic Speech Recognition (ASR) systems convert spoken language into written text, providing a potential source representation.
- Back-Translations: This method involves translating the reference translation back into the source language, thereby generating a synthetic source.
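To tackle the alignment mismatches between these synthetic sources and reference translations, we introduce a novel two-step cross-lingual re-segmentation algorithm. Source-aware metrics score each hypothesis together with a source segment and a reference segment, so the synthetic source must be split into segments that line up with the reference segments before the metric can be applied reliably.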
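The alignment problem can be illustrated with a deliberately naive sketch. The function below is not the two-step cross-lingual algorithm introduced in this work; it is a hypothetical length-proportional baseline (the name `resegment_by_length` and the token-count heuristic are assumptions made for illustration) that splits an unsegmented synthetic source into as many chunks as there are reference segments.

```python
from typing import List


def resegment_by_length(source_text: str, reference_segments: List[str]) -> List[str]:
    """Split `source_text` into len(reference_segments) chunks.

    Hypothetical baseline: chunk sizes are proportional to the whitespace
    token counts of the reference segments. It ignores word meaning and
    cross-lingual length differences, which a real aligner must handle.
    """
    source_tokens = source_text.split()
    ref_lengths = [len(segment.split()) for segment in reference_segments]
    total_ref = sum(ref_lengths)
    if total_ref == 0:
        raise ValueError("reference segments contain no tokens")

    chunks: List[str] = []
    start = 0
    cumulative = 0
    for length in ref_lengths:
        cumulative += length
        # Proportional boundary in the source, rounded half up with integers.
        end = (cumulative * len(source_tokens) + total_ref // 2) // total_ref
        chunks.append(" ".join(source_tokens[start:end]))
        start = end
    return chunks


if __name__ == "__main__":
    # Invented example: an unsegmented German ASR output and English references.
    asr_output = "guten morgen zusammen heute sprechen wir über maschinelle übersetzung"
    references = [
        "good morning everyone",
        "today we talk about machine translation",
    ]
    for src, ref in zip(resegment_by_length(asr_output, references), references):
        print(f"{src!r} <-> {ref!r}")
```

A purely length-based split like this drifts quickly when the two languages differ in verbosity or when the ASR output contains insertions and deletions, which is the kind of mismatch a dedicated cross-lingual re-segmentation step is meant to absorb.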
Experimental Findings
Our experiments were conducted on two distinct ST benchmarks, encompassing 79 language pairs and six ST systems characterized by a variety of architectures and performance levels. The results indicate that:
- ASR transcripts serve as a more reliable synthetic source than back-translations when the word error rate is below 20%.
- Back-translations, although slightly less reliable, present a computationally cheaper yet effective alternative for evaluation.
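To make the 20% threshold concrete, the sketch below computes word error rate with a standard word-level edit distance and then applies a simple selection rule. It assumes that gold transcripts exist for a small calibration set, so that the ASR system's WER can be estimated at all. The function names, the `threshold` parameter, and the fallback to back-translation at or above the threshold are illustrative assumptions, not a procedure prescribed by the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


def choose_source_proxy(wer: float, threshold: float = 0.20) -> str:
    """Assumed rule: use ASR below the threshold, back-translation otherwise."""
    return "asr" if wer < threshold else "back_translation"


if __name__ == "__main__":
    # Invented calibration pair: one deleted word out of six.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER = {wer:.3f}, proxy = {choose_source_proxy(wer)}")
```

In practice the WER would be averaged over a calibration set rather than a single utterance, and the threshold could be tuned per language pair.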
These findings are further validated through experiments on a low-resource language pair, specifically Bemba-English, and by direct comparison against human quality judgments. The robustness and applicability of our approach highlight the potential for improved evaluation methodologies in the domain of speech translation.
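Agreement between a metric and human quality judgments is commonly summarized with a rank correlation. The sketch below implements Kendall's tau-a from scratch. Whether the study reports this particular statistic is not stated in this summary, and the scores in the usage example are invented purely for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("at least two scored items are required")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether both rankings agree on the pair.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Invented scores for four translated segments.
    metric = [0.10, 0.40, 0.35, 0.80]
    human = [1, 3, 2, 4]
    print(kendall_tau_a(metric, human))
```

A value close to 1 means the metric orders outputs the way human annotators do, a value close to -1 means it reverses that order, and a value near 0 means the two rankings are unrelated.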
Conclusion
Our proposed cross-lingual re-segmentation algorithm enables the robust application of source-aware MT metrics to speech translation evaluation and lays the groundwork for more accurate and principled evaluation methodologies. As the field of speech translation continues to evolve, integrating these metrics can support both higher-quality translations and a better understanding of machine-generated outputs.
