Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss
arXiv:2604.12911v1
Abstract
Multilingual benchmarks guide the development of frontier models. Yet the multilingual evaluations reported for frontier models are structured like popular reasoning and knowledge benchmarks, only repeated across many languages. We demonstrate that such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency.
Introduction
Multilingual models have attracted substantial attention in recent years, but the benchmarks used to evaluate them have significant shortcomings. Most multilingual evaluations mirror the structure of reasoning and knowledge assessments, so they reward mathematical reasoning and factual recall rather than the ability to understand and generate text in each language. As a result, they give an incomplete picture of a model’s true multilingual capabilities.
Key Findings
Our research highlights several critical findings regarding the limitations of existing multilingual benchmarks:
- Performance Discrepancies: Thinking variants of models significantly outperform instruct variants on these benchmarks. However, this advantage does not carry over to real-world multilingual tasks, such as those rated by users on LMArena.
- Semantic Gaps: The discrepancies in performance indicate that while models may excel in scripted evaluations, they often struggle with the nuances of actual language use.
- Need for Better Evaluation Methods: Current benchmarks do not adequately assess a model’s multilingual proficiency, highlighting the need for innovative evaluation techniques.
Proposed Solution: Round-Trip Translation
To address these challenges, we propose a straightforward yet effective alternative: round-trip translation. This method involves taking a text in a source language, translating it to a target language, and then translating it back to the original language. The semantic gaps identified between the original text and the resulting translation reveal areas where the model’s multilingual generation capabilities may be lacking.
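The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LiT implementation: the `translate` callable, its `(text, source, target)` signature, the toy dictionary translator, and the `surface_similarity` score are all hypothetical stand-ins. In particular, a word-level `difflib` ratio is only a crude proxy for the semantic comparison the method relies on.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical interface: translate(text, source_lang, target_lang) -> str.
# In practice this would wrap the model under test.
Translator = Callable[[str, str, str], str]


def round_trip(text: str, source: str, target: str, translate: Translator) -> str:
    """Translate source -> target, then target -> source, and return the result."""
    forward = translate(text, source, target)
    return translate(forward, target, source)


def surface_similarity(original: str, back_translated: str) -> float:
    """Word-level similarity in [0, 1]; a crude stand-in for a semantic comparison."""
    a = original.lower().split()
    b = back_translated.lower().split()
    return SequenceMatcher(None, a, b).ratio()


# Toy, deliberately lossy translator used only to make the sketch runnable:
# "large" and "big" both map to "grand", which maps back to "big".
_EN_FR = {"the": "le", "big": "grand", "large": "grand", "cat": "chat"}
_FR_EN = {"le": "the", "grand": "big", "chat": "cat"}


def toy_translate(text: str, source: str, target: str) -> str:
    table = _EN_FR if (source, target) == ("en", "fr") else _FR_EN
    return " ".join(table.get(word, word) for word in text.lower().split())


if __name__ == "__main__":
    original = "the large cat"
    back = round_trip(original, "en", "fr", toy_translate)
    print(back, surface_similarity(original, back))
```

A score below 1.0 flags that something changed during the round trip. Swapping `toy_translate` for a call to the model being evaluated, and `surface_similarity` for a stronger semantic comparison, keeps the same overall structure.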
Our findings indicate that round-trip translation correlates almost perfectly with user ratings on LMArena, achieving a correlation coefficient of 0.94. This method requires no human reference translations and does not necessitate a more capable multilingual judge than the tested models themselves, making it a practical solution for evaluating multilingual capabilities.
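To make the correlation claim concrete, the sketch below shows how per-model round-trip scores could be compared against per-model user ratings. It is an assumption-laden illustration: the text above does not state which correlation coefficient was used, so Pearson's r is chosen here only as an example, and the function name `pearson` and any numbers passed to it are hypothetical placeholders rather than results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists.

    Assumption: Pearson's r is used for illustration only; the source
    does not specify the coefficient behind the reported 0.94.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up placeholder values, one entry per model; NOT data from the paper.
    round_trip_scores = [0.2, 0.4, 0.6, 0.8]
    user_ratings = [1000.0, 1100.0, 1200.0, 1300.0]
    print(pearson(round_trip_scores, user_ratings))
```

Each list holds one value per evaluated model, in the same model order. A rank-based coefficient such as Spearman's could be substituted if only the ordering of models matters.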
Introducing Lost in Translation (LiT)
As part of our research, we introduce the Lost in Translation (LiT) benchmark, a challenging round-trip translation benchmark that spans widely spoken languages around the globe. LiT is designed to provide a more realistic evaluation of multilingual frontier models, addressing the shortcomings of existing benchmarks and offering a comprehensive assessment of a model’s true multilingual proficiency.
Conclusion
Existing multilingual benchmarks have been instrumental in guiding the development of frontier models, but they fall short of accurately measuring multilingual proficiency. Round-trip translation and the LiT benchmark offer a practical alternative that evaluates how well models actually generate language across languages, so that reported multilingual results better reflect real-world use.
