MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
Full-Duplex Speech Language Models (FD-SLMs) enable real-time conversational interaction in which both parties can speak at the same time, including overlapping speech. This makes conversations feel more natural than with traditional half-duplex systems, where only one party can speak at a time. Despite this promise, current evaluation benchmarks focus primarily on single-round interactions and largely overlook the complexities of multi-round conversation.
The preprint arXiv:2511.10262v3 addresses these shortcomings by introducing MTR-DuplexBench, a benchmark that provides a comprehensive evaluation framework for FD-SLMs specifically in multi-round conversational settings. Its goal is to help researchers understand and improve FD-SLM performance in more complex and realistic conversational scenarios.
Challenges in Evaluating FD-SLMs
Evaluating FD-SLMs in multi-round contexts presents several challenges:
- Blurred Turn Boundaries: In natural conversations, speakers often overlap, making it difficult to determine clear turn boundaries.
- Context Inconsistency: Maintaining context over multiple rounds is difficult, because information from earlier rounds can be misinterpreted or lost.
- Narrow Evaluation Focus: Existing benchmarks tend to concentrate only on conversational features, ignoring other critical dimensions such as dialogue quality and safety.
Introducing MTR-DuplexBench
MTR-DuplexBench addresses these gaps by offering a structured approach to evaluating FD-SLMs through the following features:
- Segmented Dialogue Assessment: The benchmark divides continuous full-duplex dialogues into discrete turns, allowing for a more granular, turn-by-turn evaluation.
- Multi-Dimensional Evaluation: The benchmark evaluates models along four dimensions:
  - Conversational Features
  - Dialogue Quality
  - Instruction Following
  - Safety
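To make the idea of segmenting a continuous full-duplex dialogue into discrete turns concrete, here is a minimal Python sketch. It is not the benchmark's actual segmentation procedure. It assumes that voice-activity segments with timestamps are already available for each speaker and that every user segment opens a new turn, which is a simplification (a real system would need to treat backchannels differently). The names Segment, Turn, segment_into_turns, and has_overlap are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    speaker: str  # "user" or "assistant"
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class Turn:
    index: int
    user: Segment
    assistant: List[Segment] = field(default_factory=list)


def segment_into_turns(segments: List[Segment]) -> List[Turn]:
    """Group speech segments into turns (assumption: each user segment opens a turn)."""
    turns: List[Turn] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker == "user":
            turns.append(Turn(index=len(turns), user=seg))
        elif turns:
            # Assistant speech is attached to the most recently opened turn.
            turns[-1].assistant.append(seg)
        # Assistant speech before the first user segment is ignored here.
    return turns


def has_overlap(turn: Turn) -> bool:
    """True if the assistant starts speaking before the user has finished."""
    return any(a.start < turn.user.end for a in turn.assistant)


# Illustrative timestamps only, not data from the benchmark.
segments = [
    Segment("user", 0.0, 2.0),
    Segment("assistant", 1.5, 4.0),  # starts while the user is still talking
    Segment("user", 5.0, 6.5),
    Segment("assistant", 7.0, 9.0),
]
turns = segment_into_turns(segments)
for t in turns:
    print(t.index, len(t.assistant), has_overlap(t))
```

Once the dialogue is split this way, each turn can be scored on its own, and overlapping speech is recorded as a property of the turn rather than something that blurs its boundary.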
Experimental Insights
Experiments with MTR-DuplexBench indicate that current FD-SLMs struggle to maintain consistent performance across multiple rounds of conversation and across evaluation dimensions. This inconsistency underscores the need for a robust evaluation framework like MTR-DuplexBench, which supports comprehensive assessment and encourages the development of more capable FD-SLMs.
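One way to check consistency across rounds is to average scores per round within each dimension and compare the first and last rounds. The Python sketch below illustrates that idea under stated assumptions: the record layout (round index, dimension, score), the function names mean_by_round and drift, and all score values are hypothetical placeholders, not the benchmark's metrics or results.

```python
from collections import defaultdict
from statistics import mean


def mean_by_round(records):
    """Average scores per (dimension, round).

    records: iterable of (round_index, dimension, score) tuples.
    Returns {dimension: {round_index: mean_score}} with rounds in ascending order.
    """
    buckets = defaultdict(list)
    for round_index, dimension, score in records:
        buckets[(dimension, round_index)].append(score)

    table = defaultdict(dict)
    for (dimension, round_index), scores in buckets.items():
        table[dimension][round_index] = mean(scores)
    return {dim: dict(sorted(rounds.items())) for dim, rounds in table.items()}


def drift(per_round):
    """Last-round mean minus first-round mean; negative values mean degradation."""
    rounds = sorted(per_round)
    return per_round[rounds[-1]] - per_round[rounds[0]]


# Placeholder scores for illustration only.
records = [
    (1, "dialogue_quality", 4.0),
    (1, "dialogue_quality", 3.0),
    (2, "dialogue_quality", 3.0),
    (2, "dialogue_quality", 2.0),
    (1, "safety", 1.0),
    (2, "safety", 1.0),
]
table = mean_by_round(records)
for dimension, per_round in table.items():
    print(dimension, per_round, drift(per_round))
```

Reporting a per-dimension drift alongside the overall average makes it visible when a model holds up in one dimension but degrades in another as the conversation continues.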
Conclusion
MTR-DuplexBench is a significant step forward in the evaluation of Full-Duplex Speech Language Models. By addressing the complexities of multi-round conversations and broadening the evaluation criteria, it is positioned to support the development of more effective conversational AI systems. The code and data for MTR-DuplexBench are available on GitHub.
