From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
Speech-LLMs used for automatic speech recognition (ASR) can condition on earlier turns of a conversation. This creates a problem known as contextual exposure bias: a system trained on ideal (oracle) conversation histories is deployed in real-world scenarios where the history it receives is itself error-prone. A recent study (arXiv:2603.24034v1) proposes a unified training framework to mitigate this bias and make Speech-LLMs more robust in practical applications.
Understanding Contextual Exposure Bias
Contextual exposure bias appears at inference time, when the model must condition on an imperfect conversation history rather than the clean history it saw during training. This mismatch between training and testing conditions can degrade Speech-LLM performance, especially in dynamic and noisy contexts. Traditional training methods depend on oracle conversation history, which does not reflect the errors and variability encountered in real-world use.
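The mismatch can be made concrete with a small decoding loop. The Python sketch below is illustrative only: `decode_conversation`, the `recognize` callable, and the `oracle_refs` argument are hypothetical names, and strings stand in for audio. With `oracle_refs` supplied, each utterance is decoded with ground-truth history, as in oracle-history training; without it, the history is built from the model's own earlier outputs, so any recognition error is carried into later context.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical recognizer signature: (audio, history) -> transcript.
Recognizer = Callable[[str, List[str]], str]


def decode_conversation(
    utterances: Sequence[str],
    recognize: Recognizer,
    n_context: int = 2,
    oracle_refs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Decode utterances in order, passing the previous n_context transcripts.

    If oracle_refs is given, the history is taken from reference transcripts
    (the training-time condition). Otherwise it is taken from the outputs
    produced so far (predicted-history decoding, the deployment condition).
    """
    outputs: List[str] = []
    for i, audio in enumerate(utterances):
        source = oracle_refs if oracle_refs is not None else outputs
        history = list(source[max(0, i - n_context):i])
        outputs.append(recognize(audio, history))
    return outputs
```

A model that has only ever seen the first condition has no opportunity to learn how to discount errors in the second.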
Proposed Solutions
The study introduces three strategies to address contextual exposure bias:
- Teacher Error Knowledge: This approach utilizes Whisper large-v3 hypotheses as the training-time history, allowing models to learn from realistic, error-prone contexts.
- Context Dropout: This technique acts as a regularizer to prevent models from becoming overly reliant on the context provided, thereby enhancing their ability to perform under uncertain conditions.
- Direct Preference Optimization (DPO): Trained on curated failure cases, DPO teaches the model to prefer the better transcription over the worse one in the scenarios where it currently fails.
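The three strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `teacher_hyps` is assumed to be a list of precomputed Whisper large-v3 transcripts for one conversation, context dropout is assumed to discard the whole history at once, the values of `p_drop` and `beta` are placeholders, and `dpo_loss` is the standard DPO objective for a single preference pair rather than a confirmed detail of the paper.

```python
import math
import random
from typing import List, Optional, Sequence


def build_training_history(
    teacher_hyps: Sequence[str],
    index: int,
    n_context: int = 2,
    p_drop: float = 0.3,  # assumed value, not from the paper
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the training-time context for the utterance at `index`.

    The context comes from teacher hypotheses (error-prone) instead of
    reference transcripts. With probability p_drop the context is removed
    entirely, which is one simple form of context dropout.
    """
    rng = rng if rng is not None else random.Random()
    if rng.random() < p_drop:
        return []
    return list(teacher_hyps[max(0, index - n_context):index])


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of transcripts.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. The loss is -log(sigmoid(beta * margin)).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The default `n_context=2` mirrors the two-utterance history used in the reported experiments. The loss falls below log 2 once the policy favors the chosen transcript more strongly than the reference model does.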
Experimental Results
Experiments were conducted on TED-LIUM 3 (in-domain) and on LibriSpeech in a zero-shot setting (out-of-domain). Under predicted-history decoding, the proposed methods gave consistent improvements:
- With a two-utterance history as context, supervised fine-tuning (SFT) with Whisper hypotheses reduced word error rate (WER) from 5.59% (oracle-history training) to 5.47%.
- Adding DPO reduced WER further, to 5.17%.
- Under irrelevant-context attacks, the DPO model showed the smallest degradation, from 5.17% to 5.63%.
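To put these absolute figures in relative terms, the short Python calculation below converts the reported WERs into percentage changes. It uses only the numbers quoted in this section; the helper name `relative_change` is illustrative and does not come from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after` in percent (negative = reduction)."""
    return (after - before) / before * 100.0


# WERs (%) quoted above: predicted-history decoding, two-utterance context.
oracle_sft = 5.59    # trained on oracle history
whisper_sft = 5.47   # SFT with Whisper hypotheses as history
dpo = 5.17           # after DPO
dpo_attacked = 5.63  # DPO model under irrelevant-context attack

print(f"Whisper-history SFT vs oracle: {relative_change(oracle_sft, whisper_sft):+.2f}%")
print(f"DPO vs oracle:                 {relative_change(oracle_sft, dpo):+.2f}%")
print(f"DPO under attack vs DPO:       {relative_change(dpo, dpo_attacked):+.2f}%")
```

Relative to oracle-history training, this works out to a WER reduction of about 2.1% for Whisper-history SFT and about 7.5% for DPO, while the irrelevant-context attack raises the DPO model's WER by about 8.9% relative.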
Conclusion
The findings indicate that the proposed unified training framework mitigates contextual exposure bias and improves the robustness of Speech-LLMs. By narrowing the gap between the context seen in training and the context seen at inference, these methods make ASR systems more reliable in real-world applications. For readers who want to explore the research further, the code and models are publicly available on GitHub.
