Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization
Published on: arXiv:2604.03192v1
Type: Cross
Abstract
Demand for effective summarization has grown in recent years, particularly for low-resource languages. This article studies multi-teacher knowledge distillation for abstractive summarization from a reliability-aware perspective. We propose two novel mechanisms, EWAD (Entropy Weighted Agreement Aware Distillation) and CPDP (Capacity Proportional Divergence Preservation), to improve the summarization process.
Key Mechanisms
- EWAD: This token-level mechanism routes supervision between teacher distillation and gold supervision according to inter-teacher agreement.
- CPDP: This mechanism imposes a geometric constraint on the student model’s position, ensuring alignment with heterogeneous teachers.
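The token-level routing described for EWAD can be illustrated with a small sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function name `ewad_token_loss`, the inverse-entropy teacher weights, the use of a weighted Jensen-Shannon divergence as the agreement measure, and the linear gate between the distillation and gold losses. It is not the authors' implementation.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def ewad_token_loss(teacher_probs, student_probs, gold_index):
    """Hypothetical per-token loss that routes between KD and gold supervision."""
    # Assumed entropy weighting: lower-entropy (more confident) teachers count more.
    raw = [math.exp(-entropy(p)) for p in teacher_probs]
    total = sum(raw)
    weights = [r / total for r in raw]
    mixture = [sum(w * p[i] for w, p in zip(weights, teacher_probs))
               for i in range(len(student_probs))]
    # Assumed agreement score: 1 minus the weighted Jensen-Shannon divergence,
    # normalized by its upper bound log(K) for K teachers.
    if len(teacher_probs) > 1:
        jsd = sum(w * kl(p, mixture) for w, p in zip(weights, teacher_probs))
        gate = 1.0 - jsd / math.log(len(teacher_probs))
    else:
        gate = 1.0
    gate = min(1.0, max(0.0, gate))
    kd_loss = kl(mixture, student_probs)                           # follow the teachers
    gold_loss = -math.log(max(student_probs[gold_index], 1e-12))   # follow the reference
    return gate, gate * kd_loss + (1.0 - gate) * gold_loss
```

Under these assumptions, tokens on which the teachers agree are supervised mostly by the teacher mixture, and tokens on which they disagree fall back to the gold token.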
Research Findings
Our experiments used two Bangla datasets and comprised 13 BanglaT5 ablations and eight Qwen2.5 experiments. The findings reveal several critical insights:
- Logit-level knowledge distillation (KD) yields the most reliable performance improvements.
- More sophisticated distillation approaches enhance semantic similarity in short summaries but tend to degrade the quality of longer outputs.
- Cross-lingual pseudo-labeling KD, applied across ten languages, retained 71-122% of the teacher’s ROUGE-L scores at 3.2x compression (values above 100% mean the student outscored the teacher).
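For readers unfamiliar with logit-level KD, the sketch below shows the widely used temperature-softened formulation for a single token. The abstract does not state the exact loss used in the experiments, so the function name `logit_kd_loss` and the `temperature` and `alpha` defaults are illustrative assumptions rather than the paper's settings.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, gold_index,
                  temperature=2.0, alpha=0.5):
    """Per-token loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL between the temperature-softened teacher and student distributions.
    kd = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Ordinary cross-entropy against the gold token at temperature 1.
    ce = -math.log(softmax(student_logits)[gold_index])
    # The T^2 factor keeps the KD gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kd + (1.0 - alpha) * ce
```

With `alpha=1.0` the loss reduces to pure distillation and is zero when the student's logits match the teacher's. With `alpha=0.0` it reduces to standard cross-entropy training.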
Evaluation Insights
To ensure the robustness of our findings, we conducted a human-validated multi-judge evaluation of large language model (LLM) outputs. This evaluation highlighted a significant calibration bias within single-judge assessment pipelines, suggesting that a multi-judge approach may provide more reliable evaluations.
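One way to picture the calibration issue is to compare each judge's mean signed error against human ratings with that of a panel average. The sketch below assumes numeric scores on a shared scale. The function names and all scores are hypothetical and are not taken from the paper.

```python
from statistics import mean

def calibration_bias(judge_scores, human_scores):
    """Mean signed difference between judge and human scores (positive = lenient)."""
    return mean(j - h for j, h in zip(judge_scores, human_scores))

def panel_scores(scores_by_judge):
    """Average the judges' scores for each summary."""
    return [mean(column) for column in zip(*scores_by_judge)]

# Hypothetical 1-5 ratings for four summaries.
human = [3, 4, 2, 5]
judges = [
    [4, 5, 3, 5],  # a lenient judge
    [3, 3, 2, 4],  # a strict judge
    [3, 4, 2, 5],  # a judge that matches the humans
]

for index, scores in enumerate(judges):
    print(f"judge {index}: bias = {calibration_bias(scores, human):+.2f}")
print(f"panel: bias = {calibration_bias(panel_scores(judges), human):+.2f}")
```

In this toy data the lenient and strict judges partly cancel, so the panel's bias is smaller in magnitude than either biased judge's. Averaging cannot remove a bias that all judges share, which is why validation against human ratings still matters.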
Conclusion
The results of our study underscore the importance of reliability-aware distillation approaches in enhancing low-resource abstractive summarization. By characterizing the conditions under which multi-teacher supervision improves summarization quality, we provide valuable insights for future research. Additionally, our findings indicate that in some circumstances, scaling data may outweigh the benefits gained from loss engineering.
Future Directions
As the field of natural language processing continues to evolve, the integration of reliability-aware mechanisms in summarization tasks could pave the way for more effective models, particularly for low-resource languages. Future work may explore further refinements to the EWAD and CPDP mechanisms, as well as their applicability to other languages and summarization contexts.
