Diversity-Aware Reverse KL Divergence for LLM Distillation

Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

Summary of arXiv:2604.00223v1 (cross-listed)

Abstract

Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation. It consistently outperforms forward KL (FKL), particularly when the vocabulary is large and the capacity gap between teacher and student is significant. RKL is favored because it focuses learning on the teacher's dominant modes rather than enforcing dense alignment. However, RKL has a structural limitation that drives the student toward overconfident predictions.
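
For reference, the two objectives differ only in the direction of the divergence. Writing p for the teacher's next-token distribution and q_θ for the student's over a vocabulary V (notation assumed here, not taken from the paper, with conditioning on the context omitted for brevity), the standard definitions are:

```latex
\mathrm{FKL}(p \,\|\, q_\theta) = \sum_{y \in V} p(y) \log \frac{p(y)}{q_\theta(y)},
\qquad
\mathrm{RKL}(q_\theta \,\|\, p) = \sum_{y \in V} q_\theta(y) \log \frac{q_\theta(y)}{p(y)}.
```

Each RKL term is weighted by q_θ(y), so tokens to which the student assigns little mass contribute little to the loss. This is the usual explanation for its mode-seeking behavior. FKL weights each term by p(y) instead, and so penalizes the student wherever the teacher places mass.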

Introduction

In this article, we explore the implications of using RKL in LLM distillation. We provide an analysis of RKL by decomposing its gradients into target and non-target components. Our analysis reveals that non-target gradients consistently push the target logit upward, even when the student already matches the teacher, which ultimately reduces output diversity.
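
To make the gradient discussion concrete, the sketch below computes the gradient of RKL with respect to the student logits using the standard closed form for a softmax student, q_j (log(q_j / p_j) - RKL), and checks it against finite differences. The toy teacher distribution, the logits, and all function names are illustrative assumptions. The sketch covers only the total gradient; the paper's split into target and non-target components is not reproduced, since this summary does not give its exact form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_probs):
    """RKL(q || p) with q = softmax(student_logits) and p = teacher_probs."""
    q = softmax(student_logits)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, teacher_probs))

def reverse_kl_grad(student_logits, teacher_probs):
    """Closed-form gradient of RKL w.r.t. each student logit:
    dRKL/dz_j = q_j * (log(q_j / p_j) - RKL)."""
    q = softmax(student_logits)
    rkl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, teacher_probs))
    return [qi * (math.log(qi / pi) - rkl) for qi, pi in zip(q, teacher_probs)]

def numeric_grad(student_logits, teacher_probs, eps=1e-6):
    """Central finite-difference gradient, used only as a sanity check."""
    grads = []
    for j in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[j] += eps
        down[j] -= eps
        diff = reverse_kl(up, teacher_probs) - reverse_kl(down, teacher_probs)
        grads.append(diff / (2 * eps))
    return grads

# Toy values chosen for illustration only (not from the paper).
teacher_probs = [0.6, 0.2, 0.1, 0.1]
student_logits = [1.0, 0.5, -0.5, -1.0]

analytic = reverse_kl_grad(student_logits, teacher_probs)
numeric = numeric_grad(student_logits, teacher_probs)

for j, (a, n) in enumerate(zip(analytic, numeric)):
    print(f"logit {j}: analytic={a:+.6f} numeric={n:+.6f}")
```

The per-logit gradients sum to zero, as they must for any loss that depends on the logits only through a softmax. Any decomposition of this gradient into components therefore redistributes a fixed total across the logits.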

Challenges with RKL

Despite the advantages of RKL, our findings indicate several challenges that need to be addressed:

Overconfidence in Predictions: RKL’s structure tends to lead students to make overly confident predictions, which can compromise the quality of generated outputs.
Weak Supervision: The method provides inadequate supervision for non-target classes, resulting in poor tail alignment and reduced overall model performance.
Reduced Diversity: The gradient behavior associated with RKL can lead to a lack of diversity in the outputs generated by the student model.
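
The weak penalty RKL places on tail mismatch can be seen in a small numerical sketch. The two distributions below are made up for illustration and do not come from the paper: the student is deliberately overconfident, concentrating almost all of its mass on the teacher's top token. Shannon entropy is used here as a rough proxy for output diversity, which is an assumption of this example rather than the paper's metric.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions given as probability lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def entropy(dist):
    """Shannon entropy in nats; used here as a simple diversity proxy."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Illustrative distributions over a four-token vocabulary (hypothetical values).
teacher = [0.60, 0.20, 0.10, 0.10]
student = [0.97, 0.01, 0.01, 0.01]  # overconfident: the tail is almost ignored

rkl_value = kl(student, teacher)  # reverse KL: weighted by the student's mass
fkl_value = kl(teacher, student)  # forward KL: weighted by the teacher's mass

print(f"RKL(student || teacher) = {rkl_value:.3f}")
print(f"FKL(teacher || student) = {fkl_value:.3f}")
print(f"entropy(teacher) = {entropy(teacher):.3f}")
print(f"entropy(student) = {entropy(student):.3f}")
```

In this toy case the overconfident student incurs a noticeably smaller RKL than FKL, even though its entropy is far below the teacher's. The tail tokens carry little student mass, so their mismatch barely registers under RKL.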

Proposed Solution: Diversity-aware RKL (DRKL)

To tackle the identified issues, we propose a novel approach named Diversity-aware RKL (DRKL). This method aims to:

Eliminate Gradient Effects: DRKL removes the non-target gradient effect that pushes the target logit upward under RKL, allowing for more balanced learning.
Enhance Non-target Supervision: By strengthening supervision over non-target classes, DRKL fosters better model alignment and performance.
Maintain Optimization Benefits: While addressing the shortcomings of RKL, DRKL preserves its optimization advantages, ensuring a robust learning process.

Experimental Results

We conducted extensive experiments across various datasets and model families to evaluate the performance of DRKL. The results demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives. The key findings include:

Improved Performance: DRKL scored higher than FKL, RKL, and the other distillation objectives it was compared against.
Superior Fidelity-Diversity Trade-off: The proposed method offers a better balance between fidelity and diversity in generated outputs.

Conclusion

In conclusion, the Diversity-aware Reverse Kullback-Leibler Divergence (DRKL) presents a significant advancement in large language model distillation. By addressing the limitations of RKL, DRKL not only enhances model performance but also ensures better output diversity, paving the way for more effective and reliable language models.
