Diversity-Aware Reverse KL Divergence for LLM Distillation

Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

Summary of arXiv:2604.00223v1 (announce type: cross)

Abstract

Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation. It has consistently outperformed forward KL (FKL), particularly with large vocabularies and significant teacher-student capacity mismatches, where RKL focuses learning on the teacher's dominant modes rather than enforcing dense alignment over the whole distribution. However, RKL has a structural limitation that drives the student toward overconfident predictions.

Introduction

In this article, we explore the implications of using RKL in LLM distillation. We analyze RKL by decomposing its gradients into target and non-target components. The analysis reveals that the non-target gradients consistently push the target logit upward, even when the student already matches the teacher, which ultimately reduces output diversity.
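One way to make this concrete is to split the RKL gradient on the target logit into a target term and a non-target term. The sketch below is only an illustration under stated assumptions: the function names (`rkl_grad`, `target_logit_split`), the toy distributions, and the particular split are hypothetical and may differ from the decomposition used in the paper. It reads "matches the teacher" as agreement on the target probability only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rkl(z, p):
    """Reverse KL, KL(q || p), with student q = softmax(z) and teacher p."""
    q = softmax(z)
    return float(np.sum(q * (np.log(q) - np.log(p))))

def rkl_grad(z, p):
    """Analytic gradient of RKL w.r.t. the student logits:
    dL/dz_k = q_k * (log(q_k / p_k) - L)."""
    q = softmax(z)
    log_ratio = np.log(q) - np.log(p)
    return q * (log_ratio - np.sum(q * log_ratio))

def target_logit_split(z, p, t):
    """Split dL/dz_t into a target term and a non-target term
    (an assumed split, chosen for illustration)."""
    q = softmax(z)
    log_ratio = np.log(q) - np.log(p)
    target_term = q[t] * (1.0 - q[t]) * log_ratio[t]
    mask = np.arange(len(q)) != t
    non_target_term = -q[t] * np.sum(q[mask] * log_ratio[mask])
    return target_term, non_target_term

# Hypothetical 3-token example: student and teacher agree on the target
# token (index 0) but disagree on how the remaining mass is spread.
teacher = np.array([0.6, 0.3, 0.1])
student_logits = np.log(np.array([0.6, 0.1, 0.3]))

target_term, non_target_term = target_logit_split(student_logits, teacher, t=0)
print("target term:    ", target_term)
print("non-target term:", non_target_term)
print("full gradient:  ", rkl_grad(student_logits, teacher)[0])
```

In this toy case the target term vanishes because the target probabilities agree, while the non-target term is negative. A negative gradient means a gradient-descent step raises the target logit, which is the behavior described above.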

Challenges with RKL

Despite the advantages of RKL, our findings indicate several challenges that need to be addressed:

  • Overconfidence in Predictions: RKL’s structure drives the student toward overly confident predictions, which can compromise the quality of generated outputs.
  • Weak Supervision: RKL supervises non-target classes only weakly, resulting in poor alignment with the tail of the teacher distribution and reduced overall model performance.
  • Reduced Diversity: Because the non-target gradients keep pushing the target logit upward, the outputs generated by the student model become less diverse.
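The weak tail supervision can be seen in a small numerical example. The sketch below compares forward and reverse KL for a toy teacher and a student that fits the dominant tokens but nearly drops the tail. The 4-token vocabulary and both distributions are made up for illustration and do not come from the paper.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for two strictly positive probability vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * (np.log(a) - np.log(b))))

# Hypothetical teacher: two dominant tokens, two tail tokens.
teacher = np.array([0.5, 0.3, 0.1, 0.1])
# Hypothetical student: fits the dominant tokens, nearly drops the tail.
student = np.array([0.62, 0.37, 0.005, 0.005])

fkl = kl(teacher, student)  # forward KL: terms weighted by the teacher
rkl = kl(student, teacher)  # reverse KL: terms weighted by the student

# Contribution of the tail tokens (indices 2 and 3) to each objective.
tail = [2, 3]
fkl_tail = float(np.sum(teacher[tail] * np.log(teacher[tail] / student[tail])))
rkl_tail = float(np.sum(student[tail] * np.log(student[tail] / teacher[tail])))

print(f"FKL = {fkl:.4f} (tail terms: {fkl_tail:.4f})")
print(f"RKL = {rkl:.4f} (tail terms: {rkl_tail:.4f})")
```

Each reverse-KL term is weighted by the student's own probability, so tokens the student has almost abandoned contribute very little to the loss. Forward KL weights the same tokens by the teacher's probability and penalizes the mismatch much more heavily.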

Proposed Solution: Diversity-aware RKL (DRKL)

To tackle the identified issues, we propose a novel approach named Diversity-aware RKL (DRKL). This method aims to:

  • Remove the Problematic Gradient Effect: DRKL removes the non-target gradient effect that inflates the target logit under RKL, allowing for more balanced learning.
  • Enhance Non-target Supervision: DRKL strengthens supervision over non-target classes, which fosters better alignment with the teacher and better performance.
  • Maintain Optimization Benefits: DRKL preserves the optimization advantages of RKL while addressing its shortcomings.

Experimental Results

We conducted extensive experiments across various datasets and model families to evaluate the performance of DRKL. The results demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives. The key findings include:

  • Improved Performance: DRKL scored higher than the baseline objectives across the datasets and model families tested.
  • Superior Fidelity-Diversity Trade-off: The proposed method offers a better balance between fidelity and diversity in generated outputs.

Conclusion

In conclusion, Diversity-aware RKL (DRKL) is a significant advance in large language model distillation. By addressing the limitations of RKL, it improves student performance while also preserving output diversity.


Lazarus Omolua (https://richlyai.com/blog)