Token Importance in On-Policy Distillation Explained

TIP: Token Importance in On-Policy Distillation

Source: arXiv:2604.14084v1 (cross-listed)

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong.
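The two quantities behind these regions can be computed per token position from the student and teacher next-token distributions. The sketch below is illustrative only: the function names and toy distributions are hypothetical, and it assumes the divergence is KL(student || teacher), since the abstract does not say which divergence is used.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) when q has a zero entry."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

# Toy distributions over a 3-token vocabulary (made-up numbers).
student_uncertain = [0.4, 0.3, 0.3]         # spread out: high entropy
student_overconfident = [0.98, 0.01, 0.01]  # peaked: low entropy
teacher = [0.05, 0.90, 0.05]                # teacher prefers a different token

h_uncertain = entropy(student_uncertain)
h_overconfident = entropy(student_overconfident)
# Assumption: divergence measured as KL(student || teacher).
d_overconfident = kl_divergence(student_overconfident, teacher)
```

Here the peaked student distribution has low entropy but a large divergence from the teacher, which is the "overconfident and wrong" case described above.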

Key Findings

Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules.
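As a rough illustration of retaining 50% of tokens by entropy, the sketch below samples positions without replacement, with a selection probability that grows with student entropy. The summary does not describe the exact sampling scheme, so this weighted-sampling approach (Efraimidis-Spirakis keys), the function name `sample_by_entropy`, and its parameters are all assumptions.

```python
import random

def sample_by_entropy(entropies, keep_ratio=0.5, seed=0, eps=1e-8):
    """Return sorted indices of retained token positions.

    Weighted sampling without replacement: each position draws a key
    u ** (1 / weight) with u uniform in [0, 1), and the k largest keys win.
    Higher entropy means a larger weight and a better chance of being kept.
    This is a hypothetical scheme, not the paper's stated procedure.
    """
    rng = random.Random(seed)
    k = max(1, int(keep_ratio * len(entropies)))
    keys = []
    for i, h in enumerate(entropies):
        u = rng.random()
        keys.append((u ** (1.0 / (h + eps)), i))  # eps avoids division by zero
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])

# Toy per-token student entropies for an 8-token rollout (made-up numbers).
entropies = [1.2, 0.01, 0.9, 0.02, 1.5, 0.03, 0.8, 0.01]
kept = sample_by_entropy(entropies, keep_ratio=0.5)
```

A rule like this rarely picks the near-zero-entropy positions, which is why overconfident tokens are nearly invisible to entropy-only selection.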

Introduction to TIP

We organize these findings with TIP (Token Importance in On-Policy Distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement.
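A two-axis view of this kind can be sketched as a simple classification over (entropy, divergence) pairs, followed by a rule that keeps both informative regions. The thresholds and the labels "uncertain" and "settled" are hypothetical choices for illustration; only the overconfident region (low entropy, high divergence) is named in the summary.

```python
def token_type(entropy, divergence, h_thresh=0.5, d_thresh=1.0):
    """Place one token position in the entropy/divergence plane.

    h_thresh and d_thresh are hypothetical cutoffs (in nats),
    not values taken from the paper.
    """
    if entropy >= h_thresh:
        return "uncertain"      # high student entropy
    if divergence >= d_thresh:
        return "overconfident"  # low entropy, high teacher-student divergence
    return "settled"            # low entropy, low divergence

def select_tokens(entropies, divergences, **thresholds):
    """Boolean mask keeping uncertain and overconfident positions."""
    return [
        token_type(h, d, **thresholds) != "settled"
        for h, d in zip(entropies, divergences)
    ]

# Toy per-token values for a 4-token rollout (made-up numbers).
entropies = [1.09, 0.11, 0.05, 0.90]
divergences = [0.30, 2.86, 0.02, 1.50]
types = [token_type(h, d) for h, d in zip(entropies, divergences)]
mask = select_tokens(entropies, divergences)
```

The second position would be dropped by an entropy-only rule but is kept here because the divergence axis flags it.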

Methodology

Our research validates this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning. The methodology revolves around the following key components:

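Entropy-based Sampling: Retaining 50% of tokens by student entropy, which matches or exceeds all-token training while reducing peak memory by up to 47%.
High-Divergence Tokens: Isolating low-entropy, high-divergence tokens, where the student is overconfident and wrong; training on these uses fewer than 10% of all tokens and nearly matches full-token baselines.
Two-Axis Taxonomy: Using the TIP framework to organize tokens by student entropy and teacher-student divergence.
Experimental Validation: Testing on MATH-500, AIME 2024/2025, and DeepPlanning across three teacher-student pairs to check that the findings are robust.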
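One plausible way for token selection to enter training is as a mask over per-token distillation losses, so that only retained positions contribute. The sketch below assumes the per-token teacher-student divergences have already been computed; the function name and the mean-over-kept-tokens reduction are assumptions, not details confirmed by the summary.

```python
def masked_distillation_loss(per_token_divergence, keep_mask):
    """Average the per-token divergence over retained positions only.

    per_token_divergence: one non-negative float per token position.
    keep_mask: one boolean per position, True if the token was selected.
    Returns 0.0 when no position is selected (hypothetical convention).
    """
    kept = [d for d, keep in zip(per_token_divergence, keep_mask) if keep]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)

# Toy values: positions 0 and 2 were selected, the others are skipped.
divergences = [2.0, 0.1, 4.0, 0.0]
mask = [True, False, True, False]
loss = masked_distillation_loss(divergences, mask)
```

In a real training loop the same mask would be applied to tensors before the backward pass, so skipped positions contribute no gradient.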

Conclusion

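These findings clarify which tokens drive learning in on-policy knowledge distillation. Student entropy alone is a strong selection signal: training on 50% of tokens matches or exceeds all-token training while reducing peak memory by up to 47%. Adding teacher-student divergence recovers the overconfident tokens that entropy-only rules miss, and training on this subset, fewer than 10% of all tokens, nearly matches full-token baselines. Together, the results give practical guidance for type-aware token selection rules that combine uncertainty and disagreement.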

Future Work

Future work will refine token selection strategies and examine how token importance behaves in other domains of artificial intelligence.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
