TIP: Token Importance in On-Policy Distillation
arXiv:2604.14084v1
Abstract
On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong.
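The two quantities behind these regions can be sketched per token position as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which divergence is used, so KL(student || teacher) is an assumption here, and the function names and toy distributions are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) when q has a zero entry."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

# Toy 3-token vocabulary at three positions (illustrative numbers only).
student = [
    [0.34, 0.33, 0.33],  # student is uncertain
    [0.98, 0.01, 0.01],  # student is confident, teacher agrees
    [0.98, 0.01, 0.01],  # student is confident, teacher disagrees
]
teacher = [
    [0.70, 0.20, 0.10],
    [0.97, 0.02, 0.01],
    [0.02, 0.97, 0.01],
]

student_entropy = [entropy(s) for s in student]
# Assumption: divergence measured as KL(student || teacher).
divergence = [kl_divergence(s, t) for s, t in zip(student, teacher)]

for pos, (h, d) in enumerate(zip(student_entropy, divergence)):
    print(f"position {pos}: entropy={h:.3f}, divergence={d:.3f}")
```

In this toy data, position 0 falls in the high-entropy region, while position 2 has low entropy but high divergence, which is the overconfident-and-wrong case.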
Key Findings
Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules.
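One way to read "entropy-based sampling" is weighted sampling without replacement, with each position's student entropy as its weight. The sketch below makes that assumption; the abstract does not specify the sampling scheme, and `sample_by_entropy`, `keep_ratio`, `seed`, and `eps` are hypothetical names.

```python
import random

def sample_by_entropy(entropies, keep_ratio=0.5, seed=0, eps=1e-8):
    """Pick about keep_ratio * n positions without replacement, favoring high entropy.

    Uses Efraimidis-Spirakis weighted sampling: each position gets the key
    u ** (1 / w) with u ~ Uniform(0, 1), and the k largest keys are kept.
    """
    rng = random.Random(seed)
    n = len(entropies)
    k = max(1, round(keep_ratio * n)) if n else 0
    keys = []
    for i, h in enumerate(entropies):
        w = max(h, 0.0) + eps  # eps keeps zero-entropy positions selectable
        u = rng.random()
        keys.append((u ** (1.0 / w), i))
    keys.sort(reverse=True)
    # Return the kept positions in sequence order.
    return sorted(i for _, i in keys[:k])

# Eight positions; the odd ones have noticeably higher entropy.
entropies = [0.0, 2.0, 0.0, 1.5, 0.0, 1.0, 0.0, 0.5]
kept = sample_by_entropy(entropies, keep_ratio=0.5)
print(kept)
```

Only the kept positions would then contribute to the distillation loss. Note that a confident but wrong position has an entropy near zero, so a rule like this almost never picks it, which is the gap described above.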
Introduction to TIP
We organize these findings with TIP (Token Importance in On-Policy Distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement.
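A type-aware rule of this kind could look like the sketch below. The thresholds, the quadrant labels, and the function names are hypothetical and chosen only for illustration; the abstract does not give the paper's actual selection rules.

```python
def token_type(entropy, divergence, h_thresh=0.5, d_thresh=0.5):
    """Place one token position in a quadrant of the entropy/divergence plane.

    Labels and thresholds are illustrative assumptions, not taken from the paper.
    """
    high_h = entropy >= h_thresh
    high_d = divergence >= d_thresh
    if high_h and high_d:
        return "uncertain_disagree"
    if high_h:
        return "uncertain_agree"
    if high_d:
        return "overconfident"  # low entropy, high divergence
    return "settled"            # low entropy, low divergence

def select_tokens(entropies, divergences, h_thresh=0.5, d_thresh=0.5):
    """Keep high-entropy positions plus low-entropy, high-divergence ones."""
    return [
        i
        for i, (h, d) in enumerate(zip(entropies, divergences))
        if token_type(h, d, h_thresh, d_thresh) != "settled"
    ]

# Per-position student entropy and teacher-student divergence (toy values).
entropies = [1.10, 0.11, 0.11]
divergences = [0.31, 0.003, 3.77]
print([token_type(h, d) for h, d in zip(entropies, divergences)])
print(select_tokens(entropies, divergences))
```

An entropy-only rule would keep position 0 alone; adding the divergence axis also keeps position 2, where the student is confident but disagrees with the teacher.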
Methodology
We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning. The study rests on four components:
- Entropy-based sampling: retain 50% of tokens, chosen by student entropy, and compare against all-token training on accuracy and peak memory.
- Low-entropy, high-divergence tokens: isolate the positions where the student is overconfident and wrong, which make up fewer than 10% of all tokens, and train on them alone.
- Two-axis taxonomy: use TIP to classify each token position by student entropy and teacher-student divergence.
- Experimental validation: repeat the comparisons across model families and across math and agentic-planning benchmarks to check that the findings are robust.
Conclusion
These findings clarify where the learning signal in on-policy knowledge distillation comes from. Training on an entropy-selected half of the tokens matches or exceeds all-token training while cutting peak memory by up to 47%, and type-aware rules that combine uncertainty and disagreement also capture the overconfident tokens that entropy-only selection misses.
Future Work
Future work will refine token selection strategies and examine how token importance behaves in other domains of artificial intelligence.
