When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
Published on arXiv:2603.26556v1
Summary: This article covers distilling pretrained Transformer models into more efficient hybrid models, with the goal of reducing inference costs while maintaining the quality of generated outputs.
Abstract
Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B-parameter distilled model that comes within 0.2 pp of its teacher under log-likelihood scoring falls behind by 20.8 pp when it must generate answers autoregressively.
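The difference between the two evaluation modes can be illustrated with a toy example. The sketch below is not the paper's evaluation harness: the `TOY_MODEL` table, its probabilities, and all function names are hypothetical, chosen only to show how ranking fixed candidates by log-likelihood can pick the right answer while greedy autoregressive decoding from the same model produces something else.

```python
import math

# Hypothetical toy "model": maps a context (tuple of tokens) to a
# next-token probability distribution. All values are made up.
TOY_MODEL = {
    ("Q",): {"Paris": 0.40, "The": 0.60},
    ("Q", "Paris"): {"<eos>": 1.0},
    ("Q", "The"): {"answer": 1.0},
    ("Q", "The", "answer"): {"<eos>": 1.0},
}


def next_token_probs(context):
    """Return the next-token distribution for a context (default: stop)."""
    return TOY_MODEL.get(tuple(context), {"<eos>": 1.0})


def sequence_logprob(prompt, completion):
    """Sum of log-probabilities of the completion tokens given the prompt."""
    context = list(prompt)
    total = 0.0
    for tok in completion:
        # Tokens the toy model never predicts get a tiny floor probability.
        total += math.log(next_token_probs(context).get(tok, 1e-9))
        context.append(tok)
    return total


def loglikelihood_choice(prompt, candidates):
    """Multiple-choice scoring: pick the candidate with the highest log-likelihood."""
    return max(candidates, key=lambda c: sequence_logprob(prompt, c))


def greedy_generate(prompt, max_new_tokens=8):
    """Generation-based scoring: decode greedily until <eos>."""
    context, output = list(prompt), []
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        output.append(tok)
        context.append(tok)
    return output


candidates = [["Paris"], ["Rome"]]
ranked = loglikelihood_choice(["Q"], candidates)   # compares only the given options
generated = greedy_generate(["Q"])                 # must produce the answer itself
```

Here the log-likelihood ranking selects `["Paris"]` because it only has to beat `["Rome"]`, while greedy decoding follows the higher-probability token `"The"` and never emits the correct answer. A benchmark scored the first way would count this item as correct; one scored the second way would not.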
Proposed Methodology
We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes:
- Training objective
- Loss masking
- Training duration
- Dataset selection
- Parameter freezing
- Architecture choice
Key Findings
Our analysis reveals that log-likelihood-based evaluation consistently underestimates the gap between teacher and student models. In some cases, this method can even reverse the ranking of design choices, suggesting that conclusions drawn solely from perplexity evaluation may be misleading. Among the factors we studied, the following had the largest impact on generation quality:
- Dataset selection
- Completion-only masking
- Freezing attention layers during post-training
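Completion-only masking, one of the factors listed above, can be sketched as a per-token distillation loss that skips prompt positions. This is a minimal illustration under stated assumptions: the forward KL objective, the function names, and the list-of-logits representation are all hypothetical choices made for the example, since the summary does not specify the exact objective used by GenDistill.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """Forward KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def masked_distill_loss(teacher_logits, student_logits, completion_mask):
    """Average KL(teacher || student) over completion positions only.

    teacher_logits, student_logits: one list of vocabulary logits per position.
    completion_mask: 1 for completion tokens, 0 for prompt tokens (assumed layout).
    """
    total, count = 0.0, 0
    for t, s, keep in zip(teacher_logits, student_logits, completion_mask):
        if not keep:
            continue  # prompt positions contribute nothing to the loss
        total += kl_divergence(softmax(t), softmax(s))
        count += 1
    return total / count if count else 0.0


# Two prompt positions followed by two completion positions, vocabulary of size 3.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
student = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
mask = [0, 0, 1, 1]

loss_completion_only = masked_distill_loss(teacher, student, mask)
loss_all_tokens = masked_distill_loss(teacher, student, [1, 1, 1, 1])
```

In this example the student disagrees with the teacher only on the prompt positions, so the completion-only loss is zero while the unmasked loss is positive. The masked variant spends the training signal on the tokens the model will actually have to generate.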
Performance Metrics
Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75%. It also improves time-to-first-token by 2–4× at 128K-token contexts.
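A back-of-the-envelope calculation shows how a reduction of this size can arise. The sketch below rests on assumptions not stated in the text: the layer count, KV-head count, head dimension, and 16-bit cache are illustrative values, and keeping one softmax-attention layer in four is a hypothetical hybrid ratio. It also ignores the fixed-size recurrent state of the non-attention layers, which does not grow with context length.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: keys and values (factor 2) for every attention layer."""
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


SEQ_LEN = 128 * 1024  # 128K-token context

# Assumed teacher configuration: every layer uses softmax attention.
full = kv_cache_bytes(num_attn_layers=28, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

# Assumed hybrid: only 7 of 28 layers keep softmax attention and a KV cache.
hybrid = kv_cache_bytes(num_attn_layers=7, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

reduction = 1 - hybrid / full

print(f"full:   {full / 2**30:.1f} GiB")
print(f"hybrid: {hybrid / 2**30:.1f} GiB")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the cache shrinks from 14.0 GiB to 3.5 GiB per sequence, a 75% reduction. Because the cache scales linearly with the number of attention layers, the saving depends only on the fraction of layers that remain softmax attention.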
Conclusion
This study underscores the importance of generation-based evaluation when distilling hybrid sequence models. A systematic approach to model design and evaluation yields more efficient models that largely preserve the quality of generated outputs.
