Efficient Hybrid Sequence Models via Generation-Focused Distillation

When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Published on arXiv:2603.26556v1

Summary: This article covers the distillation of pretrained Transformer models into more efficient hybrid models, with the goal of reducing inference costs while preserving the quality of generated outputs.

Abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B-parameter distilled model that matches its teacher to within 0.2 percentage points (pp) under log-likelihood scoring falls behind by 20.8 pp when it must generate answers autoregressively.
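To make the difference between the two evaluation modes concrete, here is a minimal sketch in Python. It is not the paper's evaluation harness: the toy bigram "model", its probabilities, and the function names are all hypothetical and chosen only for illustration. Log-likelihood scoring only compares the supplied candidates against each other, while generation must produce the answer unprompted, so the same model can pass one test and fail the other.

```python
import math

# Hypothetical toy bigram "model": P(next token | previous token).
# The probabilities are invented for illustration only.
TOY_MODEL = {
    "<q>": {"The": 0.60, "Paris": 0.35, "London": 0.05},
    "The": {"answer": 0.9, "<eos>": 0.1},
    "answer": {"<eos>": 1.0},
    "Paris": {"<eos>": 1.0},
    "London": {"<eos>": 1.0},
}


def sequence_logprob(model, prompt_token, tokens):
    """Sum of log P(token | previous token) over a candidate answer."""
    logp = 0.0
    prev = prompt_token
    for tok in tokens:
        p = model.get(prev, {}).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")
        logp += math.log(p)
        prev = tok
    return logp


def rank_by_loglikelihood(model, prompt_token, candidates):
    """Multiple-choice style scoring: pick the most likely candidate."""
    return max(candidates, key=lambda c: sequence_logprob(model, prompt_token, c))


def greedy_generate(model, prompt_token, max_tokens=8):
    """Autoregressive evaluation: the model must produce the answer itself."""
    output = []
    prev = prompt_token
    for _ in range(max_tokens):
        dist = model.get(prev)
        if not dist:
            break
        tok = max(dist, key=dist.get)  # greedy decoding
        if tok == "<eos>":
            break
        output.append(tok)
        prev = tok
    return output


candidates = [("Paris",), ("London",)]
picked = rank_by_loglikelihood(TOY_MODEL, "<q>", candidates)
generated = greedy_generate(TOY_MODEL, "<q>")
print("ranked choice:", picked)
print("generated:", generated)
```

In this toy setup the ranking picks "Paris" because it is more likely than "London", yet greedy decoding never emits "Paris" at all. The gap is exaggerated here on purpose, but it shows how candidate ranking can hide a failure that generation exposes.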

Proposed Methodology

We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes:

Training objective
Loss masking
Training duration
Dataset selection
Parameter freezing
Architecture choice

Key Findings

Our analysis reveals that log-likelihood-based evaluation consistently underestimates the gap between teacher and student models. In some cases, this method can even reverse the ranking of design choices, suggesting that conclusions drawn solely from perplexity evaluation may be misleading. Among the factors we studied, the following had the largest impact on generation quality:

Dataset selection
Completion-only masking
Freezing attention layers during post-training

Performance Metrics

Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory usage by up to 75%. It also improves time-to-first-token by 2–4x at 128K-token contexts.
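As a rough back-of-the-envelope check on the "up to 75%" figure, the sketch below computes KV cache size from the number of layers that still use standard attention. The layer counts, head counts, and head dimension are illustrative assumptions, not values from the article, and the article does not state the hybrid's attention-to-KDA layer ratio. The calculation also ignores the fixed-size recurrent state that the non-attention layers would keep.

```python
def kv_cache_bytes(num_attention_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2):
    """KV cache size: keys and values (factor 2) for every attention layer."""
    return (2 * num_attention_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value)


SEQ_LEN = 128 * 1024  # 128K-token context

# Assumed configuration: 28 layers, 8 KV heads, head_dim 128, 16-bit values.
full_attention = kv_cache_bytes(28, 8, 128, SEQ_LEN)

# Assumed hybrid: only 1 in 4 layers keeps standard attention.
hybrid = kv_cache_bytes(7, 8, 128, SEQ_LEN)

reduction = 1 - hybrid / full_attention
print(f"full: {full_attention / 2**30:.1f} GiB, "
      f"hybrid: {hybrid / 2**30:.1f} GiB, reduction: {reduction:.0%}")
```

Under these assumptions the cache shrinks in direct proportion to the number of attention layers removed, so keeping a quarter of them yields a 75% reduction.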

Conclusion

This study underscores the importance of generation-based evaluation when distilling hybrid sequence models. A systematic approach to model design and evaluation yields more efficient models that retain most of the teacher's generation quality.
