Self-Distillation for Multi-Token Prediction
Summary: arXiv:2603.23911v1
As Large Language Models (LLMs) continue to scale, inference efficiency has become a pressing concern. Multi-Token Prediction (MTP) is a promising way to speed up LLM inference by letting the model predict several future tokens at once. Current MTP approaches, however, face challenges that limit their effectiveness and practicality.
In this article, we introduce MTP-D, a self-distillation method that addresses two major obstacles in existing MTP strategies: the limited acceptance rates of MTP heads and the difficulty of jointly training multiple MTP heads.
Challenges in Multi-Token Prediction
Despite its potential benefits, MTP faces two notable challenges:
- Limited Acceptance Rates: The acceptance rates of MTP heads have historically been low, which prevents models from taking full advantage of parallel prediction.
- Joint Training Complexities: Training multiple MTP heads concurrently is difficult and can lead to suboptimal performance and increased resource consumption.
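To make "acceptance rate" concrete, the following is a minimal sketch of how it could be measured. It assumes a greedy verification scheme in which the tokens drafted by the MTP heads are compared against the tokens the main head would have produced, and a head's token counts as accepted only if every earlier head's token also matched. The function names and the verification scheme are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that match the main head's tokens."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


def per_head_acceptance_rates(
    drafts: Sequence[Sequence[int]],
    verifieds: Sequence[Sequence[int]],
) -> List[float]:
    """Fraction of decoding steps at which each MTP head's token was accepted.

    Assumption: head k is accepted at a step only if heads 1..k all matched
    the main head (greedy prefix verification).
    """
    num_heads = len(drafts[0])
    accepted = [0] * num_heads
    for draft, verified in zip(drafts, verifieds):
        for k in range(accepted_prefix_length(draft, verified)):
            accepted[k] += 1
    return [count / len(drafts) for count in accepted]


# Hypothetical token ids: four decoding steps, three MTP heads.
drafts = [[1, 2, 3], [1, 9, 3], [7, 2, 3], [1, 2, 8]]
verifieds = [[1, 2, 3]] * 4
rates = per_head_acceptance_rates(drafts, verifieds)
```

Under this scheme the rates can only decrease from one head to the next, since a later head cannot be accepted once an earlier one has been rejected.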
Introducing MTP-D
MTP-D offers a simple yet effective solution to these challenges at minimal additional training cost. It improves the acceptance rates of MTP heads by +7.5% while maintaining the performance of the main head. Higher acceptance rates matter because they allow more of the tokens predicted by the MTP heads to be used.
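The text does not spell out the MTP-D objective, so the following is only a generic sketch of what a self-distillation term can look like. It assumes the main head's output distribution at the target position acts as a fixed teacher and the MTP head's distribution is the student, with a KL divergence between them as the loss. The function names, the temperature parameter, and the choice of KL direction are all assumptions for illustration, not the authors' formulation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def self_distillation_loss(
    main_head_logits: Sequence[float],
    mtp_head_logits: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Assumed objective: pull the MTP head toward the main head's distribution.

    The teacher is treated as a constant; in a real training framework its
    gradient would be stopped so that only the MTP head is updated.
    """
    teacher = softmax(main_head_logits, temperature)
    student = softmax(mtp_head_logits, temperature)
    return kl_divergence(teacher, student)


# Hypothetical logits over a four-token vocabulary.
loss_same = self_distillation_loss([2.0, 0.5, -1.0, 0.0], [2.0, 0.5, -1.0, 0.0])
loss_diff = self_distillation_loss([2.0, 0.5, -1.0, 0.0], [0.0, 2.0, 0.5, -1.0])
```

Because the teacher and student belong to the same model, a term of this kind needs no separate teacher network, which is consistent with the claim of minimal additional training cost.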
Looped Extension Strategy
In addition to MTP-D, we introduce a looped extension strategy that extends MTP heads effectively and economically. With this strategy, we observe a substantial gain in inference speed, reaching a +220.4% speedup for 1-head MTP. This is particularly useful for applications that require fast response times, such as conversational agents and real-time translation services.
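As a rough aid for reading these numbers, the sketch below converts a percentage speedup into a throughput multiplier and shows how acceptance rates translate into tokens produced per forward pass. It assumes that "+220.4%" denotes a relative throughput gain over the baseline the authors measured against, and that a step always emits one token from the main head plus every accepted MTP token. Both readings, and the helper names, are assumptions rather than definitions from the paper.

```python
from typing import Sequence


def speedup_factor(percent_gain: float) -> float:
    """Convert a '+X%' speedup into a throughput multiplier (assumed reading)."""
    return 1.0 + percent_gain / 100.0


def relative_latency(percent_gain: float) -> float:
    """Time per token relative to the baseline, under the same reading."""
    return 1.0 / speedup_factor(percent_gain)


def expected_tokens_per_step(acceptance_rates: Sequence[float]) -> float:
    """Expected tokens emitted per forward pass.

    acceptance_rates[k] is the probability that head k+1's token is accepted,
    already accounting for all earlier heads being accepted. The main head
    always contributes one token.
    """
    return 1.0 + sum(acceptance_rates)


factor = speedup_factor(220.4)      # throughput multiplier vs. baseline
latency = relative_latency(220.4)   # fraction of baseline time per token
tokens = expected_tokens_per_step([0.75, 0.5, 0.25])  # hypothetical rates
```

Under this reading, a +220.4% speedup means roughly 3.2 times the baseline throughput. The tokens-per-step figure is an upper bound on the achievable gain, since it ignores the extra compute spent on drafting and verification.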
Key Insights and Validation
We examine the principles behind distillation strategies and explore the scalability of MTP through comprehensive experiments on seven diverse benchmarks. The results show that MTP-D, combined with the looped extension strategy, improves the performance of MTP heads while also improving inference efficiency.
Conclusion
MTP-D and the looped extension strategy advance multi-token prediction for large language models. By addressing the existing challenges and improving efficiency, they pave the way for practical and scalable use of MTP in real-world scenarios. As demand for faster and more efficient AI-driven solutions grows, this research could have significant implications for the future of LLMs.
