Lightning OPD: Fast Offline Distillation for Large Models

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Source: arXiv:2604.13010v1 (cross-listed)

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD.

Introduction

Recent advances in language models have significantly improved their reasoning capabilities. One key technique behind these gains is on-policy distillation (OPD). OPD is promising, but it depends on a live teacher inference server throughout training, which is a barrier to efficient training.

Challenges with Standard OPD

We focus on three limitations of standard OPD and its naive offline variant:

Infrastructure Overhead: Standard OPD needs a live teacher inference server for the whole training run, which adds substantial infrastructure cost.
Performance Discrepancies: The naive offline variant, which reuses teacher log-probabilities precomputed over SFT rollouts, does not reliably match the performance of standard OPD.
Teacher Consistency: An often-overlooked requirement is that the same teacher model be used for both supervised fine-tuning (SFT) and OPD.

Understanding Teacher Consistency

Our investigation revealed that violating the teacher consistency condition leads to an irreducible gradient bias. This bias causes both offline and online OPD to converge to a suboptimal fixed point, regardless of the duration of training. This insight is fundamental in addressing the performance gap between standard OPD and its offline variant.
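To make the role of the teacher concrete, here is one common way to write the OPD objective as a reverse KL divergence over student samples. This is a sketch under assumptions, not necessarily the exact formulation used in the paper: the symbols \pi_\theta (student), \pi_T (teacher), and \mathcal{D} (prompt set) are introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})}
    \right]
  = \mathbb{E}_{x \sim \mathcal{D}}
    \left[ \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x) \right) \right]
```

Under this reading, the teacher enters the loss only through \log \pi_T. The teacher consistency condition then asks that this \pi_T be the same model that produced the SFT data the student was initialized on.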

Introducing Lightning OPD

To address the challenges identified, we propose Lightning OPD, an offline on-policy distillation framework that ensures teacher consistency by precomputing teacher log-probabilities over SFT rollouts. The advantages of Lightning OPD include:

Elimination of Live Server Requirement: Because teacher log-probabilities are precomputed, no live teacher server is needed during training.
Same Optimum: Under teacher consistency, Lightning OPD shares the same optimum as standard OPD.
Gradient Discrepancy Control: The gradient discrepancy relative to standard OPD remains bounded.
Regularization Effect: An implicit regularization effect helps prevent policy drift.
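The precompute-then-reuse idea above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the "models" are hypothetical functions that map a token prefix to logits over a three-token vocabulary, and the loss is a simple per-token log-ratio between student and cached teacher log-probabilities. The paper's actual loss and data pipeline may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def token_logprobs(model, rollout):
    """Log-probability the model assigns to each token of a stored rollout."""
    return [log_softmax(model(rollout[:t]))[tok] for t, tok in enumerate(rollout)]

def precompute_teacher_cache(teacher, rollouts):
    """One-off pass: score every stored rollout with the teacher."""
    return {i: token_logprobs(teacher, r) for i, r in enumerate(rollouts)}

def offline_distill_loss(student, rollouts, cache):
    """Mean per-token log-ratio (student minus cached teacher) on stored tokens."""
    total, count = 0.0, 0
    for i, rollout in enumerate(rollouts):
        student_lp = token_logprobs(student, rollout)
        for s, t in zip(student_lp, cache[i]):
            total += s - t
            count += 1
    return total / count

# Hypothetical stand-ins: each "model" ignores the prefix and returns fixed logits.
def toy_teacher(prefix):
    return [2.0, 0.5, -1.0]

def toy_student(prefix):
    return [1.0, 1.0, 0.0]

# Stored rollouts (token ids), standing in for SFT rollouts.
rollouts = [[0, 1, 0], [2, 0]]

# The teacher is queried once here and never again during training.
cache = precompute_teacher_cache(toy_teacher, rollouts)
loss = offline_distill_loss(toy_student, rollouts, cache)
```

After `precompute_teacher_cache` runs, the training-side function `offline_distill_loss` only reads from `cache`, which is the property that removes the live teacher server. A real setup would store the cache on disk alongside the rollouts and compute the loss on batched tensors.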

Experimental Results

We ran experiments on mathematical reasoning and code generation tasks. Lightning OPD achieves state-of-the-art performance at substantially lower training cost. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, a 4.0x speedup over standard OPD.
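As a quick back-of-the-envelope check, the reported figures imply a rough cost for standard OPD. The calculation below assumes the 4.0x speedup is a ratio of GPU hours on comparable hardware; the source does not state the standard OPD cost directly, so the derived numbers are estimates.

```python
# Figures reported in the text.
lightning_gpu_hours = 30.0
speedup = 4.0

# Assumption: speedup = standard OPD GPU hours / Lightning OPD GPU hours.
implied_standard_gpu_hours = lightning_gpu_hours * speedup
implied_gpu_hours_saved = implied_standard_gpu_hours - lightning_gpu_hours

print(f"Implied standard OPD cost: {implied_standard_gpu_hours:.0f} GPU hours")
print(f"Implied savings: {implied_gpu_hours_saved:.0f} GPU hours")
```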

Conclusion

Lightning OPD makes post-training for large reasoning models more efficient. By enforcing teacher consistency and eliminating the need for a live teacher server, it retains the optimum of standard OPD while lowering the barrier to entry for academic research on large language model post-training.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
