PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
PACED (Proximal Adaptive Competence-Enhanced Distillation) is a method that addresses an inefficiency in standard large language model (LLM) distillation. The paper, arXiv:2603.11178v3, proposes weighting each training problem according to the student’s current competence, so that training effort goes to the problems the student can actually learn from.
Traditional LLM distillation methods treat all training problems uniformly. This wastes compute on problems the student model has already mastered and on problems it cannot yet solve. The study presents empirical evidence that this inefficiency has a distinct gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR), viewed as a function of the student pass rate, follows a bell curve. The SNR collapses at both ends of the pass-rate spectrum, so problems the student almost always or almost never solves contribute little usable training signal.
The core innovation of PACED is the weighting function w(p) = p(1 - p), where p is the student’s empirical pass rate on a problem. The weight is zero at p = 0 and p = 1 and largest at p = 0.5, so training effort concentrates on problems within the student’s zone of proximal development, where the student is most likely to improve. Notably, the method requires only student rollouts and needs no architectural modifications or additional hyperparameters.
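The weighting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the problem identifiers, and the representation of rollouts as lists of pass/fail booleans are all assumptions made for the example.

```python
from typing import Dict, List


def empirical_pass_rate(outcomes: List[bool]) -> float:
    """Fraction of student rollouts that solved the problem."""
    if not outcomes:
        raise ValueError("need at least one rollout per problem")
    return sum(outcomes) / len(outcomes)


def paced_weight(p: float) -> float:
    """w(p) = p(1 - p): zero at p = 0 and p = 1, maximal at p = 0.5."""
    return p * (1.0 - p)


def problem_weights(rollouts: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each (hypothetical) problem id to its weight."""
    return {
        problem_id: paced_weight(empirical_pass_rate(outcomes))
        for problem_id, outcomes in rollouts.items()
    }


# Hypothetical rollout results: 8 student attempts per problem.
rollouts = {
    "mastered": [True] * 8,          # p = 1.0
    "frontier": [True, False] * 4,   # p = 0.5
    "too_hard": [False] * 8,         # p = 0.0
}
weights = problem_weights(rollouts)
```

In this toy setup only the "frontier" problem receives a nonzero weight; the mastered and unsolved problems drop out of the weighted objective entirely.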
The authors show that the Beta kernel, w(p) = p^α (1 - p)^β, is the leading-order optimal weight family derived from the SNR boundary-collapse structure; the default weight p(1 - p) is the special case α = β = 1. The kernel is also shown to be minimax-robust under misspecification, with a worst-case efficiency loss bounded by O(δ^2) in the size δ of the misspecification.
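As a rough illustration of the Beta kernel family, the sketch below evaluates w(p) = p^α (1 - p)^β and locates its maximum. The function names are hypothetical, and the peak formula α / (α + β) is ordinary calculus on the kernel (setting the derivative of its logarithm to zero), not a result quoted from the paper.

```python
def beta_kernel_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Unnormalized Beta kernel w(p) = p^alpha * (1 - p)^beta."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("alpha and beta must be positive")
    return (p ** alpha) * ((1.0 - p) ** beta)


def kernel_peak(alpha: float, beta: float) -> float:
    """Pass rate that maximizes the kernel: alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


# alpha = beta = 1 recovers the default weight p(1 - p), peaked at p = 0.5.
default_peak = kernel_peak(1.0, 1.0)

# alpha > beta shifts the emphasis toward problems the student solves more often.
shifted_peak = kernel_peak(2.0, 1.0)
```

Choosing α and β unequal moves the peak away from p = 0.5, while the weight still vanishes at both ends of the pass-rate range.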
Experimental Results
PACED has been evaluated across several model families, including Qwen3, Qwen2.5, and Llama-3. According to the reported results, it sets a new state of the art on MATH-500, AIME 2024, and AIME 2025. Specifically, the method improves over unweighted distillation by up to +8.2 and outperforms a strong baseline, AKL, by +3.6.
PACED also keeps forgetting low: the reported forgetting rate is just 1.4% in distillation and 0.6% in self-distillation. A two-stage schedule that trains first with forward KL and then with reverse KL extends these gains, reaching +5.8 over standard forward KL on the most challenging benchmarks.
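The two-stage schedule can be pictured with a small sketch over discrete next-token distributions. It assumes the usual distillation convention that forward KL is KL(teacher || student) and reverse KL is KL(student || teacher); the `switch_step` parameter and the hard switch between stages are hypothetical choices for illustration, since the summary above does not say how the paper sets the transition.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def scheduled_kl_loss(
    teacher: Sequence[float],
    student: Sequence[float],
    step: int,
    switch_step: int,
) -> float:
    """Forward KL before switch_step (hypothetical), reverse KL from then on."""
    if step < switch_step:
        # Stage 1: forward KL, KL(teacher || student), is mode-covering.
        return kl_divergence(teacher, student)
    # Stage 2: reverse KL, KL(student || teacher), is mode-seeking.
    return kl_divergence(student, teacher)


# Toy distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
stage_one_loss = scheduled_kl_loss(teacher, student, step=0, switch_step=100)
stage_two_loss = scheduled_kl_loss(teacher, student, step=100, switch_step=100)
```

In a weighted setup, each problem's loss would additionally be multiplied by its competence weight before averaging; that combination is likewise an assumption here rather than a detail taken from the source.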
Conclusion
PACED improves LLM distillation by directing training toward the problems a student model is ready to learn. By focusing on the zone of proximal development with a robust weighting mechanism, the approach makes better use of compute while improving both benchmark performance and retention of existing capabilities. As LLMs continue to evolve, competence-aware methods like PACED may play an important role in making distillation more efficient.
