HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
arXiv:2603.23871v1
Introduction
Large language models (LLMs) are widely used for tasks that require mathematical reasoning. When such models are trained with reinforcement learning (RL), a critical challenge arises on “cliff” prompts: problems the model cannot solve at all. On these prompts the RL gradient vanishes, so no learning signal reaches the failure modes. Hybrid Distillation Policy Optimization (HDPO) has been proposed to address this issue.
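To illustrate why the gradient can vanish, the sketch below assumes a group-relative advantage estimator (rewards normalized within the group of rollouts for one prompt). The source does not say which RL estimator HDPO builds on, so the function name `group_advantages` and the normalization are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Assumed estimator: advantage = (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A prompt with mixed outcomes yields non-zero advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A cliff prompt: every rollout fails, so every advantage is exactly zero
# and a policy-gradient update weighted by these advantages is zero too.
cliff = group_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this assumption, any prompt whose rollouts all receive the same reward contributes nothing to the update, which is the situation HDPO targets.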
The HDPO Methodology
HDPO enhances traditional RL techniques by integrating privileged self-distillation that specifically targets cliff prompts. The core mechanism of HDPO operates through the following steps:
- Identification of Problematic Prompts: During each training iteration, HDPO identifies prompts where the model’s rollouts consistently fail.
- Generation of Privileged Rollouts: For these identified prompts, privileged rollouts are created by equipping the model with ground-truth information to guide its responses.
- Filtering Correct Solutions: The generated rollouts are filtered so that only those reaching the correct solution are kept.
- Token-Level Distillation: Finally, the teacher’s token-level distribution is distilled into the student. Teacher and student are the same model with identical weights; they differ only in input, since the teacher sees the privileged information and the student does not.
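The steps above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the privileged prompt template, the answer-extraction callback, and the choice of forward KL (teacher to student) as the distillation loss are all hypothetical, since the source does not specify them.

```python
import math

def is_cliff(rewards):
    """Step 1 (assumed criterion): a prompt is a cliff if every rollout failed."""
    return all(r == 0 for r in rewards)

def privileged_prompt(question, answer):
    """Step 2: hypothetical template that exposes the ground truth to the teacher."""
    return f"{question}\n\nReference answer: {answer}"

def filter_correct(rollouts, answer, extract):
    """Step 3: keep only rollouts whose extracted final answer matches the ground truth."""
    return [r for r in rollouts if extract(r) == answer]

def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) at one token position; student probs must be positive."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def distill_loss(teacher_dists, student_dists):
    """Step 4 (assumed loss): mean per-token KL over one kept rollout."""
    kls = [token_kl(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(kls) / len(kls)

# Toy usage with made-up rollouts and next-token distributions.
rollouts = ["... so x = 4", "... so x = 5"]
kept = filter_correct(rollouts, "4", lambda r: r.rsplit("= ", 1)[-1])
teacher = [[0.9, 0.1], [0.5, 0.5]]   # same weights, privileged input
student = [[0.6, 0.4], [0.5, 0.5]]   # same weights, original input
loss = distill_loss(teacher, student)
```

In a real training loop the two distributions would come from one forward pass each through the same network, once with the privileged prompt and once with the original prompt, both scored on the tokens of a kept rollout.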
Bounds on Realizability Gap
One of the significant advantages of HDPO is that its realizability gap is provably bounded, because the teacher and student share the same weights and differ only in input. In conventional cross-model distillation, by contrast, the teacher and student are different models, and the mismatch between them can produce a larger gap. The theory further shows that privileged generation filtered to correct rollouts (reward R=1) recovers the optimal KL-regularized RL policy in the hard-threshold limit.
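For context, the standard closed form of a KL-regularized RL policy is written out below. The notation is an assumption for illustration: a reference policy \pi_{\mathrm{ref}}, a temperature \beta, and a binary reward R(x, y) \in \{0, 1\}. Reading the hard-threshold limit as \beta \to 0^{+} is also an interpretation; the paper's own statement and proof are not reproduced here.

```latex
\begin{align}
\pi^{*}_{\beta} &= \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[R(x, y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big), \\
\pi^{*}_{\beta}(y \mid x) &= \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\big(R(x, y)/\beta\big)}
  {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\big(R(x, y')/\beta\big)}, \\
\lim_{\beta \to 0^{+}} \pi^{*}_{\beta}(y \mid x) &= \frac{\pi_{\mathrm{ref}}(y \mid x)\, \mathbf{1}[R(x, y) = 1]}
  {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \mathbf{1}[R(x, y') = 1]}.
\end{align}
```

The limit holds when at least one correct response has non-zero probability under \pi_{\mathrm{ref}}. In that case the limiting policy is the reference policy conditioned on success, which is the kind of distribution that sampling followed by R=1 filtering produces.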
Experimental Results
To validate HDPO, experiments were conducted on the OpenMathInstruct-2 dataset with the Qwen2.5-Math-1.5B-Instruct model. The results indicate that HDPO consistently improves coverage metrics:
- Pass@4 increased by 0.8-1.1%
- Pass@8 improved by 0.4-1.7%
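For readers unfamiliar with the coverage metrics above, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is the number of correct ones. The source does not state which estimator was used, so the sketch below is an assumption about the usual convention, not a description of the paper's evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples for one problem, exactly 1 of them correct.
p4 = pass_at_k(8, 1, 4)
p8 = pass_at_k(8, 1, 8)
```

The per-problem values are then averaged over the evaluation set to give the reported pass@k.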
The distillation weight, denoted lambda, provides direct control over the exploration-exploitation tradeoff during training.
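One plausible way such a weight enters training is as a coefficient on the distillation term of a combined loss. The additive form and the name `hybrid_loss` below are assumptions for illustration; the source only says that lambda weights the distillation.

```python
def hybrid_loss(rl_loss, distill_loss, lam):
    """Assumed additive objective: RL term plus lambda-weighted distillation term."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return rl_loss + lam * distill_loss

# lam = 0 reduces to the plain RL loss; larger lam puts more weight on distillation.
pure_rl = hybrid_loss(0.5, 0.2, 0.0)
blended = hybrid_loss(0.5, 0.2, 0.5)
```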
Conclusion
Hybrid Distillation Policy Optimization targets a specific weakness in RL training of large language models for mathematical reasoning: the absence of a learning signal on cliff prompts. By supplying that signal through privileged self-distillation, HDPO improves coverage metrics while keeping the realizability gap provably bounded.
