Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
Summary: arXiv:2604.08723v1 Announce Type: cross
Abstract: Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model’s performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta and sample-level delta.
This article examines how the structure of preference pairs contributes to reasoning gains in language models. Preference optimization methods such as DPO (Direct Preference Optimization) and KTO (Kahneman-Tversky Optimization) are widely used to align language models, but it is poorly understood which properties of the preference data drive those gains.
Understanding Quality Delta
To better understand the improvements in reasoning tasks, we explore two distinct notions of quality delta in preference data:
- Generator-Level Delta: This arises from differences in capability between models that generate chosen and rejected reasoning traces. Essentially, it examines how variations in the generating models influence the resultant preference pairs.
- Sample-Level Delta: This refers to the differences in judged quality within an individual preference pair. It focuses on how the characteristics of the specific examples provided in the preference pairs affect the model’s performance.
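As a rough illustration of generator-level delta, the sketch below enumerates (chosen generator, rejected generator) pairings and ranks them by a capability gap. It rests on assumptions that are not from the paper: the `Generator` class, the generator names and families, and the use of parameter count as a stand-in for capability are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Generator:
    name: str        # hypothetical label, not a real model
    family: str      # hypothetical model family
    params_b: float  # parameter count in billions (assumed capability proxy)


def generator_level_delta(chosen: Generator, rejected: Generator) -> float:
    """Assumed proxy: scale gap between chosen-side and rejected-side generators."""
    return chosen.params_b - rejected.params_b


generators = [
    Generator("small", "family-a", 1.0),
    Generator("medium", "family-a", 8.0),
    Generator("large", "family-b", 70.0),
]

# Keep only pairings where the chosen-side generator is the stronger one,
# then rank them from the largest gap to the smallest.
pairings = [
    (c, r, generator_level_delta(c, r))
    for c, r in permutations(generators, 2)
    if generator_level_delta(c, r) > 0
]
pairings.sort(key=lambda item: item[2], reverse=True)

for chosen, rejected, delta in pairings:
    print(f"chosen={chosen.name:<6} rejected={rejected.name:<6} delta={delta:g}B")
```

A benchmark score or any other capability measure could replace `params_b` without changing the structure of the sketch.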
Investigative Methodology
To study generator-level delta, we varied the scale and model family of the generators, which let us assess how the capability gap between them affects the usefulness of the resulting preference pairs. For sample-level delta, we used a large language model (LLM) as a judge to score each generated trace along multiple reasoning-quality dimensions.
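One way to turn per-dimension judge scores into a single sample-level delta is sketched below. The dimension names, the score scale, and the choice of an unweighted mean are assumptions made for illustration; the source does not specify them, and no actual judge call is shown.

```python
from statistics import mean

# Hypothetical reasoning-quality dimensions; the source does not list them.
DIMENSIONS = ("correctness", "coherence", "completeness")


def aggregate(scores: dict) -> float:
    """Collapse per-dimension judge scores into one score (mean is an assumption)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing judge scores for: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def sample_level_delta(chosen_scores: dict, rejected_scores: dict) -> float:
    """Judged quality of the chosen trace minus that of the rejected trace."""
    return aggregate(chosen_scores) - aggregate(rejected_scores)


# Illustrative scores only, as if returned by an LLM judge.
chosen = {"correctness": 9, "coherence": 8, "completeness": 7}
rejected = {"correctness": 4, "coherence": 6, "completeness": 5}

delta = sample_level_delta(chosen, rejected)
print(delta)
```

A weighted aggregate, or one delta per dimension, would be equally plausible designs; the mean is used here only because it is the simplest option.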
Key Findings
Our analysis yields two main findings:
- Increasing generator-level delta consistently leads to improved performance on out-of-domain reasoning tasks. This suggests that the capability gap between the generators of chosen and rejected traces matters when constructing preference pairs.
- Filtering data by sample-level delta can significantly enhance data efficiency during training. By concentrating on the most informative training examples, models can achieve better outcomes with less data.
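Filtering by sample-level delta could look like the sketch below. It assumes each pair already carries a precomputed `sample_delta` field, that larger deltas mark the more informative pairs, and that a fixed fraction is kept. The field name, the function name, and the `keep_fraction` parameter are hypothetical, and the source does not state the actual filtering rule.

```python
import math


def filter_by_sample_delta(pairs: list, keep_fraction: float = 0.5) -> list:
    """Keep the top `keep_fraction` of pairs ranked by sample-level delta."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(pairs, key=lambda p: p["sample_delta"], reverse=True)
    # Round up so that a non-empty input always keeps at least one pair.
    keep = math.ceil(len(ranked) * keep_fraction)
    return ranked[:keep]


# Illustrative data only; the deltas are made up.
pairs = [
    {"id": "a", "sample_delta": 3.0},
    {"id": "b", "sample_delta": 0.5},
    {"id": "c", "sample_delta": 1.5},
    {"id": "d", "sample_delta": -1.0},
]

kept = filter_by_sample_delta(pairs, keep_fraction=0.5)
print([p["id"] for p in kept])
```

An absolute threshold on the delta would work as well; a fraction is used here because it makes the size of the training set explicit.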
Conclusion
Our results advocate a twofold strategy for optimizing reasoning performance through preference optimization:
- Maximize generator-level delta when constructing preference pairs to ensure a robust foundation for reasoning tasks.
- Leverage sample-level delta to identify and select the most informative training examples, further refining the model’s learning process.
In summary, the two quality deltas play complementary roles: generator-level delta guides how preference pairs should be constructed, and sample-level delta guides which of those pairs are worth training on. Which other properties of preference data affect reasoning performance remains open for further study.
