Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
arXiv:2604.02766v1 (cross-listed)
Abstract
Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline.
Key Findings
The study evaluates uncertainty-based APL against Random sampling across harmlessness, helpfulness, and instruction-following settings. It uses both reward models and LLM-as-a-judge proxies to measure how well each strategy works.
Methodology
The study involves the following key components:
- Active Preference Learning (APL): A method intended to enhance query efficiency in the context of online Direct Preference Optimization.
- Random Sampling: A baseline method that utilizes random selection from a rich pool of on-policy candidates.
- Evaluation Criteria: The methods are compared in three settings (harmlessness, helpfulness, and instruction-following), with reward models and LLM-as-a-judge proxies used to measure win-rates.
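To make the contrast between the two selection strategies concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes each candidate pair carries a scalar `margin` (for example, a DPO implicit reward difference between the two responses), and it uses the binary entropy of the Bradley-Terry preference probability as the uncertainty score. The names `preference_entropy`, `select_random`, `select_uncertain`, and the toy margins are all hypothetical.

```python
import math
import random


def preference_entropy(margin):
    """Binary entropy of the Bradley-Terry probability sigmoid(margin).

    Entropy is symmetric in the sign of the margin, so abs() is used to
    keep exp() from overflowing on large negative margins.
    """
    p = 1.0 / (1.0 + math.exp(-abs(margin)))
    if p >= 1.0:  # fully confident preference carries no uncertainty
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def select_random(pool, k, rng):
    """Baseline: draw k candidate pairs uniformly without replacement."""
    return rng.sample(pool, k)


def select_uncertain(pool, k):
    """Active selection (assumed form): keep the k highest-entropy pairs."""
    ranked = sorted(pool, key=lambda c: preference_entropy(c["margin"]), reverse=True)
    return ranked[:k]


# Toy on-policy pool; the margins are made-up values for illustration only.
margins = [2.5, -0.1, 0.8, 0.05, -3.0, 1.2]
pool = [{"id": i, "margin": m} for i, m in enumerate(margins)]

rng = random.Random(0)
random_batch = select_random(pool, 2, rng)
active_batch = select_uncertain(pool, 2)
```

The active selector has to score every candidate in the pool before it can pick a batch, whereas the random selector needs no scoring at all. That scoring pass is the kind of overhead the study weighs against the gains from active selection.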
Results
The evaluation finds the following:
- APL yields negligible improvements in proxy win-rates compared to Random sampling.
- Win-rate and general capability dissociate: win-rate improves even as general capability, measured by standard benchmarks, degrades.
- APL does not effectively mitigate capability collapse or significantly reduce variance compared to Random sampling.
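The dissociation between win-rate and benchmark capability is easy to miss when only one number is tracked. The sketch below shows one way to monitor both together. It assumes judge outcomes are recorded as `"win"`, `"tie"`, or `"loss"` against a reference response and that ties count as half a win; the function names and all numeric values are hypothetical and are not taken from the paper.

```python
def proxy_win_rate(outcomes):
    """Fraction of comparisons won, counting each tie as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[o] for o in outcomes) / len(outcomes)


def is_dissociated(win_before, win_after, bench_before, bench_after):
    """True when win-rate rises while the benchmark score falls."""
    return win_after > win_before and bench_after < bench_before


# Made-up judge outcomes before and after a round of preference training.
before = proxy_win_rate(["win", "loss", "loss", "tie"])
after = proxy_win_rate(["win", "win", "tie", "loss"])

# Made-up benchmark accuracies for the same two checkpoints.
flag = is_dissociated(before, after, bench_before=0.62, bench_after=0.55)
```

A check like this can only flag the pattern. Deciding whether the benchmark drop matters still requires looking at the benchmarks themselves.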
Implications
The results have several implications for post-training data selection:
- In scenarios dominated by strong pre-trained priors, the computational overhead associated with active selection may not be justified.
- The “cheap diversity” offered by simple random samples is hard for more complex selection strategies to beat.
- Future research should consider the balance between computational efficiency and the effectiveness of selection methods in LLM training.
Conclusion
The results call for reevaluating when active selection is worth applying in the post-training of modern LLMs. Understanding how pretrained priors interact with selection strategies will be important for deciding when the added cost pays off.
The code and data used in this research are publicly available at https://github.com/BootsofLagrangian/random-vs-apl.
