Two-Stage Optimizer-Aware Online Data Selection for Large Language Models
Summary: arXiv:2604.00001v1
Announce Type: cross
Abstract
Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning. However, existing methods are primarily designed for offline settings, making them less suitable for online fine-tuning where data is presented sequentially. In such scenarios, sample utility is step-dependent, and the effective update geometry is influenced by adaptive optimizers. To address this challenge, we propose an optimizer-aware framework for gradient-based online data selection and reweighting specifically tailored for LLM fine-tuning.
Introduction
Our key idea is to treat online data selection not as a static ranking of samples but as a way of shaping the next target-oriented update given the current optimizer state. Because the optimizer state changes at every step, the selection adapts as training proceeds and as new data arrives.
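To make the step dependence concrete, the following toy sketch ranks two samples under two different optimizer states. It is an illustration under stated assumptions, not the paper's scoring rule: the Adam-style scaling `g / (sqrt(v) + eps)` ignores momentum and bias correction, and the names `adam_direction`, `utility`, and `rank` are hypothetical.

```python
import math

def adam_direction(grad, second_moment, eps=1e-8):
    # Assumed Adam-style direction: g / (sqrt(v) + eps), no momentum or bias correction.
    return [g / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]

def utility(sample_grad, target_grad, second_moment):
    # First-order decrease in target loss (per unit learning rate) when stepping
    # along the sample's preconditioned gradient.
    step = adam_direction(sample_grad, second_moment)
    return sum(s * t for s, t in zip(step, target_grad))

def rank(samples, target_grad, second_moment):
    # Sample names ordered from most to least useful under the given optimizer state.
    return sorted(samples,
                  key=lambda name: utility(samples[name], target_grad, second_moment),
                  reverse=True)

samples = {"a": [3.0, 0.0], "b": [0.0, 2.0]}   # toy per-sample gradients
target_grad = [1.0, 1.0]                        # toy gradient of the target loss
early_state = [1.0, 1.0]                        # second-moment estimate early in training
late_state = [100.0, 1.0]                       # coordinate 0 has seen large gradients since

print(rank(samples, target_grad, early_state))  # ['a', 'b']
print(rank(samples, target_grad, late_state))   # ['b', 'a']
```

The gradients are identical in both calls; only the optimizer state differs, and that alone reverses the ranking.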
Methodology
We formulate online data selection as an optimizer-aware update-matching problem. This formulation connects to second-order target utility and shows why subset-level construction must account for interactions and redundancy among the selected samples. We solve it with a two-stage Filter-then-Weight algorithm:
- Filter Stage: Retains the candidates that are geometrically useful for the current update.
- Weight Stage: Optimizes the coefficients of the retained candidates to maximize their utility in the update.
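The two stages can be sketched on toy vectors as follows. This is a minimal sketch under assumptions, not the authors' implementation: the diagonal Adam-style preconditioner, the first-order filter score, the non-negative least-squares objective for the weights, and every function name here are illustrative choices that the summary does not specify.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def precondition(grad, second_moment, eps=1e-8):
    # Assumed Adam-style diagonal scaling: g / (sqrt(v) + eps).
    return [g / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]

def filter_stage(updates, target_grad, k):
    # Score each candidate by the first-order target-loss decrease of its
    # preconditioned update; keep the top-k among those with a positive score.
    scores = [dot(u, target_grad) for u in updates]
    ranked = sorted(range(len(updates)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]

def weight_stage(updates, keep, target_update, steps=200):
    # Projected gradient descent on
    #   0.5 * || sum_i w_i * u_i - target_update ||^2   subject to w_i >= 0.
    chosen = [updates[i] for i in keep]
    if not chosen:
        return {}
    # 1 / trace(Gram matrix) is a safe step size: the trace bounds the top eigenvalue.
    lr = 1.0 / sum(dot(u, u) for u in chosen)
    w = [0.0] * len(chosen)
    for _ in range(steps):
        combined = [sum(wi * u[d] for wi, u in zip(w, chosen))
                    for d in range(len(target_update))]
        residual = [c - t for c, t in zip(combined, target_update)]
        w = [max(0.0, wi - lr * dot(u, residual)) for wi, u in zip(w, chosen)]
    return dict(zip(keep, w))

def filter_then_weight(cand_grads, target_grad, second_moment, k):
    # Returns {candidate index: weight} for the retained candidates.
    updates = [precondition(g, second_moment) for g in cand_grads]
    target_update = precondition(target_grad, second_moment)
    keep = filter_stage(updates, target_grad, k)
    return weight_stage(updates, keep, target_update)

candidates = [[2.0, 0.0], [0.0, 2.0], [-2.0, -1.0]]  # toy per-sample gradients
weights = filter_then_weight(candidates, target_grad=[1.0, 1.0],
                             second_moment=[4.0, 1.0], k=2)
print(weights)
```

In this toy run the third candidate points away from the target and is filtered out, and the weight stage then assigns the two remaining candidates coefficients whose combined update reproduces the preconditioned target update. Fitting the weights jointly, rather than scoring samples one by one, is what lets redundant candidates receive small or zero coefficients.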
Practical Implementation
To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation. It keeps the required gradient computations efficient, particularly for long-context data, which allows the method to scale to realistic fine-tuning workloads.
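One way to see why a factorized representation helps: for a linear layer, a sample's weight gradient is a sum of per-token outer products, so the inner product of two such gradients can be computed from the factors without building the full matrix. The sketch below checks this standard identity on toy data. It assumes a single linear layer and uses hypothetical names (`explicit_grad`, `factored_inner`); it does not attempt to reproduce the paper's actual representation or its long-context optimizations.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def explicit_grad(deltas, acts):
    # Materialize G = sum_t outer(delta_t, x_t) as a d_out x d_in matrix.
    d_out, d_in = len(deltas[0]), len(acts[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for delta, x in zip(deltas, acts):
        for i in range(d_out):
            for j in range(d_in):
                grad[i][j] += delta[i] * x[j]
    return grad

def frobenius_inner(a, b):
    # <A, B> = sum_ij A_ij * B_ij
    return sum(dot(row_a, row_b) for row_a, row_b in zip(a, b))

def factored_inner(deltas_a, acts_a, deltas_b, acts_b):
    # Same inner product from the factors only:
    #   <G_a, G_b> = sum_{t,s} (delta_t . delta_s) * (x_t . x_s)
    return sum(dot(da, db) * dot(xa, xb)
               for da, xa in zip(deltas_a, acts_a)
               for db, xb in zip(deltas_b, acts_b))

# Sample a has two tokens, sample b has one; d_out = 2, d_in = 3.
deltas_a = [[1.0, 0.0], [0.0, 1.0]]            # per-token output gradients
acts_a = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]    # per-token input activations
deltas_b = [[1.0, 1.0]]
acts_b = [[1.0, 0.0, 1.0]]

full = frobenius_inner(explicit_grad(deltas_a, acts_a),
                       explicit_grad(deltas_b, acts_b))
fact = factored_inner(deltas_a, acts_a, deltas_b, acts_b)
print(full, fact)  # 2.0 2.0
```

The factored form only ever touches vectors of size d_out and d_in, so its memory footprint does not grow with the d_out x d_in size of the weight matrix.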
Results
We evaluate the proposed method against existing online data selection baselines. Across experiments, the two-stage Filter-then-Weight algorithm consistently improves convergence and downstream performance under the same data budget.
Conclusion
Our optimizer-aware approach recasts online data selection as a dynamic interaction with the optimizer state, yielding a method for LLM fine-tuning that is both theoretically grounded and practical. Future work will explore further enhancements and applications of the method across diverse LLM fine-tuning scenarios.
