SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training
arXiv:2601.23155v2
Abstract: Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, enabling greedy algorithms to achieve a (1-1/e) approximation under a cardinality budget. In practice, however, we identify alleviating gradient conflicts, misalignment between per-sample gradients, as a key factor that slows down the decay of marginal log-determinant information gains, thereby preventing significant loss of information. We formalize this via an ε-decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data, yet matches or exceeds 6 methods including full-data tuning. This achieves performance improvements with substantially lower training cost.
Introduction
Rapid progress in large language models (LLMs) has made efficient training more important. Data selection methods based on Fisher information are appealing because their log-determinant objective is monotone submodular, so a greedy algorithm achieves a (1-1/e) approximation under a cardinality budget. In practice they are hampered by gradient conflicts, that is, misalignment between per-sample gradients. SPICE addresses this with a selector that maximizes information while penalizing that misalignment.
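For reference, the objective and guarantee mentioned in the abstract can be written out as below. This is a sketch under stated assumptions: the per-sample gradients g_i, the regularizer λ, and the budget k are notation introduced here, and approximating the Fisher information by a sum of gradient outer products is a common choice that may differ from the paper's exact definition.

```latex
% Assumed notation: g_i = per-sample gradient, \lambda > 0 = regularizer, k = budget.
F(S) = \log\det\Big(I + \tfrac{1}{\lambda}\sum_{i \in S} g_i g_i^{\top}\Big),
\qquad F(\emptyset) = 0,
\qquad
F(S_{\mathrm{greedy}}) \;\ge\; \Big(1 - \tfrac{1}{e}\Big)\max_{|S| \le k} F(S).
```

The bound on the right is the classical greedy guarantee for monotone submodular functions under a cardinality constraint.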
Understanding the Challenge
As LLMs grow in size and complexity, optimizing their training becomes increasingly critical. The main challenges identified include:
- Gradient Conflicts: Misalignment between per-sample gradients speeds up the decay of marginal log-determinant information gains, so each additional sample contributes less.
- Loss of Information: Selection criteria that ignore these conflicts end up with subsets that carry less log-determinant information than a conflict-aware selector achieves.
- Computational Efficiency: Training large models on vast datasets is resource-intensive, necessitating methods that can reduce data requirements without sacrificing performance.
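To make the notion of decaying marginal gains concrete, here is a minimal sketch of plain greedy log-determinant selection, the baseline that the challenges above refer to. It is not the paper's implementation: the gradient matrix `G`, the regularizer `reg`, and both function names are assumptions made for illustration.

```python
import numpy as np

def logdet_info(G, idx, reg=1.0):
    """Return log det(reg*I + sum_{i in idx} g_i g_i^T), where g_i are rows of G."""
    d = G.shape[1]
    M = reg * np.eye(d)
    if len(idx) > 0:
        sub = G[list(idx)]
        M = M + sub.T @ sub
    _, value = np.linalg.slogdet(M)
    return value

def greedy_logdet(G, k, reg=1.0):
    """Greedily pick k rows of G; return (selected indices, marginal gains)."""
    n, d = G.shape
    M_inv = np.eye(d) / reg              # inverse of the current information matrix
    available = np.ones(n, dtype=bool)
    selected, gains = [], []
    for _ in range(k):
        # Matrix determinant lemma: gain_i = log(1 + g_i^T M^{-1} g_i)
        gain = np.log1p(np.sum((G @ M_inv) * G, axis=1))
        gain[~available] = -np.inf       # never pick the same sample twice
        best = int(np.argmax(gain))
        selected.append(best)
        gains.append(float(gain[best]))
        available[best] = False
        # Sherman-Morrison rank-one update of the inverse
        u = M_inv @ G[best]
        M_inv -= np.outer(u, u) / (1.0 + G[best] @ u)
    return selected, gains
```

Because the objective is submodular, the recorded `gains` never increase from one step to the next; how quickly they shrink is the decay that gradient conflicts are said to accelerate.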
The SPICE Framework
SPICE stands for Submodular Penalized Information-Conflict Selection. It is a conflict-aware selector for instruction-tuning data, motivated by an ε-decomposition that bounds the deviation from ideal submodularity in terms of conflict statistics. Key features of SPICE include:
- Conflict Awareness: SPICE maximizes information while penalizing misalignment between per-sample gradients.
- Early Stopping: The selector supports early stopping, which makes selection more efficient.
- Proxy Models: SPICE can use proxy models to make selection more efficient.
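The features above can be illustrated with a small conflict-penalized greedy loop. This is a sketch under heavy assumptions, not SPICE itself: the source does not give the penalty or the stopping rule, so the conflict term (mean negative cosine similarity to already-selected gradients), the weight `lam`, the threshold `tol`, and the function name are all hypothetical. The rows of `G` stand in for per-sample gradients, which could come from a proxy model.

```python
import numpy as np

def conflict_aware_greedy(G, k, lam=0.5, tol=1e-3, reg=1.0):
    """Greedy selection with score = log-det marginal gain - lam * conflict.

    The conflict term and the early-stopping rule are illustrative
    assumptions, not the definitions used in the paper.
    """
    n, d = G.shape
    # Unit-normalised gradients, used only for cosine similarities.
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    U = G / np.maximum(norms, 1e-12)
    M_inv = np.eye(d) / reg
    available = np.ones(n, dtype=bool)
    conflict_sum = np.zeros(n)   # running sum of max(0, -cos) vs. selected samples
    selected = []
    for _ in range(k):
        gain = np.log1p(np.sum((G @ M_inv) * G, axis=1))
        penalty = conflict_sum / max(len(selected), 1)
        score = np.where(available, gain - lam * penalty, -np.inf)
        best = int(np.argmax(score))
        if score[best] < tol:    # early stopping: no candidate is worth adding
            break
        selected.append(best)
        available[best] = False
        # Sherman-Morrison update of the inverse information matrix
        u = M_inv @ G[best]
        M_inv -= np.outer(u, u) / (1.0 + G[best] @ u)
        # Accumulate how strongly every sample opposes the newly selected one
        conflict_sum += np.maximum(0.0, -(U @ U[best]))
    return selected

# Toy example: sample 1 points almost exactly opposite to sample 0.
G = np.array([[2.0, 0.0], [-1.9, 0.0], [0.0, 0.5]])
print(conflict_aware_greedy(G, 2, lam=0.0))  # no penalty
print(conflict_aware_greedy(G, 2, lam=1.0))  # penalty steers away from sample 1
```

With `lam=0.0` the loop reduces to plain greedy log-determinant selection; raising `lam` trades raw information gain for better-aligned gradients, and a larger `tol` stops selection earlier.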
Empirical Results
In empirical evaluations with LLaMA2-7B and Qwen2-7B across eight benchmarks, the paper reports:
- Selected subsets with higher log-determinant information than those chosen by the original criteria.
- Performance that matches or exceeds six methods, including full-data tuning, while using only 10% of the data.
- Substantially lower training cost as a result of tuning on the smaller subset.
Conclusion
SPICE shows that accounting for gradient conflicts makes information-based data selection more effective. By penalizing misalignment while maximizing log-determinant information, it selects a 10% subset that matches or exceeds full-data tuning on the reported benchmarks, at substantially lower training cost.
