Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning
Summary: arXiv:2505.16176v2
Abstract: In mathematical reasoning, data selection strategies predominantly rely on static, externally defined metrics, which fail to adapt to the evolving capabilities of models during training. This misalignment limits the efficiency of Supervised Fine-Tuning and Reinforcement Learning. To bridge this gap, we introduce SAI-DPO (Self-Aware Iterative Data Persistent Optimization), a dynamic sampling framework that aligns training data with the model’s intrinsic competence.
SAI-DPO operationalizes two novel metrics:
- Knowledge Semantic Alignment: This metric targets domain weaknesses by aligning the training data with areas where the model is underperforming.
- Self-Aware Difficulty: Derived from pass rates and reasoning path characteristics, this metric gauges instance complexity relative to the model’s current state.
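The two metrics can be illustrated with a minimal sketch. This is not the authors' implementation: the `Attempt` record, the use of step counts as the "reasoning path characteristic", the `max_steps` and `step_weight` parameters, and the use of plain topic labels in place of semantic alignment are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Attempt:
    """One sampled solution to a problem (hypothetical record type)."""
    correct: bool   # did the final answer match the reference?
    num_steps: int  # reasoning steps in this sampled solution


def self_aware_difficulty(
    attempts: Sequence[Attempt],
    max_steps: int = 20,       # assumed cap used to normalize step counts
    step_weight: float = 0.3,  # assumed mixing weight for the path term
) -> float:
    """Score in [0, 1]; higher means harder for the current model."""
    if not attempts:
        raise ValueError("need at least one sampled attempt")
    pass_rate = sum(a.correct for a in attempts) / len(attempts)
    mean_steps = sum(a.num_steps for a in attempts) / len(attempts)
    path_term = min(mean_steps / max_steps, 1.0)
    # Blend failure rate with normalized solution length.
    return (1 - step_weight) * (1 - pass_rate) + step_weight * path_term


def topic_weakness(results: Sequence[Tuple[str, bool]]) -> Dict[str, float]:
    """Error rate per topic label, a simple stand-in for domain weakness."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for topic, correct in results:
        totals[topic] += 1
        if not correct:
            errors[topic] += 1
    return {topic: errors[topic] / totals[topic] for topic in totals}
```

Because both scores are computed from the model's own sampled answers, they change whenever the model changes, which is what separates them from static, externally defined metrics.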
SAI-DPO iteratively recalibrates the data distribution using real-time feedback from the model, so the training samples track the model’s evolving competence instead of following a fixed, precomputed selection. Keeping the data matched to the model’s current capability level is what makes the training process more effective.
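One way to picture the iterative recalibration is a loop that re-scores the problem pool each round and resamples from it. The sketch below is an assumption-laden illustration, not the paper's procedure: the Gaussian preference for mid-range difficulty (`target`, `width`), the `floor` term, and the `evaluate` and `train` callbacks are all hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def sampling_weights(
    difficulties: Sequence[float],  # per-problem difficulty in [0, 1]
    weaknesses: Sequence[float],    # per-problem domain weakness in [0, 1]
    target: float = 0.5,            # assumed "sweet spot" difficulty
    width: float = 0.25,            # assumed tolerance around the target
    floor: float = 0.05,            # keeps mastered domains from vanishing
) -> List[float]:
    """Weight problems that are near the target difficulty and in weak domains."""
    weights = []
    for d, w in zip(difficulties, weaknesses):
        closeness = math.exp(-(((d - target) / width) ** 2))
        weights.append((floor + w) * closeness)
    return weights


def run_rounds(
    problem_ids: Sequence[str],
    evaluate: Callable[[Sequence[str]], Tuple[Sequence[float], Sequence[float]]],
    train: Callable[[List[str]], None],
    rounds: int,
    batch_size: int,
    seed: int = 0,
) -> None:
    """Each round: re-score the pool with the current model, resample, train."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # `evaluate` is a hypothetical callback returning fresh scores.
        difficulties, weaknesses = evaluate(problem_ids)
        weights = sampling_weights(difficulties, weaknesses)
        batch = rng.choices(list(problem_ids), weights=weights, k=batch_size)
        # `train` is a hypothetical callback that updates the model.
        train(batch)
```

The point of the design is that the scores are recomputed inside the loop: a problem that was too hard in one round can move toward the target range once the model improves, and its sampling weight rises accordingly.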
Key Features of SAI-DPO
SAI-DPO changes how training data is selected for mathematical reasoning tasks. Its key features are:
- Dynamic Adaptation: Unlike traditional static sampling methods, SAI-DPO adapts to the model’s learning progress, so the sampled data reflects what the model currently needs.
- Real-Time Feedback Integration: The framework integrates real-time feedback to adjust the data distribution, thereby maintaining alignment with the model’s evolving capabilities.
- Enhanced Training Efficiency: According to the authors, models trained with SAI-DPO reach state-of-the-art performance with significantly less data, which lowers training cost.
Experimental Validation
Extensive experiments conducted on eight benchmarks, including AIME24 and AMC23, demonstrate that SAI-DPO outperforms static baselines by nearly 6 points on average. This substantial improvement highlights the effectiveness of dynamic sampling in enhancing model performance during training.
In conclusion, SAI-DPO addresses the limitations of static data selection in mathematical reasoning by aligning training data with the model’s intrinsic competence. The reported gains over static baselines suggest that dynamic, model-aware sampling is a promising way to make training more efficient.
