The Signal is in the Steps: Local Scoring for Reasoning Data Selection
arXiv:2510.03988v2
Abstract: Distilling long-form reasoning from teacher models into smaller students requires selecting which candidate solutions to train on. Recent work argues that one should select responses the student model assigns highest probability, i.e., favoring solutions “natural” to the student. However, we find that this approach works within a single teacher but fails when scaling to long reasoning traces from multiple diverse teachers. We identify a key cause: this approach scores entire solutions, but students generalize by recombining familiar reasoning steps, not by memorizing complete solutions. Full-trajectory scoring optimizes the wrong target; it rewards global fluency while the transferable signal lies in local step transitions. We propose Local Average Log Probability (LALP), which scores each reasoning step using only a small window of preceding context, measuring whether each step is justified by its immediate premises rather than whether the full response looks natural to the student. LALP enables two practical use cases: selecting the best teacher before fine-tuning and curating training data from diverse teacher pools. Across math, coding, and science reasoning tasks, LALP consistently improves accuracy when selecting the most natural solutions by a large margin.
Introduction
Training a smaller student model to reproduce the long-form reasoning of larger teacher models requires deciding which candidate solutions to train on. Recent work selects the responses to which the student assigns the highest probability, on the grounds that such solutions are "natural" to the student. According to the authors, this works when all candidates come from a single teacher but fails when the pool contains long reasoning traces from multiple diverse teachers.
Challenges with Current Methods
The authors trace this failure to full-trajectory scoring, which assigns a single score to an entire solution. Such a score rewards global fluency, while the signal that transfers lies in the local transitions between reasoning steps. The argument is that students generalize by recombining familiar reasoning steps, not by memorizing complete solutions, so scoring whole trajectories optimizes the wrong target.
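For contrast with what follows, full-trajectory scoring can be sketched as the mean token log probability of a whole response given the prompt. This is an illustrative sketch, not the authors' code: the `TokenLogProbs` interface and the `toy_token_logprobs` stand-in for a student model are hypothetical names introduced here.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): given context
# tokens and target tokens, return one log probability per target token,
# each conditioned on the context plus the earlier target tokens.
TokenLogProbs = Callable[[Sequence[str], Sequence[str]], List[float]]


def global_avg_logprob(
    prompt: Sequence[str],
    response: Sequence[str],
    token_logprobs: TokenLogProbs,
) -> float:
    """Score a whole response with a single mean token log probability."""
    if not response:
        raise ValueError("response must contain at least one token")
    log_probs = token_logprobs(prompt, response)
    return sum(log_probs) / len(log_probs)


def toy_token_logprobs(context: Sequence[str], target: Sequence[str]) -> List[float]:
    """Toy stand-in for a student model: tokens already seen are 'likely'."""
    seen = set(context)
    scores = []
    for token in target:
        scores.append(math.log(0.9 if token in seen else 0.1))
        seen.add(token)
    return scores
```

In practice the toy scorer would be replaced by per-token log probabilities from the student model; the point is only that the entire response receives one number.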
Introducing Local Average Log Probability (LALP)
To address this, the authors propose Local Average Log Probability (LALP). LALP scores each reasoning step using only a small window of preceding context. It therefore measures whether each step is justified by its immediate premises, rather than whether the full response looks natural to the student.
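A minimal sketch of this idea follows, under several assumptions that the summary does not settle: the solution is already split into steps, the window is counted in whole preceding steps, the problem statement is not added to the context, and step scores are combined by a plain mean. The function names and the toy scorer are hypothetical; a real implementation would obtain token log probabilities from the student model.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption): one log probability per target
# token, conditioned on the context plus the earlier target tokens.
TokenLogProbs = Callable[[Sequence[str], Sequence[str]], List[float]]


def lalp_score(
    steps: Sequence[Sequence[str]],
    token_logprobs: TokenLogProbs,
    window: int = 1,
) -> float:
    """Average per-step mean log probability, each step seeing only a local window."""
    if window < 0:
        raise ValueError("window must be non-negative")
    step_scores = []
    for i, step in enumerate(steps):
        if not step:
            continue  # skip empty steps
        # Context is limited to the `window` steps immediately before step i.
        context = [tok for prev in steps[max(0, i - window):i] for tok in prev]
        log_probs = token_logprobs(context, step)
        step_scores.append(sum(log_probs) / len(log_probs))
    if not step_scores:
        raise ValueError("need at least one non-empty step")
    return sum(step_scores) / len(step_scores)


def toy_token_logprobs(context: Sequence[str], target: Sequence[str]) -> List[float]:
    """Toy stand-in for a student model: tokens already seen are 'likely'."""
    seen = set(context)
    scores = []
    for token in target:
        scores.append(math.log(0.9 if token in seen else 0.1))
        seen.add(token)
    return scores


steps = [["x", "=", "2"], ["x", "+", "1"], ["=", "3"]]
narrow = lalp_score(steps, toy_token_logprobs, window=1)
wide = lalp_score(steps, toy_token_logprobs, window=2)
```

With the toy scorer, widening the window from one step to two raises the score of the last step, because the "=" token from the first step becomes visible. The window size is thus the knob that trades local justification against global context.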
Practical Applications of LALP
LALP enables two practical use cases:
- Selecting the Best Teacher: Before fine-tuning, LALP can be used to choose which teacher model's solutions the student should be trained on.
- Curating Training Data: LALP can be used to select training examples from a pool of solutions produced by diverse teachers.
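Both use cases reduce to ranking by a per-solution score. The sketch below assumes scores have already been computed (for example, LALP under the student model). The `Candidate` record, the function names, ranking teachers by mean score, and keeping the top candidates per problem are illustrative assumptions; the summary does not specify how the authors aggregate scores.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one teacher solution to one problem."""
    problem_id: str
    teacher: str
    score: float  # precomputed selection score, higher is better


def best_teacher(candidates: Sequence[Candidate]) -> str:
    """Pick the teacher whose solutions have the highest mean score (assumed rule)."""
    by_teacher: Dict[str, List[float]] = defaultdict(list)
    for cand in candidates:
        by_teacher[cand.teacher].append(cand.score)
    return max(by_teacher, key=lambda name: mean(by_teacher[name]))


def curate(candidates: Sequence[Candidate], per_problem: int = 1) -> List[Candidate]:
    """Keep the top-scoring candidates for each problem, across all teachers."""
    by_problem: Dict[str, List[Candidate]] = defaultdict(list)
    for cand in candidates:
        by_problem[cand.problem_id].append(cand)
    selected: List[Candidate] = []
    for problem_id in sorted(by_problem):
        ranked = sorted(by_problem[problem_id], key=lambda c: c.score, reverse=True)
        selected.extend(ranked[:per_problem])
    return selected


pool = [
    Candidate("p1", "A", -1.0),
    Candidate("p1", "B", -0.5),
    Candidate("p2", "A", -0.4),
    Candidate("p2", "B", -2.0),
]
chosen_teacher = best_teacher(pool)
curated = curate(pool, per_problem=1)
```

In this toy pool, teacher A has the better mean score overall, yet the curated set takes the p1 solution from teacher B. That illustrates why per-problem curation from a mixed pool can differ from committing to a single teacher.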
Results and Conclusion
Across math, coding, and science reasoning tasks, the authors report that LALP consistently improves accuracy by a large margin over selection based on whole-solution probability. The results support selecting reasoning data by the quality of local step transitions rather than by the fluency of complete responses.
