LLM Reasoning with Process Rewards for Outcome-Guided Steps
Summary: arXiv:2604.02341v1
Abstract: Reinforcement learning (RL) with verifiable rewards has driven much of the recent progress in mathematical reasoning for large language models (LLMs). Because final answers can be checked automatically, they serve as dependable training signals. Traditional pipelines, however, optimize mainly for outcome correctness, which yields sparse feedback on intricate, multi-step solutions and little insight into intermediate reasoning errors.
To address this, recent research has introduced process reward models (PRMs), which score intermediate reasoning steps and so provide denser supervision throughout the solution. In practice, however, PRM scores are often misaligned with final correctness and occasionally reward reasoning that reads fluently yet ends in an incorrect answer. Treated as absolute rewards, these signals can amplify such fluent failure modes and lead to reward hacking.
Introducing PROGRS
In response to these limitations, we propose PROGRS (Process Rewards for Outcome-Guided Reasoning Steps), a framework that uses PRMs while keeping outcome correctness dominant. Rather than treating process rewards as absolute targets, PROGRS treats them as relative preferences within outcome groups.
- Outcome-Conditioned Centering: PROGRS shifts the PRM scores of incorrect trajectories so that they have zero mean within each prompt group. This removes systematic bias while retaining the informative rankings.
- Combination of Evaluation Methods: PROGRS combines a frozen quantile-regression PRM with a multi-scale coherence evaluator, and does so without auxiliary objectives or additional trainable components.
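The centering step described in the first bullet can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the function name `center_incorrect_scores`, the one-score-per-rollout layout, and the choice to leave correct rollouts untouched are all assumptions made for the example.

```python
from statistics import mean


def center_incorrect_scores(prm_scores, is_correct):
    """Center PRM scores of incorrect rollouts within one prompt group.

    prm_scores: one PRM score per rollout sampled for the same prompt.
    is_correct: matching booleans from the final-answer verifier.

    Assumption of this sketch: scores of correct rollouts are returned
    unchanged; the source only describes centering for incorrect ones.
    """
    wrong = [s for s, ok in zip(prm_scores, is_correct) if not ok]
    if not wrong:
        # No incorrect rollouts in this group, so nothing to center.
        return list(prm_scores)
    offset = mean(wrong)
    # Subtracting a constant keeps the ordering among incorrect rollouts
    # but removes any group-wide bias in their scores.
    return [s if ok else s - offset for s, ok in zip(prm_scores, is_correct)]


# Hypothetical prompt group: one correct rollout, three incorrect ones.
scores = [0.875, 0.5, 0.25, 0.75]
correct = [True, False, False, False]
centered = center_incorrect_scores(scores, correct)
```

After centering, the incorrect rollouts sum to zero, so a PRM that scores every wrong solution highly no longer hands them a net positive reward, while the better-looking wrong solutions still rank above the worse ones.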
Performance and Results
Evaluations on multiple datasets, including MATH-500, AMC, AIME, MinervaMath, and OlympiadBench, show that PROGRS consistently improves Pass@1 over outcome-only baselines. It also reaches this accuracy with fewer rollouts, which indicates more efficient learning.
These results indicate that outcome-conditioned centering allows process rewards to be used safely and effectively for mathematical reasoning. By drawing on intermediate-step signals while keeping final outcomes dominant, PROGRS points toward more robust and reliable systems for complex reasoning tasks.
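For readers unfamiliar with the metric, Pass@1 can be computed as follows. The sketch uses the widely used unbiased pass@k estimator and averages it over problems; the source does not say how many samples per problem were drawn or which estimator was used, so the function names and the sample counts below are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: how many were correct, k: budget.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """Average pass@1 over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical benchmark of three problems with 8 samples each.
score = benchmark_pass_at_1([(8, 2), (8, 8), (8, 0)])
```

For k = 1 the estimator reduces to c / n, the fraction of sampled solutions that are correct.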
Conclusion
PROGRS shows that process rewards can support outcome correctness in mathematical reasoning rather than compete with it. By keeping outcome correctness dominant while still evaluating intermediate reasoning, researchers can build systems that not only give accurate answers but also reason clearly and logically on the way to them.
