PROGRS: Enhancing LLM Reasoning with Process Rewards

LLM Reasoning with Process Rewards for Outcome-Guided Steps

Source: arXiv:2604.02341v1 (announce type: cross)

Abstract: Reinforcement learning (RL) with verifiable rewards has driven much of the recent progress in mathematical reasoning for large language models (LLMs). Because final answers can be checked automatically, they serve as a dependable training signal. Standard pipelines, however, optimize mainly for outcome correctness. This gives sparse feedback on long, multi-step solutions and little insight into where the intermediate reasoning went wrong.
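To make the sparsity concrete, here is a minimal sketch of an outcome-only verifiable reward. The names `normalize_answer` and `outcome_reward`, the crude string normalization, and the toy rollouts are illustrative assumptions, not the verifier used in the paper.

```python
def normalize_answer(ans: str) -> str:
    """Crude normalization (assumption): trim, drop a trailing period, remove spaces, lowercase."""
    return ans.strip().rstrip(".").replace(" ", "").lower()


def outcome_reward(final_answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if normalize_answer(final_answer) == normalize_answer(reference) else 0.0


# Toy problem: solve 2x = 10, then report x + 1. The reference answer is 6.
# The first two rollouts fail at different steps, yet both receive 0.0,
# so the reward carries no information about where each one went wrong.
rollouts = [
    {"steps": ["2x = 10", "x = 5", "x + 1 = 7"], "answer": "7"},    # slip in the last step
    {"steps": ["2x = 10", "x = 20", "x + 1 = 21"], "answer": "21"},  # error in the second step
    {"steps": ["2x = 10", "x = 5", "x + 1 = 6"], "answer": " 6. "},  # correct
]
rewards = [outcome_reward(r["answer"], "6") for r in rollouts]
print(rewards)
```

Only the final answer is inspected, so a nearly correct solution and a badly wrong one are indistinguishable to the learner.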

Process reward models (PRMs) were introduced to address this: they score intermediate reasoning steps and so provide denser supervision. In practice, though, PRM scores are often misaligned with final correctness and can reward reasoning that reads fluently but ends in a wrong answer. Optimizing such scores as absolute rewards can amplify these fluent failure modes and lead to reward hacking.

Introducing PROGRS

In response to these limitations, we propose PROGRS (Process Rewards for Outcome-Guided Reasoning Steps), a framework that uses PRMs while keeping outcome correctness dominant. Rather than treating process rewards as absolute targets, PROGRS treats them as relative preferences within outcome groups.

  • Outcome-Conditioned Centering: PROGRS shifts the PRM scores of incorrect trajectories so that they have zero mean within each prompt group. This removes systematic bias while preserving informative rankings.
  • Combination of Evaluation Methods: PROGRS combines a frozen quantile-regression PRM with a multi-scale coherence evaluator. It requires no auxiliary objectives or additional trainable components.
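The centering step in the first bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `center_incorrect` is hypothetical, the scores are made-up per-trajectory PRM values, and leaving the scores of correct rollouts unchanged is a choice of this sketch only.

```python
from statistics import fmean


def center_incorrect(prm_scores, is_correct):
    """Shift PRM scores of incorrect rollouts to zero mean within one prompt group.

    Scores of correct rollouts are returned unchanged (an assumption of this sketch).
    """
    wrong = [s for s, ok in zip(prm_scores, is_correct) if not ok]
    if not wrong:
        return list(prm_scores)
    mean_wrong = fmean(wrong)
    return [s if ok else s - mean_wrong for s, ok in zip(prm_scores, is_correct)]


# One prompt, four sampled rollouts (made-up numbers). The PRM rates the
# incorrect rollouts highly, i.e. fluent but wrong reasoning.
prm_scores = [0.9, 0.8, 0.6, 0.7]
is_correct = [False, False, False, True]

centered = center_incorrect(prm_scores, is_correct)
print(centered)
```

After centering, the incorrect rollouts sum to zero as a group, so a uniformly optimistic PRM can no longer raise their average reward, while their relative order (first above second above third) is unchanged.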

Performance and Results

Across MATH-500, AMC, AIME, MinervaMath, and OlympiadBench, PROGRS consistently improves Pass@1 accuracy over outcome-only baselines. It reaches these gains with fewer rollouts, which indicates more efficient learning.
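For readers unfamiliar with the metric, Pass@1 is the probability that a single sampled solution is correct. When several samples are drawn per problem, it can be estimated as the per-problem fraction of correct samples, averaged over problems. The sketch below uses a hypothetical helper `pass_at_1` and made-up results; it does not reproduce the paper's evaluation setup.

```python
def pass_at_1(samples_per_problem):
    """Estimate Pass@1 from per-problem lists of booleans (True = correct sample)."""
    per_problem = [sum(samples) / len(samples) for samples in samples_per_problem]
    return sum(per_problem) / len(per_problem)


# Three problems, four sampled solutions each (made-up outcomes).
results = [
    [True, True, False, True],     # 3 of 4 correct
    [False, False, False, False],  # 0 of 4 correct
    [True, False, False, False],   # 1 of 4 correct
]
score = pass_at_1(results)
print(f"Pass@1 = {score:.3f}")
```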

These results suggest that outcome-conditioned centering enables safe and effective use of process rewards for mathematical reasoning. By drawing on intermediate-step scores while keeping final outcomes dominant, PROGRS points toward more robust and reliable systems for complex reasoning tasks.

Conclusion

PROGRS shows that process rewards can be combined with outcome rewards without inviting reward hacking. Balancing outcome correctness against the evaluation of intermediate reasoning supports models that give accurate answers and also reason clearly and logically on the way to them.


Lazarus Omolua (https://richlyai.com/blog)