Optimizing LLM Reasoning with RePro Method

Rectifying LLM Thought from Lens of Optimization

Source: arXiv:2512.01925v2

Abstract: Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance.

In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training.
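One way to picture the analogy is to place a standard gradient descent update next to a chain of reasoning steps. The notation below is illustrative and not taken from the paper: $s_t$ stands for the reasoning state after step $t$, $\Delta_t$ for the change contributed by that step, and $J$ for some objective that measures how far the chain is from a resolution.

```latex
% Gradient descent on parameters \theta with step size \eta:
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta f(\theta_t)

% Reasoning read as an update sequence (illustrative notation):
s_{t+1} = s_t + \Delta_t ,
\qquad
J(s_{t+1}) \le J(s_t) \quad \text{for a productive step}
```

Under this reading, a chain that keeps adding steps without lowering $J$ resembles an optimizer that has stopped making progress, which is one way to think about overthinking.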

Introduction to RePro

RePro defines a surrogate objective function to assess the optimization process underlying CoT, and uses a dual scoring mechanism to quantify that process's intensity and stability. The two scores are aggregated into a composite process-level reward, which is integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs.
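To make the dual scoring idea concrete, here is a minimal sketch that scores a sequence of surrogate objective values, one value per reasoning step. The formulas are assumptions chosen for illustration and are not the definitions used in the paper: `intensity_score` measures the relative net decrease of the objective, `stability_score` measures the fraction of steps that do not make it worse, and `weight` is a hypothetical mixing knob.

```python
from typing import Sequence


def intensity_score(values: Sequence[float]) -> float:
    """Relative net decrease of the objective, clipped to [0, 1] (assumed form)."""
    if len(values) < 2 or values[0] <= 0:
        return 0.0
    drop = (values[0] - values[-1]) / values[0]
    return max(0.0, min(1.0, drop))


def stability_score(values: Sequence[float]) -> float:
    """Fraction of steps that do not increase the objective (assumed form)."""
    if len(values) < 2:
        return 0.0
    steps = list(zip(values, values[1:]))
    good = sum(1 for prev, cur in steps if cur <= prev)
    return good / len(steps)


def process_reward(values: Sequence[float], weight: float = 0.5) -> float:
    """Weighted mix of the two scores; `weight` is a hypothetical parameter."""
    return weight * intensity_score(values) + (1.0 - weight) * stability_score(values)


steady = [4.0, 3.0, 2.0, 1.0]           # objective falls at every step
wandering = [4.0, 5.0, 3.0, 4.5, 1.0]   # same end point, but it oscillates

print(process_reward(steady))     # 0.875
print(process_reward(wandering))  # 0.625
```

Both chains end at the same objective value, so their intensity is equal. The oscillating chain scores lower only because of its stability term, which is the kind of distinction a process-level reward can make and an outcome-only reward cannot.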

Key Features of RePro

Surrogate Objective Function: Provides a measurable way to evaluate the effectiveness of reasoning processes.
Dual Scoring Mechanism: Quantifies both the intensity and stability of the CoT reasoning steps.
Composite Process-level Reward: An aggregated score that aids in optimizing the reasoning capabilities of LLMs.
Integration with RLVR: The composite reward plugs into reinforcement learning with verifiable rewards (RLVR) pipelines used to post-train LLMs.
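As a rough sketch of how a process-level reward might sit alongside a verifiable outcome reward, the snippet below adds a scaled process score to a 0/1 correctness signal. The additive form, the `Rollout` type, and the coefficient `alpha` are assumptions made for illustration. The article does not specify how RePro combines the two signals.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool   # result of the verifiable check (e.g. answer match)
    process_score: float   # process-level score, assumed to lie in [0, 1]


def combined_reward(rollout: Rollout, alpha: float = 0.2) -> float:
    """Outcome reward plus a scaled process reward (hypothetical combination)."""
    outcome = 1.0 if rollout.answer_correct else 0.0
    return outcome + alpha * rollout.process_score


rollouts = [
    Rollout(answer_correct=True, process_score=0.9),   # correct, clean reasoning
    Rollout(answer_correct=True, process_score=0.4),   # correct, erratic reasoning
    Rollout(answer_correct=False, process_score=0.9),  # wrong answer
]
rewards = [combined_reward(r) for r in rollouts]
```

With `alpha` below 1 and process scores in [0, 1], a correct rollout always outranks an incorrect one, so the process term only separates rollouts that have the same outcome.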

Experimental Validation

Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs were conducted to validate the effectiveness of RePro. These evaluations were performed on benchmarks that spanned various domains, including:

Mathematics
Science
Coding

The results consistently showed that RePro enhances reasoning performance and mitigates suboptimal reasoning behaviors. By refining the CoT process, RePro led LLMs to better outcomes on problem-solving tasks.

Conclusion

In summary, RePro refines the reasoning capabilities of large language models by viewing the reasoning process through the lens of optimization. The method improves performance and addresses common pitfalls of long chain-of-thought reasoning, such as overthinking and excessively long reasoning chains. The work also opens avenues for further research on optimizing LLM reasoning.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
