Rectifying LLM Thought from Lens of Optimization
arXiv:2512.01925v2
Abstract: Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance.
In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training.
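The gradient descent framing can be written out as a short worked analogy. This is an illustrative sketch only: the symbols $s_t$, $\Delta_t$, and $\tilde{J}$ are assumptions introduced here for exposition and are not taken from the paper.

```latex
% Standard gradient descent on parameters \theta with step size \eta:
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)

% Assumed CoT analogue: s_t is the reasoning state after step t,
% \Delta_t is the change contributed by reasoning step t, and
% \tilde{J} is a hypothetical surrogate objective (lower is better).
s_{t+1} = s_t + \Delta_t, \qquad
\tilde{J}(s_{t+1}) < \tilde{J}(s_t) \ \text{for a productive step}
```

Under this reading, overthinking corresponds to many steps that leave $\tilde{J}$ essentially unchanged, and unstable reasoning corresponds to steps that repeatedly raise it.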
Introduction to RePro
RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs.
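To make the dual scoring idea concrete, here is a minimal sketch, assuming the surrogate objective has already been evaluated after each reasoning step (lower is better). The function names, the specific definitions of intensity and stability, and the equal weights are hypothetical choices for illustration, not the paper's formulas.

```python
from typing import Sequence


def intensity_score(objective: Sequence[float]) -> float:
    """Assumed definition: average per-step decrease of the surrogate objective."""
    steps = len(objective) - 1
    if steps < 1:
        return 0.0
    return (objective[0] - objective[-1]) / steps


def stability_score(objective: Sequence[float]) -> float:
    """Assumed definition: fraction of steps that do not increase the objective."""
    deltas = [b - a for a, b in zip(objective, objective[1:])]
    if not deltas:
        return 1.0
    return sum(d <= 0 for d in deltas) / len(deltas)


def process_reward(
    objective: Sequence[float],
    w_intensity: float = 0.5,  # hypothetical weight
    w_stability: float = 0.5,  # hypothetical weight
) -> float:
    """Aggregate the two scores into one composite process-level reward."""
    return (
        w_intensity * intensity_score(objective)
        + w_stability * stability_score(objective)
    )


steady = [4.0, 3.0, 2.0, 1.0]            # monotone descent
oscillating = [4.0, 5.0, 3.0, 4.0, 1.0]  # same end point, with backtracking
print(process_reward(steady), process_reward(oscillating))
```

Under these assumed definitions, the steadily descending trace scores 1.0 and the oscillating trace scores 0.625, so a chain that reaches the same answer with fewer detours receives the higher process-level reward.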
Key Features of RePro
- Surrogate Objective Function: Provides a measurable way to evaluate the effectiveness of reasoning processes.
- Dual Scoring Mechanism: Quantifies both the intensity and stability of the CoT reasoning steps.
- Composite Process-level Reward: Aggregates the intensity and stability scores into a single reward signal for the reasoning process.
- Integration with RLVR: The process-level reward plugs into reinforcement learning with verifiable rewards (RLVR) pipelines used to post-train LLMs.
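One plausible way to combine a process-level reward with a verifiable outcome reward is sketched below. Everything here is an assumption for illustration: the additive mixing, the weight `alpha`, and the group-normalized advantage (a scheme used by group-relative policy optimization methods) are stand-ins, since the text above does not specify how the integration is done.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(is_correct: bool, process_score: float, alpha: float = 0.1) -> float:
    """Verifiable outcome reward plus a weighted process-level term (assumed form)."""
    outcome = 1.0 if is_correct else 0.0
    return outcome + alpha * process_score


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within a group of responses sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same prompt: (answer verified?, process score).
samples = [(True, 1.0), (True, 0.5), (False, 0.8)]
rewards = [combined_reward(ok, score) for ok, score in samples]
advantages = group_advantages(rewards)
print(advantages)
```

With a small `alpha`, correctness still dominates the ranking, while the process term separates two correct responses that differ in how cleanly they reached the answer.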
Experimental Validation
Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs were conducted to validate the effectiveness of RePro. These evaluations were performed on benchmarks that spanned various domains, including:
- Mathematics
- Science
- Coding
The results consistently demonstrated that RePro enhances reasoning performance and mitigates suboptimal reasoning behaviors such as overthinking. With the CoT process shaped by the process-level reward, LLMs exhibited improved outcomes in problem-solving tasks.
Conclusion
In summary, RePro refines the reasoning capabilities of large language models. By viewing the reasoning process through the lens of optimization, we developed a method that not only improves performance but also addresses common pitfalls of long chain-of-thought reasoning, such as overthinking and excessively long reasoning chains. This work opens avenues for further research on optimizing LLM reasoning.
