Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
arXiv:2508.07629v4
Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to reasoning models in the current community, there are still many problems with reproducing high-performance reasoning models due to incomplete disclosure of training details.
Introduction
Klear-Reasoner is a reasoning model built for long, deliberate problem solving, and it reports strong results on math and coding benchmarks. This article covers the model's post-training workflow, the findings on SFT data and RL clipping, and the proposed Gradient-Preserving Clipping Policy Optimization (GPPO).
Training Process
The Klear-Reasoner post-training workflow has four parts:
- Data Preparation: Curating the high-quality datasets used for training.
- Long Chain-of-Thought Supervised Fine-Tuning (long CoT SFT): Supervised fine-tuning on long chain-of-thought reasoning traces.
- Reinforcement Learning (RL): Further training with RL after the SFT stage.
- Ablation Studies: Evaluations of each experimental component to measure its effect on performance.
Findings
Our experiments revealed several critical insights regarding the SFT data:
- A small number of high-quality data sources was more effective than a larger variety of less reliable sources.
- Challenging samples yielded better results even without accuracy filtering, which suggests that sample difficulty itself contributes to learning.
Challenges in Current Clipping Mechanisms
We identified two major issues with existing clipping mechanisms in reinforcement learning:
- Clipping suppresses critical exploration signals.
- Clipping ignores suboptimal trajectories, so the model learns little from them even though they carry useful signal.
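For reference, both issues can be traced to the standard PPO clipped surrogate. The formula below is the commonly used form, not one quoted from this article, and the symbols (probability ratio r_t, advantage estimate \hat{A}_t, clip range \epsilon) follow usual PPO notation and are assumptions here:

```latex
J^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

\nabla_\theta\big[\text{per-token term}\big] =
\begin{cases}
0, & \hat{A}_t > 0 \ \text{and}\ r_t(\theta) > 1+\epsilon \\
0, & \hat{A}_t < 0 \ \text{and}\ r_t(\theta) < 1-\epsilon \\
r_t(\theta)\,\hat{A}_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t), & \text{otherwise}
\end{cases}
```

Read this way, the first zero-gradient case is a rewarded token whose probability has already moved past the upper bound, which is one plausible source of the suppressed exploration signal. The second is a penalized token whose probability has already dropped below the lower bound, so the trajectory containing it contributes nothing further.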
Gradient-Preserving Clipping Policy Optimization (GPPO)
To address these issues, we propose GPPO, which gently backpropagates gradients from clipped tokens instead of discarding them. This improves the model's exploration capacity and its efficiency in learning from negative samples.
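The difference from standard clipping can be illustrated per token. The sketch below is a minimal pure-Python illustration, not the authors' implementation: it assumes the gradient is preserved with a stop-gradient rescaling, where the clipped ratio is written as (bound / sg(ratio)) * ratio, so the forward value stays at the bound while the derivative with respect to the token log-probability becomes bound * advantage. The function names and the default clip ranges are hypothetical.

```python
def ppo_clip_token(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Standard PPO clipping for one token.

    Returns (objective value, d objective / d log-prob).
    Uses d ratio / d log-prob = ratio.
    """
    unclipped = ratio * advantage
    bounded = min(max(ratio, 1 - eps_low), 1 + eps_high)
    clipped = bounded * advantage
    if unclipped <= clipped:
        # min() selects the unclipped term, so the gradient flows normally.
        return unclipped, ratio * advantage
    # min() selects the clipped term, which is constant in the policy:
    # the token contributes no gradient at all.
    return clipped, 0.0


def gradient_preserving_token(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Assumed gradient-preserving variant for one token.

    Same forward value as ppo_clip_token, but a clipped token keeps a
    bounded gradient instead of a zero one (stop-gradient rescaling).
    """
    unclipped = ratio * advantage
    bounded = min(max(ratio, 1 - eps_low), 1 + eps_high)
    clipped = bounded * advantage
    if unclipped <= clipped:
        return unclipped, ratio * advantage
    # Treat the clipped ratio as (bounded / sg(ratio)) * ratio:
    # value is bounded * advantage, derivative is bounded * advantage.
    return clipped, bounded * advantage


if __name__ == "__main__":
    # (ratio, advantage): exploratory positive token, negative token, in-range token
    for ratio, advantage in [(1.5, 1.0), (0.5, -1.0), (1.0, 2.0)]:
        print(ratio, advantage,
              ppo_clip_token(ratio, advantage),
              gradient_preserving_token(ratio, advantage))
```

Under this assumption the gradient magnitude of a clipped token is capped by the clip bound rather than growing with the raw ratio, which is one way to read "gentle" backpropagation. Unclipped tokens behave identically in both functions.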
Performance Metrics
Klear-Reasoner reports the following scores on math (AIME) and coding (LiveCodeBench) benchmarks:
- AIME 2024: 90.5%
- AIME 2025: 83.2%
- LiveCodeBench V5: 66.0%
- LiveCodeBench V6: 58.1%
Conclusion
Klear-Reasoner pairs a detailed account of its post-training workflow with GPPO, a clipping scheme that keeps gradients from clipped tokens. Its scores on AIME 2024, AIME 2025, and LiveCodeBench V5 and V6 indicate that the approach carries over to both mathematical and coding problems.
