Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
arXiv:2603.26535v1
Abstract: Process-Aware Policy Optimization (PAPO) integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization. It targets two limitations of existing reward designs: outcome rewards cannot distinguish reasoning quality among correct responses, and process rewards used directly invite reward hacking.
Introduction
Reinforcement learning for reasoning models often relies on Outcome Reward Models (ORM), which assess only the correctness of the final answer and ignore the quality of the reasoning behind it. Every correct response therefore receives the same reward. In GRPO, where each response's advantage is computed relative to the other responses sampled for the same prompt, this has a further cost: as groups of sampled responses become uniformly correct, the advantage signal shrinks toward zero and learning can stall.
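The vanishing signal can be seen in a small sketch of group-relative normalization. This is an illustration rather than the paper's code: the function name, the `eps` stabilizer, and the binary rewards are assumptions made for the example.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumed form: (r - mean) / (std + eps). The eps term is an illustrative
    guard against division by zero when all rewards are equal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical ORM rewards: 1.0 = correct final answer, 0.0 = incorrect.
mixed_group = [1.0, 0.0, 1.0, 0.0]
all_correct_group = [1.0, 1.0, 1.0, 1.0]

adv_mixed = group_relative_advantage(mixed_group)
adv_all_correct = group_relative_advantage(all_correct_group)

print("mixed group:      ", adv_mixed)
print("all-correct group:", adv_all_correct)
```

In the mixed group, correct responses receive positive advantages and incorrect ones negative advantages. In the all-correct group every advantage is zero, so that prompt contributes no gradient signal under an outcome-only reward.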
Limitations of Existing Reward Designs
- Outcome Reward Models (ORM): These models score only the correctness of the final answer, so they cannot differentiate correct responses by reasoning quality.
- Process Reward Models (PRM): PRMs provide richer supervision, but using their scores directly can lead to reward hacking: models exploit verbosity to inflate scores while accuracy suffers.
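A toy example illustrates how using PRM scores directly can pull training away from correctness. The scores below are invented for illustration and are not taken from the paper; they assume a rubric that happens to rate one long, incorrect response highly.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """Standardize rewards within one group (assumed GRPO-style form)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of four responses to one prompt.
is_correct = np.array([1.0, 1.0, 0.0, 0.0])
# Hypothetical PRM scores; the third response is verbose and wrong,
# yet the rubric scores it highest.
prm_score = np.array([0.4, 0.5, 0.9, 0.2])

# Using the PRM score as the reward, normalized over all responses.
adv = group_relative_advantage(prm_score)
most_reinforced = int(np.argmax(adv))

print("advantages:", adv)
print("most reinforced response is correct:", bool(is_correct[most_reinforced]))
```

Under these assumed scores, the response with the largest advantage is an incorrect one, which is the kind of pressure that lets inflated process scores displace accuracy.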
The PAPO Approach
PAPO addresses the shortcomings of both ORM and PRM by composing the advantage from two components that are normalized separately:
- Outcome Component (Aout): Derived from ORM, this component is normalized across all responses to anchor training on correctness.
- Process Component (Aproc): This component is derived from a rubric-based PRM and is normalized exclusively among correct responses. It allows for the differentiation of reasoning quality without distorting the outcome signal.
Decoupling the two normalizations keeps training anchored on correctness through Aout, while Aproc rewards better reasoning among responses that are already correct. Because the process signal is compared only among correct responses, it can differentiate reasoning quality without distorting the outcome signal, which mitigates the reward hacking seen when PRM scores are used directly.
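The two components can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `standardize` and `papo_advantages`, the unweighted sum of Aout and Aproc, the zero process advantage for incorrect responses, and the 0.5 correctness threshold are all choices made for the example, and the paper may combine the terms differently.

```python
import numpy as np

def standardize(x, eps=1e-6):
    """Zero-mean, unit-variance normalization (eps avoids division by zero)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def papo_advantages(outcome_rewards, process_scores, eps=1e-6):
    """Sketch of decoupled advantage normalization for one response group.

    outcome_rewards: ORM rewards, assumed 1.0 (correct) or 0.0 (incorrect).
    process_scores:  rubric-based PRM scores, one per response.
    """
    outcome = np.asarray(outcome_rewards, dtype=float)
    process = np.asarray(process_scores, dtype=float)

    # Aout: normalized across all responses in the group.
    a_out = standardize(outcome, eps)

    # Aproc: normalized only among correct responses.
    # Assumption: incorrect responses receive no process advantage.
    a_proc = np.zeros_like(outcome)
    correct = outcome > 0.5
    if correct.sum() >= 2:  # need at least two responses to compare
        a_proc[correct] = standardize(process[correct], eps)

    # Assumption: the components are combined by a plain sum.
    return a_out + a_proc, a_out, a_proc

# Mixed group: the incorrect response has the highest PRM score.
total, a_out, a_proc = papo_advantages([1, 1, 1, 0], [0.9, 0.5, 0.1, 0.95])
print("mixed group total:", total)

# All-correct group: Aout vanishes, Aproc still separates responses.
total_ac, a_out_ac, a_proc_ac = papo_advantages([1, 1, 1, 1], [0.9, 0.5, 0.1, 0.6])
print("all-correct group total:", total_ac)
```

In the mixed group the incorrect response keeps the lowest total advantage even though its PRM score is the highest, because that score never enters the normalization. In the all-correct group the outcome component is zero, but the process component still ranks the responses, so the group continues to provide a learning signal.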
Experimental Validation
PAPO was evaluated across multiple model scales and six benchmarks, where it consistently outperformed ORM. On OlympiadBench, for instance, PAPO reached 51.3% accuracy versus 46.3% for ORM, and it continued to improve while ORM plateaued and then declined.
Conclusion
PAPO addresses the limitations of existing reward designs through decoupled advantage normalization: the outcome component keeps training anchored on correctness, and the process component adds a signal for reasoning quality among correct responses. In the reported experiments, this improves accuracy over ORM and sustains improvement after ORM training plateaus.
