Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
arXiv:2604.02986v1 (cross-listed)
Abstract: Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), which down-weights non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage, using only the RM parameters and on-policy completions. On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
Introduction
Reinforcement learning from human feedback (RLHF) is a widely used way to align AI systems with human preferences. Its reliance on learned reward models (RMs), however, leaves it vulnerable to reward hacking.
The Issue of Reward Hacking
Reward hacking occurs when a policy maximizes a learned proxy reward at the expense of true quality. As optimization proceeds, the proxy reward keeps rising while true quality plateaus or degrades, because the policy exploits weaknesses in the RM rather than genuinely improving its outputs. A key factor is the sign of the advantage: when it is flipped, the update increases the likelihood of a bad response instead of reducing it.
Understanding Advantage Sign Robustness
We hypothesize that reward hacking is often caused by flipped advantage signs. When the sign of a completion's advantage is reversed, the policy update increases the likelihood of a bad response instead of reducing it, which works against the training objective.
- Flipped advantage signs lead to adverse policy updates.
- Considering an adversarial perturbation in the RM parameter space shows how easily each advantage sign can be flipped.
- The resulting certified sign-preservation radius is the smallest perturbation that can flip the advantage sign during policy optimization.
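The effect of a flipped advantage sign can be seen in a toy setting. The sketch below is only an illustration, not the paper's setup: it assumes a two-response softmax policy and a plain REINFORCE-style update, and the names `reinforce_step`, `true_adv`, and `flipped_adv` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, sampled, advantage, lr=0.5):
    """One gradient-ascent step on advantage * log pi(sampled) for a softmax policy."""
    probs = softmax(logits)
    # d log pi(sampled) / d logit_k = 1[k == sampled] - pi(k)
    return [
        z + lr * advantage * ((1.0 if k == sampled else 0.0) - probs[k])
        for k, z in enumerate(logits)
    ]

logits = [0.0, 0.0]   # two candidate responses, initially equally likely
bad = 1               # index of the response that is truly worse

true_adv = -1.0       # correct sign: the bad response should become less likely
flipped_adv = +1.0    # sign flipped by an unreliable reward model

p_correct = softmax(reinforce_step(logits, bad, true_adv))[bad]
p_flipped = softmax(reinforce_step(logits, bad, flipped_adv))[bad]

print(f"P(bad) with correct sign: {p_correct:.3f}")
print(f"P(bad) with flipped sign: {p_flipped:.3f}")
```

With the correct sign the probability of the bad response falls below its initial value of 0.5; with the flipped sign the same update pushes it above 0.5.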
Introducing Sign-Certified Policy Optimization (SignCert-PO)
To address this, we introduce Sign-Certified Policy Optimization (SignCert-PO). It down-weights non-robust completions, those whose advantage sign could be flipped by a small perturbation of the RM parameters, in the policy gradient update.
- SignCert-PO operates purely at the policy optimization stage.
- It requires only the RM parameters and on-policy completions, making it lightweight.
- Unlike prior approaches, it does not require multiple RMs or access to the RM training data.
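The idea of down-weighting non-robust completions can be sketched as follows. This is a minimal illustration under strong assumptions, not the paper's certificate or weighting rule: it assumes a linear reward model r(x) = w · phi(x), a mean-reward baseline over a group of on-policy completions, an L2 perturbation of w, and a hypothetical clipped weight min(1, radius / tau). Under these assumptions the advantage of completion i is w · d_i with d_i = phi_i - mean(phi), and the smallest perturbation of w that drives it to zero has norm |w · d_i| / ||d_i||.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def advantages_and_radii(w, features):
    """Advantages under a linear RM with a mean baseline, plus the L2 norm of the
    smallest change to w that would zero (and hence could flip) each advantage."""
    n, dim = len(features), len(w)
    mean_feat = [sum(f[j] for f in features) / n for j in range(dim)]
    advantages, radii = [], []
    for f in features:
        d = [f[j] - mean_feat[j] for j in range(dim)]  # centered features
        adv = dot(w, d)                                # advantage = w . d
        norm = math.sqrt(dot(d, d))
        advantages.append(adv)
        # Distance from w to the hyperplane {v : v . d = 0}.
        radii.append(abs(adv) / norm if norm > 0.0 else 0.0)
    return advantages, radii

def robustness_weights(radii, tau):
    """Hypothetical weighting: full weight once the radius reaches tau."""
    return [min(1.0, r / tau) for r in radii]

# Toy RM parameters and features for three on-policy completions (made up).
w = [1.0, 0.0]
features = [[2.0, 0.0], [0.1, 3.0], [-2.1, -3.0]]

advantages, radii = advantages_and_radii(w, features)
weights = robustness_weights(radii, tau=0.5)
weighted_advantages = [wt * a for wt, a in zip(weights, advantages)]

for a, r, wt in zip(advantages, radii, weights):
    print(f"advantage={a:+.2f}  radius={r:.3f}  weight={wt:.3f}")
```

In this toy example the second completion has a small positive advantage whose sign a tiny change in w would flip, so it contributes little to the update, while the other two keep full weight. The sketch uses only the RM parameters and the on-policy completions, in line with the properties listed above; the threshold `tau` is an illustrative knob, not a parameter taken from the paper.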
Results and Implications
On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking. Because it needs only the RM parameters and on-policy completions, it can make RLHF more robust without multiple RMs or access to the RM training data.
Conclusion
Reward hacking remains a central obstacle to reliable RLHF. Our analysis of advantage sign robustness suggests a lightweight remedy: SignCert-PO down-weights completions whose advantage sign is not robust to perturbations of the RM parameters, and on the benchmarks above it improves win rate and reduces reward hacking.
