Safe Reinforcement Learning with Preference-based Constraint Inference
arXiv:2603.23565v1 (cross-listed)
Abstract
Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and hard to specify explicitly. Existing work on constraint inference relies on restrictive assumptions or extensive expert demonstrations, neither of which is realistic in many real-world applications. This study focuses on how to learn these constraints cheaply and reliably.
Inferring constraints from human preferences offers a data-efficient alternative, but we find that the popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, which leads to risk underestimation. The impact of BT models on downstream policy learning has also received little attention in the literature. To address these gaps, we propose a novel approach, Preference-based Constrained Reinforcement Learning (PbCRL).
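For readers unfamiliar with Bradley-Terry preference models, the following is a minimal sketch of how such a model can score a pair of trajectories by their costs. It is an illustration of the standard BT form, not the authors' code: the function names, the use of summed per-step costs, and the labelling convention (label 1.0 means trajectory A was judged less safe) are all assumptions made for this example.

```python
import math


def trajectory_cost(step_costs):
    """Sum per-step costs into one trajectory-level cost (assumed aggregation)."""
    return sum(step_costs)


def bt_unsafe_probability(cost_a, cost_b):
    """Bradley-Terry probability that trajectory A is judged less safe than B.

    Mathematically equal to exp(cost_a) / (exp(cost_a) + exp(cost_b)),
    computed as a logistic function of the cost difference for stability.
    """
    diff = cost_a - cost_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_loss(cost_a, cost_b, label, eps=1e-12):
    """Negative log-likelihood of one preference label under the BT model.

    label is 1.0 if A was judged less safe than B, and 0.0 otherwise.
    """
    p = bt_unsafe_probability(cost_a, cost_b)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    a = trajectory_cost([0.5, 1.5, 1.0])  # hypothetical per-step costs
    b = trajectory_cost([0.2, 0.3, 0.5])
    print(bt_unsafe_probability(a, b), bt_loss(a, b, 1.0))
```

Note that the BT probability depends only on the cost difference and satisfies p(A, B) = 1 - p(B, A), so the model treats both directions of a comparison symmetrically.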
Introduction
Ensuring safety during decision making is critical in reinforcement learning, particularly in applications such as autonomous driving, healthcare, and robotics. Traditional approaches to constraint inference often fail to capture the complexity of real-world scenarios. Our research aims to bridge this gap with PbCRL, which uses human preferences to infer constraints while addressing the limitations of existing models.
Key Innovations
- Dead Zone Mechanism: We introduce a novel dead zone mechanism into preference modeling. We prove theoretically that this mechanism encourages heavy-tailed cost distributions, which yields better constraint alignment.
- Signal-to-Noise Ratio (SNR) Loss: We incorporate an SNR loss that accounts for cost variance, which encourages exploration and benefits policy learning.
- Two-Stage Training Strategy: We use a two-stage training strategy that reduces the online labeling burden while adaptively improving constraint satisfaction.
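The summary above does not give formulas for the dead zone mechanism or the SNR loss, so the sketch below is only one plausible reading and should not be taken as the authors' formulation. It assumes that the dead zone is the classic dead-zone function applied to the cost difference before the logistic preference model, and that the SNR of a batch of predicted costs is the absolute mean divided by the standard deviation. The names `dead_zone`, `dead_zone_preference_prob`, `cost_snr`, and the threshold `delta` are hypothetical.

```python
import math
import statistics


def dead_zone(x, delta):
    """Classic dead-zone function: zero on [-delta, delta], shifted linear outside."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dead_zone_preference_prob(cost_a, cost_b, delta):
    """Probability that A is judged less safe than B (assumed form).

    Cost differences smaller than delta are treated as indistinguishable,
    so the model returns 0.5 for them instead of a weak preference.
    """
    return _sigmoid(dead_zone(cost_a - cost_b, delta))


def cost_snr(costs, eps=1e-8):
    """Signal-to-noise ratio of predicted costs: |mean| / (std + eps) (assumed form)."""
    mean = statistics.fmean(costs)
    std = statistics.pstdev(costs)
    return abs(mean) / (std + eps)


if __name__ == "__main__":
    print(dead_zone_preference_prob(1.0, 0.8, delta=0.5))  # inside the dead zone
    print(dead_zone_preference_prob(3.0, 1.0, delta=0.5))  # outside the dead zone
    print(cost_snr([1.0, 1.0, 3.0, 3.0]))
```

How such an SNR term would enter the training objective, and how the threshold is chosen, are not specified in this summary; the full paper is the reference for those details.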
Empirical Results
Our empirical findings demonstrate that PbCRL achieves superior alignment with true safety requirements compared to existing models. The results indicate that our method not only enhances safety but also improves overall reward outcomes. This positions PbCRL as a promising solution for constraint inference in safe reinforcement learning contexts.
Conclusion
PbCRL offers an approach to constraint inference in safe reinforcement learning that addresses shortcomings of existing methods through a dead zone mechanism, an SNR loss, and a two-stage training strategy. It shows potential for application in a range of safety-critical domains, and we anticipate that these findings will contribute to the development of safer AI systems.
