Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
arXiv:2510.10959v3
Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
Introduction
The emergence of Large Language Models (LLMs) has revolutionized the field of artificial intelligence, particularly in reasoning tasks. As these models evolve, the integration of Reinforcement Learning with Verifiable Rewards (RLVR) has been pivotal in enhancing their reasoning abilities. Despite its potential, RLVR faces significant challenges, primarily due to policy entropy collapse.
The Challenge of Policy Entropy Collapse
Policy entropy collapse occurs when the policy becomes overly deterministic, which hinders exploration and limits reasoning performance. Entropy regularization is the common remedy, but its effectiveness is highly sensitive to the choice of a fixed coefficient, which makes it unstable across tasks and models.
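To make "overly deterministic" concrete, the following is a minimal sketch of how the entropy of a next-token distribution can be measured. The function names and the toy logits are illustrative assumptions, not taken from the paper, and a real training loop would compute this on batched tensors rather than Python lists.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    # Skip zero-probability entries, whose contribution is defined as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(logits_per_step):
    """Average token entropy over the steps of one sampled response."""
    return sum(token_entropy(step) for step in logits_per_step) / len(logits_per_step)

# Hypothetical logits: a near-uniform distribution and a sharply peaked one.
exploratory = [0.0, 0.1, -0.1, 0.05]
collapsed = [10.0, 0.0, 0.0, 0.0]

print(token_entropy(exploratory))  # close to log(4), the maximum for 4 tokens
print(token_entropy(collapsed))    # close to 0: almost all mass on one token
```

A policy whose mean entropy drifts toward zero during training is sampling nearly the same tokens every time, which is the collapse described above.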
Understanding Entropy Regularization
Entropy regularization aims to maintain a level of randomness in the policy to promote exploration. However, our research indicates that the application of a static coefficient may not be sufficient. We have identified two critical insights:
- Tasks of varying difficulty require different levels of exploration intensity.
- Balanced exploration may require keeping policy entropy within a moderate range below its initial level.
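For reference, the usual fixed-coefficient form of entropy regularization can be sketched as below. This is the generic entropy-bonus formulation, not necessarily the exact objective used in the paper; the function name and all numeric values are made up for illustration.

```python
def entropy_regularized_loss(policy_loss, entropy, coef):
    """Generic entropy-bonus form: subtracting coef * entropy from the loss
    means that minimizing the loss also pushes entropy up."""
    return policy_loss - coef * entropy

# The same fixed coefficient applied at two hypothetical points in training.
early = entropy_regularized_loss(policy_loss=0.50, entropy=1.20, coef=0.01)
late = entropy_regularized_loss(policy_loss=0.50, entropy=0.05, coef=0.01)

print(early, late)
```

With `coef` fixed, the strength of the push toward higher entropy is the same for every task and every stage of training, which is the limitation the two observations above point at.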
Introducing Adaptive Entropy Regularization (AER)
In light of these findings, we propose Adaptive Entropy Regularization (AER). The framework dynamically balances exploration and exploitation through three components:
- Difficulty-aware Coefficient Allocation: Assigns the entropy coefficient according to task difficulty, so that tasks of different difficulty receive different exploration intensities.
- Initial-anchored Target Entropy: Sets the target entropy relative to the policy's initial entropy, within a moderate range below that starting level.
- Dynamic Global Coefficient Adjustment: Adjusts a global entropy coefficient during training instead of fixing it in advance.
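The three components can be illustrated with a toy sketch. The text above does not give the update rules, so everything here is an assumption made for illustration: the target is taken as a fixed fraction of the initial entropy, the global coefficient moves by a constant step toward that target, and harder prompts (lower rollout accuracy) are assumed to receive a larger coefficient. All names, default values, and the direction of the difficulty weighting are hypothetical and should not be read as the paper's method.

```python
def target_entropy(initial_entropy, ratio=0.5):
    """Hypothetical target: a fixed fraction of the initial policy entropy."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    return ratio * initial_entropy

def update_global_coef(coef, current_entropy, target, step=0.001, max_coef=0.1):
    """Raise the coefficient when entropy is below target, lower it otherwise,
    then clip the result to [0, max_coef]."""
    if current_entropy < target:
        coef += step
    else:
        coef -= step
    return min(max(coef, 0.0), max_coef)

def per_prompt_coef(global_coef, rollout_accuracy):
    """Assumed allocation: scale by (1 - accuracy) so harder prompts get more."""
    return global_coef * (1.0 - rollout_accuracy)

# Toy run with made-up entropy measurements.
target = target_entropy(initial_entropy=1.0, ratio=0.5)
coef = 0.005
for observed_entropy in [0.9, 0.6, 0.4, 0.3]:
    coef = update_global_coef(coef, observed_entropy, target)

print(target, coef, per_prompt_coef(coef, rollout_accuracy=0.25))
```

In this sketch the coefficient falls while entropy is above the target and rises again once entropy drops below it, so the regularization strength tracks the state of training rather than being chosen once.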
Empirical Results
Our experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms the baselines, improving both reasoning accuracy and exploration capability.
Conclusion
As AI continues to advance, the need for effective exploration strategies in LLMs becomes increasingly important. The introduction of Adaptive Entropy Regularization provides a promising solution to the challenges posed by policy entropy collapse. By embracing a dynamic approach to entropy management, we can unlock the full potential of reinforcement learning in enhancing reasoning capabilities in LLMs.
