Targeted Exploration via Unified Entropy Control for Reinforcement Learning
arXiv:2604.14646v2
Abstract: Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity.
Existing exploration methods introduce additional bias or variance during exploration, which makes it difficult to keep optimization stable. To address these challenges, we propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for both exploration and stabilization.
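To make the two quantities in the abstract concrete, here is a minimal sketch of a group-relative advantage in the style of GRPO and of the Shannon entropy of a next-token distribution. The function names and the `eps` parameter are illustrative assumptions, not taken from the UEC-RL code, and real implementations differ in details such as which standard deviation they use.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Assumption: population standard deviation plus a small eps for
    numerical safety; actual GRPO implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A group of four sampled answers: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A uniform distribution has maximal entropy, a one-hot distribution has none.
uniform_entropy = token_entropy([0.25, 0.25, 0.25, 0.25])
collapsed_entropy = token_entropy([1.0, 0.0, 0.0, 0.0])
```

Entropy collapse, in these terms, means the per-token entropy drifts toward the one-hot case early in training. Note also that a group whose rewards are all identical yields all-zero advantages in this sketch, so such a group contributes no learning signal.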
Key Features of UEC-RL
- Targeted Exploration: UEC-RL activates more exploration on difficult prompts, enabling the model to search for potentially valuable reasoning trajectories that it would otherwise miss.
- Entropy Stabilization: A built-in stabilizer prevents entropy from growing uncontrollably, ensuring that training remains stable as the model consolidates reliable behaviors. This dual approach maintains a balance between exploration and stability.
- Robust Optimization: Together, the two mechanisms expand the search space when needed while keeping optimization robust throughout training.
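The interplay between the features above can be illustrated with a toy rule. This is a sketch under stated assumptions, not the mechanism from the paper: the difficulty measure (one minus the group pass rate), the hard entropy ceiling, and the names `exploration_coefficient`, `base_coef`, and `entropy_ceiling` are all hypothetical.

```python
def exploration_coefficient(group_rewards, current_entropy,
                            base_coef=0.01, entropy_ceiling=1.5):
    """Return a hypothetical entropy-bonus weight for one prompt.

    Assumptions (not from the paper):
    - difficulty is 1 minus the fraction of sampled answers with reward > 0;
    - the bonus is switched off once policy entropy reaches a fixed ceiling.
    """
    pass_rate = sum(1 for r in group_rewards if r > 0) / len(group_rewards)
    difficulty = 1.0 - pass_rate

    # Stabilizer: stop encouraging exploration when entropy is already high.
    if current_entropy >= entropy_ceiling:
        return 0.0

    # Targeted exploration: harder prompts receive a larger bonus.
    return base_coef * difficulty


hard_prompt = exploration_coefficient([0, 0, 0, 1], current_entropy=0.5)
easy_prompt = exploration_coefficient([1, 1, 1, 1], current_entropy=0.5)
capped = exploration_coefficient([0, 0, 0, 0], current_entropy=2.0)
```

In this toy version, a prompt the policy already solves every time gets no exploration bonus, and no prompt gets one once entropy exceeds the ceiling. The actual UEC-RL formulation may use a different difficulty signal and a smoother stabilizer.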
Experimental Results
Experiments on both LLM and VLM reasoning tasks show that UEC-RL consistently outperforms existing RL baselines on Pass@1 and Pass@$k$. On the Geometry3K dataset, UEC-RL achieves a 37.9% relative improvement over GRPO.
This result indicates that UEC-RL can sustain effective exploration without compromising convergence, and it suggests that the framework is a promising approach for scaling RL-based reasoning in large models.
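For readers unfamiliar with the metrics, the sketch below shows one common way to estimate Pass@$k$ from n samples of which c are correct, namely the unbiased estimator 1 - C(n-c, k) / C(n, k) widely used in code-generation evaluation, along with the arithmetic behind a relative improvement. Whether the paper uses exactly this estimator is an assumption, and the numbers in the example are made up for illustration rather than taken from the reported results.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem, c: how many were correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Illustrative values only: 10 samples, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)

# A move from 0.50 to 0.60 is 10 points absolute but a 20% relative gain.
gain = relative_improvement(0.50, 0.60)
```

The distinction in the last line matters when reading the Geometry3K figure: a relative improvement is measured against the baseline score, so it is not the same as a difference in percentage points.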
Conclusion
In conclusion, UEC-RL addresses entropy collapse in GRPO-style training by combining targeted exploration on difficult prompts with a stabilizer that keeps entropy from growing uncontrollably. The reported results on LLM and VLM reasoning tasks suggest that this combination preserves diversity without sacrificing training stability.
For those interested in exploring UEC-RL further, the code is available at https://github.com/597358816/UEC-RL.
