EXPO: Adaptive Policy Optimization for AI Exploration

Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

A paper posted to arXiv under the identifier 2605.09923v1, “EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling,” proposes a new approach to reinforcement learning for large language models (LLMs). The method targets two inefficiencies in Group Relative Policy Optimization (GRPO), an algorithm widely used to train LLMs for mathematical reasoning.

Background on Reinforcement Learning with Verifiable Rewards

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard approach for improving the mathematical reasoning of LLMs, and GRPO is the leading algorithm within this paradigm. The researchers identify two inefficiencies that limit GRPO’s performance:

Fixed KL Penalty Coefficient: Because the KL penalty coefficient is static, it restricts exploration during the phases of training when the policy needs to deviate substantially from the reference policy.
Uniform Sampling of Training Questions: Sampling training questions uniformly ignores the fact that moderately difficult problems provide more informative gradient signals than problems the model almost always or almost never solves.
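The second point can be made concrete with a small sketch. It assumes the commonly used form of GRPO's group-relative advantage (each rollout's reward minus the group mean, divided by the group standard deviation) and binary correct/incorrect rewards; the article itself does not spell out the formula, and the function names here are illustrative only.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def reward_variance(accuracy):
    """Variance of a binary reward whose success probability is `accuracy`."""
    return accuracy * (1.0 - accuracy)


if __name__ == "__main__":
    # Every rollout correct: all advantages are zero, so no learning signal.
    print(group_relative_advantages([1, 1, 1, 1]))
    # Half the rollouts correct: advantages are non-zero.
    print(group_relative_advantages([1, 0, 1, 0]))
    # Reward variance for easy, moderate, and hard questions.
    print([reward_variance(a) for a in (0.9, 0.5, 0.1)])
```

Under these assumptions, a question the model always solves (or always fails) yields zero advantages for every rollout, while the reward variance p(1 - p) peaks at p = 0.5, which is consistent with the claim that moderately difficult questions carry the most signal.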

Introducing EXPO: A Novel Approach

To address these inefficiencies, the researchers proposed Exploration-Prioritized Policy Optimization (EXPO), which introduces two lightweight plug-in modules:

Accuracy-Conditioned KL Scaling (AKL): This module adjusts the KL regularization strength dynamically, using a smooth nonlinear function of the batch average accuracy. It relaxes the penalty when the model is underperforming and strengthens it when accuracy is satisfactory, which gives the policy more room to explore when it needs it.
Gaussian Curriculum Sampling (GCS): This module weights training questions with a Gaussian centered on moderate accuracy (approximately 0.5), so that training concentrates on the model’s learning frontier rather than on questions that are too easy or too hard.
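The two modules might be sketched as follows. This is a minimal illustration, not the paper's implementation: the article does not give the exact AKL function, so a logistic (sigmoid) curve stands in for the "smooth nonlinear function", and every parameter name and default value (`beta_min`, `beta_max`, `midpoint`, `sharpness`, `sigma`) is an assumption chosen for readability.

```python
import math
import random


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    midpoint=0.5, sharpness=10.0):
    """Hypothetical AKL rule: the KL coefficient rises smoothly with accuracy.

    A logistic curve is an assumed stand-in for the paper's function.
    Low batch accuracy gives a value near beta_min (weaker penalty);
    high batch accuracy gives a value near beta_max (stronger penalty).
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_weights(question_accuracies, center=0.5, sigma=0.2):
    """Gaussian weights over per-question accuracy, normalized to sum to 1."""
    raw = [math.exp(-((a - center) ** 2) / (2.0 * sigma ** 2))
           for a in question_accuracies]
    total = sum(raw)
    return [w / total for w in raw]


def sample_questions(questions, question_accuracies, k, rng=random):
    """Draw k questions (with replacement) in proportion to their GCS weight."""
    weights = gcs_weights(question_accuracies)
    return rng.choices(questions, weights=weights, k=k)


if __name__ == "__main__":
    print(akl_coefficient(0.1), akl_coefficient(0.9))
    accuracies = [0.0, 0.2, 0.5, 0.8, 1.0]
    print(gcs_weights(accuracies))
    print(sample_questions(["q0", "q1", "q2", "q3", "q4"], accuracies, k=8))
```

In this sketch the per-question accuracy would come from the fraction of correct rollouts observed for each question, and the coefficient returned by `akl_coefficient` would replace the fixed KL weight in the GRPO loss for that batch.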

Experimental Results and Performance Gains

The researchers evaluated EXPO on two models, DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base, across six mathematical reasoning benchmarks:

EXPO improved on vanilla GRPO by 13.34 percentage points on the AIME 2025 pass@32 metric, raising the success rate from 63.33 percent to 76.67 percent.
For the 8B model, the average pass@32 improvement was 2.66 points.
The gains on pass@32 were considerably larger than those on pass@1, which suggests that EXPO expands the model’s exploration boundary under a fixed inference cost budget.
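For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers to a question is correct. The article does not say how the paper computes it; the sketch below uses the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where `n` is the number of samples drawn per question and `c` the number that were correct. Those names are assumptions for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals the probability that a random subset of k of the n samples
    contains at least one correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # A question solved by only 1 of 32 samples.
    print(pass_at_k(32, 1, 1))
    print(pass_at_k(32, 1, 32))
```

A question solved by a single sample out of 32 counts fully toward pass@32 but contributes only 1/32 to pass@1, which is one way a method that broadens exploration can move pass@32 much more than pass@1.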

Conclusion

EXPO addresses two limitations of GRPO with two lightweight additions: a KL penalty that adapts to batch accuracy and a sampling scheme that favors questions of moderate difficulty. The reported results suggest that these changes make RLVR training for mathematical reasoning in LLMs more effective, particularly on metrics that reward broader exploration.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
