EXPO: Adaptive Policy Optimization for AI Exploration


Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

Researchers have introduced a new approach to reinforcement learning in the paper “EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling,” available on arXiv under the identifier 2605.09923v1. The method targets two inefficiencies in Group Relative Policy Optimization (GRPO), an algorithm widely used to train large language models (LLMs) for mathematical reasoning.

Background on Reinforcement Learning with Verifiable Rewards

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a standard approach for improving the mathematical reasoning of LLMs, and GRPO is the leading algorithm within this paradigm. The researchers identify two inefficiencies that limit GRPO’s performance:

  • Fixed KL Penalty Coefficient: Because the KL penalty coefficient stays constant throughout training, the policy is held equally close to the reference policy at every stage. This restricts exploration, especially in phases where large deviations from the reference policy are needed.
  • Uniform Sampling of Training Questions: Sampling training questions uniformly treats all questions alike, regardless of difficulty. This underuses moderately challenging problems, which provide more informative gradient signals for optimization.
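The second point can be illustrated with the group-relative advantage that GRPO is commonly described as using: each sampled answer's reward is normalized by the mean and standard deviation of its group. The sketch below is a simplified illustration, not code from the paper. The function name `group_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for the example.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled answers.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ in both details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Binary verifiable rewards: 1.0 = correct answer, 0.0 = incorrect answer.
too_easy = group_advantages([1.0, 1.0, 1.0, 1.0])   # every sample correct
too_hard = group_advantages([0.0, 0.0, 0.0, 0.0])   # every sample wrong
moderate = group_advantages([1.0, 0.0, 1.0, 0.0])   # half correct

# Groups with identical rewards yield all-zero advantages, so they
# contribute no learning signal under this normalization.
print(too_easy, too_hard, moderate)
```

With binary rewards, the within-group standard deviation is largest when about half the answers are correct, which is one way to see why questions of moderate difficulty carry the most signal.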

Introducing EXPO: A Novel Approach

To address these inefficiencies, the researchers proposed Exploration-Prioritized Policy Optimization (EXPO), which introduces two lightweight plug-in modules:

  • Accuracy-Conditioned KL Scaling (AKL): This module adjusts the KL regularization strength dynamically, using a smooth nonlinear function of the batch average accuracy. It relaxes the penalty when the model is underperforming, leaving more room to explore, and strengthens it once the model achieves satisfactory results.
  • Gaussian Curriculum Sampling (GCS): This module weights training questions with a Gaussian function of their accuracy, centered on moderate accuracy (approximately 0.5). Questions the model solves about half the time are sampled most often, which concentrates training on the model’s learning frontier.
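The two modules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the article does not give the exact AKL function, so a logistic curve stands in for the "smooth nonlinear function", and every name and default value here (`akl_beta`, `gcs_probs`, `sample_questions`, `beta_min`, `beta_max`, `midpoint`, `sharpness`, `sigma`) is hypothetical. Only the Gaussian center of 0.5 comes from the description above.

```python
import math
import random

def akl_beta(batch_acc, beta_min=0.001, beta_max=0.04,
             midpoint=0.5, sharpness=10.0):
    """KL coefficient as a smooth, increasing function of batch accuracy.

    Assumption: a logistic gate. Low accuracy gives a value near beta_min
    (relaxed penalty), high accuracy a value near beta_max (strong penalty).
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_acc - midpoint)))
    return beta_min + (beta_max - beta_min) * gate

def gcs_probs(accuracies, mu=0.5, sigma=0.2):
    """Sampling probabilities from Gaussian weights centered at mu.

    accuracies: per-question accuracy estimates in [0, 1].
    sigma is an assumed width; the article does not state one.
    """
    weights = [math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))
               for a in accuracies]
    total = sum(weights)
    return [w / total for w in weights]

def sample_questions(accuracies, k, rng=random):
    """Draw k question indices (with replacement) according to gcs_probs."""
    probs = gcs_probs(accuracies)
    return rng.choices(range(len(accuracies)), weights=probs, k=k)

# Example: five questions ranging from never solved to always solved.
question_acc = [0.0, 0.25, 0.5, 0.75, 1.0]
print(gcs_probs(question_acc))          # the middle question gets the most mass
print(sample_questions(question_acc, k=8, rng=random.Random(0)))
print(akl_beta(0.2), akl_beta(0.8))     # weaker penalty at low accuracy
```

In a training loop, `akl_beta` would be re-evaluated each step from the current batch accuracy, and the per-question accuracies fed to `gcs_probs` would be updated as new rollouts arrive. How the paper tracks those accuracies is not described here.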

Experimental Results and Performance Gains

The researchers evaluated EXPO on two models, DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base, across six mathematical reasoning benchmarks:

  • EXPO improved on vanilla GRPO by 13.34 percentage points on AIME 2025 pass@32, raising the score from 63.33 percent to 76.67 percent.
  • For the 8B model, the average pass@32 improvement was 2.66 points.
  • The gains on pass@32 were considerably larger than those on pass@1, which suggests that EXPO expands the model’s exploration boundary under a fixed inference cost budget.
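For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers to a problem is correct, averaged over the benchmark. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples with c correct. Whether the paper computes pass@32 this way is an assumption, and the per-problem counts are made up for illustration. They assume a 30-problem benchmark in which 19 problems are solved at least once.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so any k-subset has a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over all problems, expressed in percent."""
    scores = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical counts: 19 problems solved at least once, 11 never solved.
counts = [1] * 19 + [0] * 11
score = benchmark_pass_at_k(counts, n=32, k=32)
print(round(score, 2))

# Absolute gain reported in the article, in percentage points.
grpo_score, expo_score = 63.33, 76.67
gain = round(expo_score - grpo_score, 2)
print(gain)
```

Under that 30-problem assumption, the move from 63.33 to 76.67 percent corresponds to four additional problems solved at least once in 32 attempts (19 of 30 versus 23 of 30).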

Conclusion

EXPO addresses two limitations of GRPO with lightweight additions: an adaptive KL penalty that follows batch accuracy, and a sampling scheme that favors questions of moderate difficulty. The reported results, with larger gains on pass@32 than on pass@1, indicate that these changes mainly help the model explore more effectively during mathematical reasoning training.

Lazarus Omolua