Polychromic Objectives for Reinforcement Learning
arXiv:2509.25424v4
Abstract
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling.
To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective.
Methodology
Our method comprises two key innovations:
- Vine Sampling: We use vine sampling to collect on-policy rollouts. In its standard form, vine sampling branches several rollouts from states visited along a trajectory, which yields a set of alternative continuations to compare during training.
- Modified Advantage Function: We modify the advantage function so that it reflects the advantage under the polychromic objective rather than under the standard expected-return objective.
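To make the idea of a modified advantage concrete, the following is a minimal sketch and not the authors' implementation. It assumes a set-level objective that scores a set of rollouts by its mean reward times a diversity term, and it credits each rollout with the average score of the sets containing it minus the average score over all sets. The names `distinct_fraction`, `set_score`, and `polychromic_advantages`, the exact form of the score, and the exhaustive enumeration of subsets are all illustrative assumptions. The rollouts are assumed to start from a shared state, as vine sampling would provide.

```python
from itertools import combinations


def distinct_fraction(trajs):
    """Hypothetical diversity term: fraction of distinct trajectories in a set."""
    return len(set(trajs)) / len(trajs)


def set_score(trajs, rewards):
    """Assumed set-level objective: mean reward of the set times its diversity."""
    return (sum(rewards) / len(rewards)) * distinct_fraction(trajs)


def polychromic_advantages(trajs, rewards, set_size):
    """Advantage of each rollout: mean score of the sets that contain it,
    minus the mean score over all sets (used here as the baseline)."""
    n = len(trajs)
    if not 1 <= set_size <= n:
        raise ValueError("set_size must be between 1 and the number of rollouts")
    # Enumerate every subset of the given size (only feasible for small batches).
    sets = list(combinations(range(n), set_size))
    scores = [
        set_score([trajs[i] for i in s], [rewards[i] for i in s]) for s in sets
    ]
    baseline = sum(scores) / len(scores)
    advantages = []
    for i in range(n):
        own = [score for s, score in zip(sets, scores) if i in s]
        advantages.append(sum(own) / len(own) - baseline)
    return advantages


# Toy batch: four rollouts (tuples of actions) from the same start state.
# Rollouts 0 and 1 are identical successes, 2 is a distinct success, 3 fails.
trajs = [("L", "L", "F"), ("L", "L", "F"), ("R", "F"), ("F", "F")]
rewards = [1.0, 1.0, 1.0, 0.0]
adv = polychromic_advantages(trajs, rewards, set_size=2)
```

In this toy batch the distinct successful rollout receives a positive advantage, the two duplicated successes receive zero, and the failed rollout receives a negative advantage. Under these assumptions, a policy gradient step would favor successes that add variety over successes that repeat one another.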
Experimental Results
We evaluate the method on the BabyAI, Minigrid, and Algorithmic Creativity benchmarks. The results show improvements in three areas:
- Success Rates: Our method improves success rates by reliably solving a larger set of environment configurations.
- Generalization: The policy generalizes better under large perturbations, that is, under conditions not seen during training.
- Diverse Strategies: When given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.
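For readers unfamiliar with the metric, pass@$k$ is the probability that at least one of $k$ attempts succeeds. The sketch below uses the standard unbiased combinatorial estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $c$ successes among $n$ samples. The source does not state which estimator the authors use, so this choice and the function name `pass_at_k` are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts, c of which succeeded.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n attempts contains no success.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts on one task, 3 of them successful.
single_attempt = pass_at_k(10, 3, 1)
five_attempts = pass_at_k(10, 3, 5)
```

A policy that has collapsed onto one strategy gains little as $k$ grows on tasks where that strategy fails, whereas a policy that keeps several distinct strategies can convert extra attempts into extra coverage.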
Conclusion
The polychromic objective targets the loss of diversity that can occur during RLFT. By explicitly enforcing the exploration and refinement of diverse generations, it mitigates collapse onto a handful of easily exploitable outputs and helps retain the broad range of behaviors present in the pretrained policy. Our findings suggest that this improves success rates on the evaluated tasks and also increases coverage when the policy is given multiple attempts.
Future Work
Looking ahead, we aim to integrate the polychromic objective with other reinforcement learning frameworks and to assess its impact in more complex environments.
