Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
arXiv:2604.18701v1 (cross-listed)
Abstract
Prediction-error curiosity rewards are usually local: they score the immediate transition and ignore the world model's cumulative prediction error across all transitions encountered so far. Curiosity-Critic instead uses the improvement in cumulative prediction error as the intrinsic reward. The formulation is tractable because the reward reduces to a per-step form: the difference between the current prediction error on a transition and the asymptotic error baseline for that same transition.
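In symbols, the per-step form can be written as below. The notation is an assumption made here for illustration and is not taken from the paper: tau_t is the transition at step t, e_t is the world model's current prediction error on it, and e-hat-infinity is the estimated asymptotic error (the baseline) for the same transition.

```latex
r^{\mathrm{int}}_t \;=\; e_t(\tau_t) \;-\; \hat{e}_{\infty}(\tau_t),
\qquad \tau_t = (s_t, a_t, s_{t+1})
```

Under this reading, the reward is large where the model is currently worse than it could eventually become, and close to zero where the error already sits at its floor.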
Key Innovations
- Online Baseline Estimation: The baseline error is estimated online by a learned critic that is co-trained with the world model. Because the critic regresses only a single scalar, it can converge quickly, before the world model saturates.
- Directed Exploration: The reward steers exploration toward learnable transitions without requiring oracle knowledge of the environment's noise floor.
- Separation of Errors: The framework separates epistemic (reducible) from aleatoric (irreducible) prediction error online, so that only the reducible part is rewarded.
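The epistemic/aleatoric split can be made concrete with a small worked example. This is a minimal sketch under stated assumptions, not the paper's method: the outcome of a transition is binary, the error is expected log-loss, and the baseline is supplied by hand as the true entropy. That hand-supplied entropy is exactly the oracle knowledge the learned critic is meant to replace, so it serves only as a stand-in here. All function and variable names are hypothetical.

```python
import math


def entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) outcome: the irreducible noise floor."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def expected_error(p: float, q: float) -> float:
    """Expected log-loss of a model predicting q when the true probability is p.

    Requires 0 < q < 1 so that both logarithms are finite.
    """
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))


def baseline_subtracted_reward(p: float, q: float) -> float:
    """Expected error minus the noise floor; this equals KL(p || q) >= 0."""
    return expected_error(p, q) - entropy(p)


# Hypothetical transitions: (true probability p, current model estimate q).
noisy = (0.5, 0.5)        # pure coin flip, already modelled as well as possible
learnable = (0.95, 0.5)   # nearly deterministic, model has not learned it yet

raw_noisy = expected_error(*noisy)
raw_learnable = expected_error(*learnable)
reward_noisy = baseline_subtracted_reward(*noisy)
reward_learnable = baseline_subtracted_reward(*learnable)

print(f"raw error:  noisy={raw_noisy:.3f}  learnable={raw_learnable:.3f}")
print(f"reward:     noisy={reward_noisy:.3f}  learnable={reward_learnable:.3f}")
```

Both transitions have the same raw error (ln 2, about 0.693), so a raw prediction-error reward cannot tell them apart. After subtracting the baseline, the coin flip earns zero reward while the learnable transition keeps a positive one (about 0.495), which is the behavior the list above describes.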
Theoretical Underpinnings
Curiosity-Critic also unifies earlier work. Prior formulations of prediction-error curiosity, dating back to Schmidhuber (1991), are shown to be special cases of the new reward, each corresponding to a specific approximation of the error baseline. This places earlier curiosity mechanisms and Curiosity-Critic within a single cumulative-error view.
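As a hedged illustration of what "approximating the baseline" can mean, write the reward as r_t = e_t - b_t, where e_t is the current prediction error on a transition, b_t is the baseline, and e'_t is the error on the same transition after the model has been updated on it. This notation is assumed here and is not the paper's. Two familiar rewards then follow algebraically; whether these are exactly the correspondences the paper draws is not stated in this summary.

```latex
\begin{align*}
b_t = 0 \quad &\Rightarrow \quad r_t = e_t
  && \text{(raw prediction error)} \\
b_t = e'_t \quad &\Rightarrow \quad r_t = e_t - e'_t
  && \text{(one-step learning progress)}
\end{align*}
```

A zero baseline treats every error as reducible, which is why raw prediction error is drawn to irreducible noise. A one-step baseline rewards only the improvement gained from the latest update.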
Experimental Validation
The approach was evaluated in a stochastic grid world. The reported results show Curiosity-Critic significantly outperforming prediction-error and visitation-count baselines in both convergence speed and final world-model accuracy. These findings suggest that intrinsic rewards based on cumulative prediction error improvement could make world-model learning more efficient.
Conclusion
Curiosity-Critic integrates cumulative prediction error improvement into the intrinsic reward. Doing so improves exploration efficiency and makes explicit how the choice of error baseline shapes curiosity-driven exploration. Future work could refine the approach and test it in environments more complex than the stochastic grid world used here.
