Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
arXiv:2604.15414v1
Abstract: Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address.
Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning.
Key Features of TeLAPA
- Behaviorally Diverse Policy Neighborhoods: TeLAPA organizes policies into distinct neighborhoods that reflect different strategies and competencies relevant to specific tasks.
- Shared Latent Space: The framework maintains a shared latent space, allowing for the comparison and reuse of policies even as tasks evolve over time.
- Support for Non-Stationary Drift: Because archived policies stay in skill-aligned neighborhoods within the shared latent space, they remain comparable and reusable as task requirements shift.
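The features above can be illustrated with a generic quality-diversity archive in the style of MAP-Elites. This is a minimal sketch, not TeLAPA's actual data structure, which the summary does not specify. The names `Entry`, `TaskArchive`, `archive_for`, the grid discretization, and the assumption that each behavior descriptor is a point in [0, 1)^d of the shared latent space are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Cell = Tuple[int, ...]


@dataclass
class Entry:
    policy_id: str                  # stand-in for real policy parameters
    descriptor: Tuple[float, ...]   # assumed: coordinates in [0, 1)^d of a shared latent space
    fitness: float                  # return achieved on the source task


@dataclass
class TaskArchive:
    """One archive per task: a grid over descriptor space, one elite per cell."""
    bins: int = 4
    cells: Dict[Cell, Entry] = field(default_factory=dict)

    def cell_of(self, descriptor: Tuple[float, ...]) -> Cell:
        # Discretize each descriptor coordinate into one of `bins` intervals.
        return tuple(min(int(x * self.bins), self.bins - 1) for x in descriptor)

    def add(self, entry: Entry) -> bool:
        # Keep the entry only if its cell is empty or it beats the incumbent.
        key = self.cell_of(entry.descriptor)
        incumbent = self.cells.get(key)
        if incumbent is None or entry.fitness > incumbent.fitness:
            self.cells[key] = entry
            return True
        return False

    def neighborhood(self, descriptor: Tuple[float, ...], radius: int = 1) -> List[Entry]:
        # Return the elites whose cells lie within `radius` cells of the query
        # on every axis, i.e. behaviorally related policies.
        center = self.cell_of(descriptor)
        return [
            e for k, e in self.cells.items()
            if all(abs(a - b) <= radius for a, b in zip(k, center))
        ]


# Per-task archives, keyed by a hypothetical task name.
archives: Dict[str, TaskArchive] = {}


def archive_for(task: str) -> TaskArchive:
    return archives.setdefault(task, TaskArchive())
```

Because every archive discretizes the same descriptor space, a query such as `archive_for("task_a").neighborhood((0.1, 0.1))` returns a set of behaviorally related candidates rather than a single policy. That set is the kind of object the neighborhood-based view above depends on.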
Performance and Analysis
In our MiniGrid continual learning (CL) setting, TeLAPA improves on three measures:
- Increased Task Success: TeLAPA learns more tasks successfully than traditional single-model approaches.
- Faster Recovery: Competence on revisited tasks is recovered more quickly after interference.
- Higher Performance Retention: TeLAPA retains higher performance across a sequence of tasks.
Our analyses reveal that source-optimal policies, those that performed best on the task they were trained on, are often not transfer-optimal, even within a local neighborhood of competent policies. Effective reuse in continual reinforcement learning therefore requires retaining and selecting among multiple nearby alternatives, rather than collapsing them into a single representative policy.
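The distinction between source-optimal and transfer-optimal can be made concrete with a small sketch. It assumes a hypothetical `probe` callable that estimates how well a candidate performs on the new task after a short adaptation budget. The summary does not describe how TeLAPA actually selects policies, and the numbers below are toy values invented for illustration, not results from the paper.

```python
from typing import Callable, Dict, Sequence


def source_optimal(source_return: Dict[str, float]) -> str:
    # Single-policy preservation: keep whichever policy scored best on the source task.
    return max(source_return, key=source_return.get)


def transfer_optimal(candidates: Sequence[str], probe: Callable[[str], float]) -> str:
    # Neighborhood-based selection: score every retained candidate on the new
    # task (after a short adaptation budget, hidden inside `probe`) and pick the best.
    scores = {c: probe(c) for c in candidates}
    return max(scores, key=scores.get)


# Toy values, invented for illustration only.
source_return = {"pi_a": 0.95, "pi_b": 0.90, "pi_c": 0.88}   # return on the source task
toy_probe = {"pi_a": 0.40, "pi_b": 0.70, "pi_c": 0.55}       # return after brief adaptation

best_src = source_optimal(source_return)
best_transfer = transfer_optimal(list(source_return), toy_probe.get)
print(best_src, best_transfer)
```

In this toy case the two selectors disagree. The second choice is only available because more than one competent neighbor was retained.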
Conclusion
These findings argue for a shift in how continual reinforcement learning preserves knowledge. Maintaining reusable, competent policy neighborhoods, rather than a single preserved model, offers a route to more plastic lifelong agents that can adapt to new tasks and relearn earlier ones after interference.
