No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning
arXiv:2601.06794v2
Abstract: Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent’s error patterns shift over time, so a stationary critic becomes stale and its feedback loses utility.
To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation.
Challenges in Current RL Approaches
Current critique-guided reinforcement learning methods face significant challenges:
- Static Critics: Current methods often rely on static or offline critics that do not adapt as the agent’s policy changes.
- Diminishing Returns: As the agent learns and improves, the feedback from these critics becomes less relevant and less effective.
- Error Pattern Shifts: The evolving nature of the agent’s learning leads to shifting error patterns that stationary critics cannot accurately assess.
Introducing ECHO
ECHO addresses these challenges by optimizing the policy and the critic jointly. Its main components are:
- Cascaded Rollout Mechanism: The policy first produces an initial trajectory, and the critic then generates multiple diagnoses of that same trajectory.
- Group-Structured Advantage Estimation: The policy refines the initial trajectory once per diagnosis, and the resulting refinements form a group over which advantages are estimated.
- Saturation-Aware Gain Shaping: This objective rewards the critic for facilitating incremental improvements in trajectories that are already high-performing, so that the critic keeps receiving a learning signal as the policy approaches a plateau.
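The last two components can be illustrated with a small sketch. The summary does not give ECHO's formulas, so both functions below are assumptions: `group_advantages` uses a common group-relative normalization (subtract the group mean, divide by the group standard deviation), and `shaped_gain` is one hypothetical way to make a gain "saturation-aware" by dividing the improvement by the remaining headroom, assuming scores lie in [0, 1]. Neither should be read as the paper's actual objective.

```python
from statistics import mean, pstdev


def group_advantages(returns, eps=1e-8):
    """Score each refinement relative to its own group.

    Assumption: a group-relative normalization (mean 0, roughly unit scale).
    The exact estimator used by ECHO is not stated in the summary.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def shaped_gain(initial_score, refined_score, eps=1e-6):
    """Hypothetical saturation-aware critic reward.

    Divides the raw improvement by the headroom left above the initial
    score, so a small gain on an already strong trajectory is not drowned
    out. Assumes scores are in [0, 1].
    """
    headroom = max(1.0 - initial_score, eps)
    return (refined_score - initial_score) / headroom


# Returns of three refinements that share one initial trajectory.
advantages = group_advantages([0.2, 0.5, 0.8])

# The same raw gain of 0.05 earns more critic reward near saturation.
low_start = shaped_gain(0.20, 0.25)
high_start = shaped_gain(0.90, 0.95)
```

Under this assumed shaping, the critic is paid more for moving a trajectory from 0.90 to 0.95 than from 0.20 to 0.25, which matches the stated intent of rewarding incremental improvements in high-performing trajectories.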
Experimental Results
The experiments reported for ECHO show:
- Stable Training: ECHO provides a more stable training process compared to traditional methods.
- Higher Success Rates: Agents trained using ECHO exhibit improved long-horizon task success across various open-world environments.
- Dynamic Feedback Loop: Because the policy and critic are updated in sync, the critic’s feedback tracks the policy’s current error patterns instead of those of an earlier checkpoint.
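The synchronized loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `policy`, `critic`, `score`, `update_policy`, and `update_critic` are hypothetical callables supplied by the caller, and using the raw score improvement as the critic's reward is an assumption made for brevity.

```python
def cascaded_rollout(policy, critic, score, task, num_diagnoses=4):
    """One cascaded rollout: initial attempt, several diagnoses, one refinement each."""
    initial = policy(task, feedback=None)
    base = score(task, initial)
    # The critic diagnoses the same initial trajectory several times.
    diagnoses = [critic(task, initial, i) for i in range(num_diagnoses)]
    # The policy refines once per diagnosis; the refinements form one group.
    refined = [policy(task, feedback=d) for d in diagnoses]
    refined_scores = [score(task, r) for r in refined]
    return {
        "base_score": base,
        "policy_rewards": refined_scores,
        # Assumption: a diagnosis is credited with the improvement it led to.
        "critic_rewards": [s - base for s in refined_scores],
    }


def co_evolve(policy, critic, score, tasks, update_policy, update_critic):
    """Update policy and critic from the same rollouts so neither goes stale."""
    for task in tasks:
        batch = cascaded_rollout(policy, critic, score, task)
        update_policy(batch["policy_rewards"])
        update_critic(batch["critic_rewards"])
```

The point of the sketch is that both updates consume data from the same rollout, so the critic is always trained on errors the current policy actually makes.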
Conclusion
ECHO replaces the fixed critic of earlier critique-guided RL methods with one that is trained alongside the policy. By co-evolving the two, it keeps natural-language feedback matched to the policy’s current errors, and the reported results are more stable training and higher long-horizon task success in open-world environments. The same co-evolutionary idea may inform other adaptive training methods.
