Co-Evolution of Policy and Internal Reward for Language Agents
Summary: arXiv:2604.03098v1
Large language model (LLM) agents learn by interacting with their environments, but long-horizon training is difficult because rewards are sparse and delayed. Existing methods address this with post-hoc credit assignment or external reward models. Both provide limited guidance at inference time, and both decouple reward improvement from policy improvement.
Introduction to Self-Guide
To address these challenges, we introduce Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. During inference, the agent produces a short self-guidance signal and uses it to steer its next action. During training, the same signal is converted into a step-level internal reward, which gives policy optimization a denser learning signal than the sparse environment reward alone.
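As a rough illustration of how a per-step guidance signal could become a dense reward, the sketch below maps a short guidance label to a scalar and adds the sparse environment reward at the final step. The label set, the `GUIDE_SCORES` values, the `beta` weight, and the additive combination are all hypothetical choices made for this example; the summary above does not specify how Self-Guide performs the conversion.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping from a self-guidance label to a scalar internal reward.
# These labels and values are illustrative assumptions, not from the paper.
GUIDE_SCORES = {"promising": 1.0, "neutral": 0.0, "unpromising": -1.0}


@dataclass
class Step:
    guide: str   # short self-guidance label the agent emits before acting
    action: str  # action the agent takes after reading its own guidance


def step_rewards(steps: List[Step], env_reward: float, beta: float = 0.1) -> List[float]:
    """Return one reward per step.

    Every step receives a scaled internal reward derived from its guidance
    label; the sparse environment reward is added to the final step only.
    Unknown labels fall back to 0.0.
    """
    rewards = [beta * GUIDE_SCORES.get(s.guide, 0.0) for s in steps]
    if rewards:
        rewards[-1] += env_reward
    return rewards


trajectory = [
    Step("promising", "open drawer"),
    Step("unpromising", "look at wall"),
    Step("promising", "take key"),
]
dense = step_rewards(trajectory, env_reward=1.0, beta=0.5)
print(dense)
```

With environment reward alone, only the last step of this trajectory would carry any signal; here each intermediate step also receives feedback, which is the sense in which the internal reward is "denser".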
The Co-Evolving Loop
Self-Guide creates a co-evolving loop: a better policy produces better guidance, and better guidance further improves the policy when it is used as internal reward. The agent therefore does more than collect experience; it generates and refines its own internal reward while acting and learning.
Experimental Findings
We evaluated Self-Guide on three agent benchmarks:
- Inference-time self-guidance alone already yielded clear performance gains.
- Jointly evolving the policy and internal reward with GRPO (Group Relative Policy Optimization) brought a further improvement of 8% over baselines trained solely with environment reward.
- Together, these results suggest that self-generated internal rewards are a viable way to improve language agents.
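To make the GRPO result more concrete, here is a minimal sketch of the group-relative advantage that GRPO computes: several rollouts are sampled for the same task, and each rollout's return is normalized by the mean and standard deviation of its group. The function name, the use of population standard deviation, the `eps` term, and the example return values are assumptions for illustration. How Self-Guide aggregates step-level internal rewards into these returns is not stated in the summary, so the `with_internal` numbers are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's return against its own sampling group.

    advantage_i = (return_i - group mean) / (group std + eps)
    """
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in returns]


# Four rollouts of the same task; only the last one succeeds.
env_only = [0.0, 0.0, 0.0, 1.0]

# Hypothetical returns after adding summed internal rewards to each rollout.
with_internal = [0.2, -0.1, 0.3, 1.4]

adv_env = group_relative_advantages(env_only)
adv_mix = group_relative_advantages(with_internal)

print([round(a, 3) for a in adv_env])
print([round(a, 3) for a in adv_mix])
```

With environment reward only, the three failed rollouts receive identical advantages, so the policy update cannot tell a near miss from a poor attempt. Once an internal reward contributes to the return, the failed rollouts are ranked against each other, which is one plausible reading of why combining the two signals helps.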
Conclusion
In conclusion, Self-Guide lets language agents generate and refine their own internal rewards, which serve as guidance at inference time and as step-level supervision during training. The reported gains suggest that co-evolving policy and internal reward can be more effective than training on environment reward alone.
