Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
arXiv:2604.07165v1
Abstract
Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionately impact reasoning outcomes. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories.
Introduction
Reinforcement learning (RL) has opened new avenues for building agents capable of complex reasoning. However, sparse rewards remain a challenge in multi-step reasoning tasks. Conventional methods often overlook the relationships between steps in a trajectory, which leads to suboptimal learning. The proposed framework, T-STAR, addresses this limitation by organizing sampled trajectories into a tree and optimizing the policy over that structure.
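To make the uniform-credit issue concrete, the following minimal Python sketch shows a group-relative advantage in the style of Group Relative Policy Optimization: each trajectory's reward is normalized against its sampling group, and the resulting scalar is copied to every step of that trajectory. The function names, the toy data, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory-level reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an arbitrary choice for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_step_credit(trajectories, rewards):
    """Copy the trajectory-level advantage to every step of that trajectory."""
    advantages = group_relative_advantages(rewards)
    return [[adv] * len(steps) for steps, adv in zip(trajectories, advantages)]


# Hypothetical toy group: two trajectories that share their first step.
trajectories = [["a", "b", "c"], ["a", "d"]]
rewards = [1.0, 0.0]
credit = per_step_credit(trajectories, rewards)
```

In this toy group the shared first step "a" receives a positive advantage in one chain and a negative one in the other, while the steps where the chains actually diverge get no extra weight. That is the kind of signal a tree-structured view is meant to recover.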
T-STAR Framework
T-STAR operates under the premise that trajectories can be consolidated into a unified structure known as the Cognitive Tree. This tree is formed by identifying functionally similar steps in different trajectories and merging them into shared nodes, which exposes the reward structure that is hidden when the trajectories are treated as independent chains.
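The merging idea can be sketched in Python as a prefix tree over trajectories. This is a simplified illustration under stated assumptions: the paper's criterion for "functionally similar" steps is not given in this summary, so the hypothetical `canonical_key` helper below stands in for it with plain text normalization, and steps are merged only when their preceding steps also match. `Node` and `build_tree` are likewise illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    key: str
    children: dict = field(default_factory=dict)
    rewards: list = field(default_factory=list)  # rewards of trajectories passing through


def canonical_key(step: str) -> str:
    """Hypothetical stand-in for a similarity test: lowercase, collapse whitespace."""
    return " ".join(step.lower().split())


def build_tree(trajectories, rewards):
    """Merge trajectories into one tree; steps with equal keys share a node."""
    root = Node(key="<root>")
    for steps, reward in zip(trajectories, rewards):
        node = root
        node.rewards.append(reward)
        for step in steps:
            key = canonical_key(step)
            node = node.children.setdefault(key, Node(key=key))
            node.rewards.append(reward)
    return root


# Hypothetical toy data: two trajectories whose first steps differ only in formatting.
root = build_tree(
    [["Open drawer", "take key"], ["open  drawer", "Take knife"]],
    [1.0, 0.0],
)
```

After merging, the two trajectories share a single "open drawer" node that records both outcomes, and the tree branches only at the second step, where the trajectories actually differ.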
Key Components
- Introspective Valuation Mechanism: This component back-propagates trajectory-level rewards through the Cognitive Tree, yielding a variance-reduced relative advantage at the step level.
- In-Context Thought Grafting: This method produces corrective reasoning by contrasting successful and failed branches at critical divergence points, synthesizing insights from both paths.
- Surgical Policy Optimization: This component concentrates the policy gradient on the critical divergence points by applying a Bradley-Terry-type surgical loss there, rather than spreading updates uniformly over all steps.
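The valuation and loss components above can be sketched in Python as follows. This is a minimal sketch under several assumptions that do not come from the paper: a node's value is taken to be the mean reward of the trajectories passing through it, a child's advantage is measured against its parent's value, divergence points are found with a hypothetical `min_gap` threshold, and the loss is the generic Bradley-Terry pairwise form with step log-probabilities as scores. The tree is a plain nested dict so the block stays self-contained.

```python
import math


def node_value(node):
    """Assumed valuation: mean reward of the trajectories passing through the node."""
    return sum(node["rewards"]) / len(node["rewards"])


def sibling_advantages(parent):
    """Step-level advantage of each child, using the parent's value as baseline."""
    baseline = node_value(parent)
    return {k: node_value(c) - baseline for k, c in parent["children"].items()}


def divergence_points(node, min_gap=0.5, path=()):
    """Yield (path, best_child, worst_child) where sibling values differ by >= min_gap."""
    children = node["children"]
    if len(children) >= 2:
        ranked = sorted(children, key=lambda k: node_value(children[k]))
        worst, best = ranked[0], ranked[-1]
        if node_value(children[best]) - node_value(children[worst]) >= min_gap:
            yield path, best, worst
    for key, child in children.items():
        yield from divergence_points(child, min_gap, path + (key,))


def bradley_terry_loss(logp_preferred, logp_rejected, beta=1.0):
    """Generic pairwise loss: -log sigmoid(beta * (logp_preferred - logp_rejected))."""
    x = -beta * (logp_preferred - logp_rejected)
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical toy tree: three trajectories, one of which succeeded (reward 1.0).
tree = {
    "rewards": [1.0, 0.0, 0.0],
    "children": {
        "a": {
            "rewards": [1.0, 0.0],
            "children": {
                "b": {"rewards": [1.0], "children": {}},
                "c": {"rewards": [0.0], "children": {}},
            },
        },
        "d": {"rewards": [0.0], "children": {}},
    },
}
critical = list(divergence_points(tree, min_gap=0.9))
```

With this toy tree and `min_gap=0.9`, the only divergence point is below step "a", where branch "b" succeeded and branch "c" failed. In a grafting setup of the kind described above, that pair is what would be contrasted, and the pairwise loss would be applied to the policy's log-probabilities for the two branches at that node.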
Experimental Results
We conducted extensive experiments across various benchmarks, including embodied, interactive, reasoning, and planning tasks. The results indicate that T-STAR consistently outperforms strong baselines, particularly in scenarios requiring extended reasoning chains. The framework’s ability to leverage the Cognitive Tree structure significantly enhances the agent’s capability to navigate complex decision-making environments.
Conclusion
T-STAR advances reinforcement learning for Large Language Model agents by recovering latent reward structure across trajectories and learning over a tree of merged steps rather than over independent chains. The results suggest that this approach addresses the uniform-credit limitation of existing methods and improves multi-turn reasoning, particularly on tasks with extended reasoning chains.
For further information, please refer to the full paper available on arXiv.
