T-STAR: Tree-Based Policy Optimization for Multi-Turn Agents

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Source: arXiv:2604.07165v1

Abstract

Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionately impact reasoning outcomes. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories.
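To make the uniform-credit point concrete, here is a minimal sketch of group-relative advantages as they are commonly described for Group Relative Policy Optimization: each trajectory's reward is standardized against its group, and that single number is copied to every step of the trajectory. The function name, the use of the population standard deviation, the epsilon, and the toy rewards are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, steps_per_traj, eps=1e-8):
    """Standardize trajectory rewards within a group, then broadcast to steps."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    per_traj = [(r - mu) / (sigma + eps) for r in rewards]
    # Uniform credit: every step in a trajectory gets that trajectory's advantage.
    return [[a] * n for a, n in zip(per_traj, steps_per_traj)]

# Four sampled trajectories with sparse 0/1 rewards and different lengths.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 3])
for traj_adv in advs:
    print(traj_adv)
```

Every step inside a trajectory ends up with an identical advantage, so a decisive step and an irrelevant one are reinforced equally. This is the limitation the tree-based view is meant to address.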

Introduction

Advances in reinforcement learning (RL) have opened new avenues for building agents capable of complex reasoning. However, sparse rewards remain a challenge in multi-step reasoning scenarios. Conventional methods often overlook the relationships between steps, both within a trajectory and across trajectories, which leads to suboptimal learning outcomes. Our proposed framework, T-STAR, addresses these limitations by organizing sampled trajectories into a tree and optimizing the policy over that structure.

T-STAR Framework

T-STAR operates under the premise that trajectories can be consolidated into a unified structure known as the Cognitive Tree. This tree is formed by identifying and merging functionally similar steps or nodes from different trajectories, which exposes how rewards correlate across trajectories that would otherwise be treated as independent.
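The merging idea can be sketched as a prefix tree over trajectories. This is a rough illustration only: the source does not say how functional similarity is decided, so the hypothetical `step_key` function below simply normalizes case and whitespace as a stand-in. The `Node` class, `build_tree`, and the toy trajectories are likewise assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    children: dict = field(default_factory=dict)
    rewards: list = field(default_factory=list)  # rewards of trajectories passing through

def step_key(step: str) -> str:
    # Hypothetical stand-in for "functionally similar": normalize case and spacing.
    return " ".join(step.lower().split())

def build_tree(trajectories):
    """Merge trajectories that share a prefix of equivalent steps into one tree."""
    root = Node(key="<root>")
    for steps, reward in trajectories:
        node = root
        node.rewards.append(reward)
        for step in steps:
            k = step_key(step)
            # Reuse the existing child when an equivalent step was already seen here.
            node = node.children.setdefault(k, Node(key=k))
            node.rewards.append(reward)
    return root

trajs = [
    (["Open drawer", "take key", "unlock door"], 1.0),
    (["open  drawer", "Take key", "open window"], 0.0),
    (["look around", "open window"], 0.0),
]
tree = build_tree(trajs)
shared = tree.children["open drawer"].children["take key"]
print(sorted(shared.children))  # the two branches that diverge after "take key"
```

The first two trajectories share their opening steps, so they collapse into one path that splits only where they actually differ. The node at the split sees both a successful and a failed outcome.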

Key Components

Introspective Valuation Mechanism: Back-propagates trajectory-level rewards through the Cognitive Tree, yielding a variance-reduced relative advantage at the step level rather than one uniform value per trajectory.
In-Context Thought Grafting: Synthesizes corrective reasoning by contrasting successful and failed branches at critical divergence points, drawing on both paths to improve decision-making.
Surgical Policy Optimization: Concentrates updates on critical steps, where policy gradient information is densest, using a Bradley-Terry type of surgical loss with the aim of improving learning efficiency and accuracy.
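The three components above can be illustrated together on a toy tree. Everything here is a sketch under stated assumptions rather than the paper's formulation: node values are taken as the mean of child values, a step's advantage is its node value minus its parent's value, the critical divergence point is the node whose children differ most in value, the grafting prompt wording is invented, and the pairwise loss is the standard Bradley-Terry form, -log sigmoid(beta * (logp_good - logp_bad)), with a hypothetical `beta` parameter.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    reward: float = 0.0              # only meaningful on leaves
    children: list = field(default_factory=list)

def value(node):
    # Assumption: a node's value is the mean of its children's values.
    if not node.children:
        return node.reward
    return sum(value(c) for c in node.children) / len(node.children)

def step_advantages(node, out=None):
    # Advantage of a step = value of its node minus value of its parent.
    # Keys are (parent key, child key): fine for this toy tree, not unique in general.
    out = {} if out is None else out
    v_parent = value(node)
    for c in node.children:
        out[(node.key, c.key)] = value(c) - v_parent
        step_advantages(c, out)
    return out

def most_critical(node):
    # Divergence point = node whose children differ most in value.
    best = (node, 0.0)
    if len(node.children) >= 2:
        vals = [value(c) for c in node.children]
        best = (node, max(vals) - min(vals))
    for c in node.children:
        cand = most_critical(c)
        if cand[1] > best[1]:
            best = cand
    return best

def grafting_prompt(prefix, good_step, bad_step):
    # Hypothetical prompt that contrasts the two branches at a divergence point.
    return (
        "Shared steps so far: " + " -> ".join(prefix) + "\n"
        f"Branch that failed: {bad_step}\n"
        f"Branch that succeeded: {good_step}\n"
        "Explain what went wrong and write a corrected next step."
    )

def bradley_terry_loss(logp_good, logp_bad, beta=1.0):
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    margin = beta * (logp_good - logp_bad)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

tree = Node("<root>", children=[
    Node("open drawer", children=[
        Node("take key", children=[
            Node("unlock door", reward=1.0),
            Node("open window", reward=0.0),
        ]),
    ]),
    Node("look around", children=[Node("open window", reward=0.0)]),
])

adv = step_advantages(tree)
crit, gap = most_critical(tree)
good = max(crit.children, key=value)
bad = min(crit.children, key=value)
prompt = grafting_prompt(["open drawer", "take key"], good.key, bad.key)
loss = bradley_terry_loss(logp_good=-1.0, logp_bad=-3.0)
print(crit.key, gap)
print(prompt)
```

In this toy tree the steps leading up to the split receive little or no advantage, while the two branches after "take key" receive opposite-signed advantages. That is the sense in which the learning signal concentrates at a divergence point, and the pairwise loss is applied only to the preferred and rejected step at that point.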

Experimental Results

We conducted extensive experiments across various benchmarks, including embodied, interactive, reasoning, and planning tasks. The results indicate that T-STAR consistently outperforms strong baselines, particularly in scenarios requiring extended reasoning chains. The framework’s ability to leverage the Cognitive Tree structure significantly enhances the agent’s capability to navigate complex decision-making environments.

Conclusion

T-STAR represents a significant advancement in the field of reinforcement learning for Large Language Model agents. By recovering latent reward structures and enhancing the learning process through tree-based techniques, we pave the way for more effective multi-turn reasoning capabilities. The results suggest that our approach not only addresses the limitations of existing methods but also sets a new standard for future research in this domain.

For further information, please refer to the full paper available on arXiv.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
