Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
Summary: arXiv:2604.02869v1
Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator.
Challenges in Training Tool-Calling Agents
Training agents that can effectively communicate and perform tasks over multiple conversation turns poses significant challenges. The two main issues are:
- Sparse Outcome Rewards: The reward depends on the final task outcome, so the agent receives little feedback on the individual actions it took along the way.
- Credit Assignment: It is hard to determine which turns in a conversation actually contributed to a successful outcome.
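The two issues above can be illustrated with a minimal sketch of an outcome-only, group-relative advantage. The function names, the example rewards, and the use of the population standard deviation are assumptions made for illustration, not the paper's implementation. The point is that when one outcome reward is standardized within a group of rollouts, every turn of a rollout inherits the same advantage, so nothing distinguishes a helpful turn from a harmful one.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, eps=1e-6):
    """Standardize one outcome reward per rollout within its group."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def broadcast_to_turns(advantages, turns_per_rollout):
    """Copy each rollout's single advantage to every one of its turns."""
    return [[a] * n for a, n in zip(advantages, turns_per_rollout)]


# Hypothetical group of 4 rollouts for one task: 1.0 = solved, 0.0 = failed.
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [3, 5, 2, 4]  # number of agent turns in each rollout

adv = group_relative_advantages(rewards)
per_turn = broadcast_to_turns(adv, turns)

for rollout_adv in per_turn:
    print([round(a, 3) for a in rollout_adv])
```

Each printed row holds one repeated value: all turns of a successful rollout are reinforced equally, and all turns of a failed rollout are penalized equally.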
Innovative Approaches to Reinforcement Learning
In this study, we combine MT-GRPO (Multi-Turn Group Relative Policy Optimization) with GTPO (Generalized Token-level Policy Optimization). This hybrid approach is designed to make training of tool-calling agents more efficient and effective.
Key Findings
A systematic analysis of training rollouts produced the following findings, which shaped our methodology:
- Naively designed dense per-turn rewards can degrade performance by up to 14 percentage points. This is primarily due to misalignment between reward discriminativeness and advantage direction.
- Our Iterative Reward Calibration methodology designs per-turn rewards from an empirical discriminative analysis of rollout data, so that the rewards stay aligned with what the agent should learn.
- The GTPO hybrid advantage formulation effectively addresses the advantage misalignment problem, leading to improved agent performance.
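One way to picture an empirical discriminative analysis of rollout data is sketched below. This is a sketch under stated assumptions, not the paper's procedure: the rollout schema, the reward component names (`valid_tool_call`, `asked_clarification`), and the `min_gap` threshold are all hypothetical. For each per-turn reward component, it compares the mean value in successful rollouts with the mean in failed rollouts and flags components that are not higher in the successful ones, since rewarding those would push the policy in the wrong direction.

```python
from statistics import mean


def discriminative_gap(rollouts, component):
    """Mean reward for `component` in successful minus failed rollouts."""
    succ = [r["turn_rewards"][component] for r in rollouts if r["success"]]
    fail = [r["turn_rewards"][component] for r in rollouts if not r["success"]]
    if not succ or not fail:
        return None  # cannot compare without both outcomes present
    return mean(succ) - mean(fail)


def flag_misaligned(rollouts, components, min_gap=0.0):
    """Return components whose reward is not higher in successful rollouts."""
    flagged = []
    for c in components:
        gap = discriminative_gap(rollouts, c)
        if gap is not None and gap <= min_gap:
            flagged.append(c)
    return flagged


# Hypothetical rollouts; each value is a rollout-level mean of a per-turn reward.
rollouts = [
    {"success": True, "turn_rewards": {"valid_tool_call": 0.9, "asked_clarification": 0.2}},
    {"success": True, "turn_rewards": {"valid_tool_call": 0.8, "asked_clarification": 0.1}},
    {"success": False, "turn_rewards": {"valid_tool_call": 0.4, "asked_clarification": 0.6}},
    {"success": False, "turn_rewards": {"valid_tool_call": 0.5, "asked_clarification": 0.5}},
]

components = ["valid_tool_call", "asked_clarification"]
flagged = flag_misaligned(rollouts, components)
print(flagged)
```

In this toy data, `asked_clarification` scores higher in failed rollouts than in successful ones, so it is flagged as a candidate for re-weighting or removal before the next calibration round.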
Performance Improvements
On the Tau-Bench airline benchmark, our approach yielded the following results:
- Qwen3.5-4B improved from 63.8% to 66.7% (+2.9 percentage points).
- Qwen3-30B-A3B increased from 58.0% to 69.5% (+11.5 percentage points).
- The trained 4B model outperformed GPT-4.1 (49.4%) and GPT-4o (42.8%) despite being 50 times smaller.
- The trained Qwen3-30B-A3B model (a 30.5B-parameter MoE) approached Claude Sonnet 4.5 (70.0%).
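The gains above are stated in percentage points, which differ from relative improvement. The short sketch below recomputes both from the reported before and after scores; the `gains` helper is a hypothetical name used only for this illustration.

```python
# Reported Tau-Bench airline scores (percent): (before training, after training).
results = {
    "Qwen3.5-4B": (63.8, 66.7),
    "Qwen3-30B-A3B": (58.0, 69.5),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return round(points, 1), round(relative, 1)


for model, (before, after) in results.items():
    points, relative = gains(before, after)
    print(f"{model}: +{points} points, +{relative}% relative")
```

The absolute gains match the figures listed above (+2.9 and +11.5 points); the corresponding relative gains are about 4.5% and 19.8%.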
Conclusion and Future Work
To our knowledge, these are the first published reinforcement learning training results on the Tau-Bench benchmark. We have released our code, reward calibration analysis, and training recipes to support further research and development.
