Multi-Turn RL for Tool-Calling Agents with Reward Calibration

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Summary: arXiv:2604.02869v1

Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator.

Challenges in Training Tool-Calling Agents

Training agents that can effectively communicate and perform tasks over multiple conversation turns poses significant challenges. The two main issues are:

Sparse Outcome Rewards: The reward typically arrives only once, when the task outcome is known, so most turns receive no direct learning signal.
Credit Assignment: With a single outcome signal, it is hard to determine which turns in a conversation contributed to success or failure.
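To make the two issues concrete, here is a minimal sketch of an outcome-only reward setup. The `Turn` class, the action names, and `outcome_only_returns` are hypothetical and used only for illustration; they are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action: str          # hypothetical tool call or message
    reward: float = 0.0  # stays zero when only the final outcome is rewarded


def outcome_only_returns(turns, outcome_reward):
    """Give every turn the same end-of-episode reward (undiscounted)."""
    return [outcome_reward for _ in turns]


# A failed episode: the first turn was reasonable, the second caused the failure.
episode = [
    Turn("lookup_reservation"),
    Turn("cancel_wrong_flight"),
    Turn("send_confirmation"),
]
returns = outcome_only_returns(episode, outcome_reward=0.0)

for turn, ret in zip(episode, returns):
    print(f"{turn.action}: return={ret}")
```

Every turn receives the same return, so the training signal cannot distinguish the correct lookup from the wrong cancellation. That is the credit assignment problem in miniature.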

Innovative Approaches to Reinforcement Learning

In this study, we combine MT-GRPO (Multi-Turn Group Relative Policy Optimization) with GTPO (Generalized Token-level Policy Optimization) to train a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. The combination targets the two challenges above: sparse outcome rewards and credit assignment across turns.
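The core idea behind group-relative methods can be sketched as follows: sample several rollouts for the same task and score each one relative to the group mean. The code below is a simplified illustration, not the paper's MT-GRPO or GTPO formulation. The way turn-level and outcome-level advantages are summed, the weight `lam`, and the equal-length assumption are all choices made for this sketch.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-8):
    """Standardize values within a group: (v - mean) / (std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(turn_rewards, outcome_rewards, lam=0.5):
    """Per-turn advantages for a group of rollouts of the same task.

    turn_rewards[i][t] is the per-turn reward of rollout i at turn t
    (all rollouts are assumed to have the same number of turns).
    outcome_rewards[i] is the final task reward of rollout i.
    lam is a hypothetical weight on the turn-level term.
    """
    outcome_adv = group_relative(outcome_rewards)
    n_turns = len(turn_rewards[0])
    # Normalize each turn index across the group.
    turn_adv = [
        group_relative([rollout[t] for rollout in turn_rewards])
        for t in range(n_turns)
    ]
    return [
        [lam * turn_adv[t][i] + outcome_adv[i] for t in range(n_turns)]
        for i in range(len(outcome_rewards))
    ]


# Two rollouts, two turns each; only the first rollout solves the task.
advantages = combined_advantages(
    turn_rewards=[[0.0, 1.0], [0.0, 0.0]],
    outcome_rewards=[1.0, 0.0],
)
```

In this toy group, the first turn earns the same per-turn reward in both rollouts, so its advantage comes from the outcome term alone. The second turn also carries a turn-level signal.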

Key Findings

A systematic analysis of training rollouts produced three findings:

Naively designed dense per-turn rewards can degrade performance by up to 14 percentage points. The primary cause is a misalignment between reward discriminativeness and advantage direction.
Our Iterative Reward Calibration methodology designs per-turn rewards using empirical discriminative analysis of rollout data, so that the rewards stay aligned with the direction in which the agent should learn.
The GTPO hybrid advantage formulation eliminates the advantage misalignment problem, leading to improved agent performance.
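One plausible reading of "empirical discriminative analysis" is to check, for each per-turn reward component, whether it scores higher in successful rollouts than in failed ones. The sketch below follows that reading. The paper's actual procedure may differ, and the component names, rollout records, and `min_gap` threshold are hypothetical.

```python
from statistics import mean


def discriminative_gap(rollouts, component):
    """Mean reward in successful rollouts minus mean reward in failed ones."""
    succ = [r["rewards"][component] for r in rollouts if r["success"]]
    fail = [r["rewards"][component] for r in rollouts if not r["success"]]
    if not succ or not fail:
        return None  # cannot compare without both outcomes
    return mean(succ) - mean(fail)


def flag_components(rollouts, components, min_gap=0.0):
    """Label each reward component by whether it favors successful rollouts."""
    report = {}
    for component in components:
        gap = discriminative_gap(rollouts, component)
        if gap is None:
            report[component] = "undetermined"
        elif gap > min_gap:
            report[component] = "aligned"
        else:
            report[component] = "misaligned"
    return report


# Hypothetical rollout records with two per-turn reward components.
rollouts = [
    {"success": True, "rewards": {"tool_call_valid": 0.9, "verbosity_bonus": 0.2}},
    {"success": True, "rewards": {"tool_call_valid": 0.8, "verbosity_bonus": 0.1}},
    {"success": False, "rewards": {"tool_call_valid": 0.4, "verbosity_bonus": 0.6}},
    {"success": False, "rewards": {"tool_call_valid": 0.5, "verbosity_bonus": 0.7}},
]
report = flag_components(rollouts, ["tool_call_valid", "verbosity_bonus"])
print(report)
```

A component flagged as misaligned pays more in failed rollouts than in successful ones, so optimizing it pushes the policy the wrong way. Under this reading, an iterative calibration loop would adjust or drop such components, retrain, and repeat the check on fresh rollouts.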

Performance Improvements

On the Tau-Bench airline benchmark, our approach produced the following results:

Qwen3.5-4B improved from 63.8% to 66.7% (+2.9 percentage points).
Qwen3-30B-A3B increased from 58.0% to 69.5% (+11.5 percentage points).
The trained 4B model outperformed GPT-4.1 (49.4%) and GPT-4o (42.8%) despite being 50 times smaller.
The 30.5B MoE model approached Claude Sonnet 4.5 (70.0%), trailing it by 0.5 percentage points.
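The gains above are absolute percentage points, not relative improvements. A short check of the arithmetic, using only the scores quoted in this post (the `gains` helper is illustrative):

```python
# Scores (%) on the Tau-Bench airline benchmark, as quoted in this post.
reported = {
    "Qwen3.5-4B": (63.8, 66.7),
    "Qwen3-30B-A3B": (58.0, 69.5),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    points = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return points, relative


summary = {model: gains(b, a) for model, (b, a) in reported.items()}
for model, (points, relative) in summary.items():
    print(f"{model}: +{points} points ({relative}% relative)")
```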

Conclusion and Future Work

To our knowledge, these are the first published reinforcement learning training results on the Tau-Bench benchmark. We have released our code, reward calibration analysis, and training recipes to support further research.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
