Multi-Turn RL for Tool-Calling Agents with Reward Calibration

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Source: arXiv:2604.02869v1

Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator.

Challenges in Training Tool-Calling Agents

Training agents to hold a conversation and complete tasks across multiple turns is hard for two main reasons:

  • Sparse Outcome Rewards: The reward typically arrives only once, at the end of an episode, so the agent gets little signal about its individual actions.
  • Credit Assignment: It is hard to determine which turns in a conversation contributed to a successful outcome.
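
To make the two problems concrete, here is a minimal sketch of group-relative advantages computed from outcome rewards alone. It uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name, the rollout data, and the turn counts are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


# Hypothetical group of four rollouts for one task: 1.0 = solved, 0.0 = failed.
outcome_rewards = [1.0, 0.0, 0.0, 1.0]
turns_per_rollout = [3, 5, 4, 6]

advantages = group_relative_advantages(outcome_rewards)

# With only an outcome reward, every turn of a rollout inherits the same
# advantage, so a helpful turn and a harmful turn are indistinguishable.
per_turn_advantages = [[a] * n for a, n in zip(advantages, turns_per_rollout)]

for rollout in per_turn_advantages:
    print([round(a, 3) for a in rollout])
```

In this sketch all five turns of the second rollout are penalized equally, even if only one of them caused the failure. That is the credit assignment gap that per-turn rewards are meant to close.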

Innovative Approaches to Reinforcement Learning

In this study, we combine MT-GRPO (Multi-Turn Group Relative Policy Optimization) with GTPO (Generalized Token-level Policy Optimization). This hybrid approach aims to make training tool-calling agents more efficient and effective.

Key Findings

A systematic analysis of training rollouts produced three findings that shaped our methodology:

  • Naively designed dense per-turn rewards can lead to a performance degradation of up to 14 percentage points. This is primarily due to misalignment between reward discriminativeness and advantage direction.
  • Our Iterative Reward Calibration methodology designs per-turn rewards from an empirical discriminative analysis of rollout data, so that the rewards stay aligned with what the agent should learn.
  • The GTPO hybrid advantage formulation effectively addresses the advantage misalignment problem, leading to improved agent performance.
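
One way to picture a discriminative analysis of rollout data is sketched below. This is an illustration under assumptions, not the paper's procedure: the reward component names (`valid_call`, `verbosity`), the rollout records, and the `discriminative_gap` helper are all hypothetical. The idea is to check, per reward component, whether successful rollouts actually earn more of it than failed ones.

```python
from statistics import mean


def discriminative_gap(rollouts, component):
    """Mean per-turn reward in successful rollouts minus failed rollouts.

    Assumes at least one successful and one failed rollout are present.
    """
    def average(success_flag):
        values = [
            turn[component]
            for rollout in rollouts
            if rollout["success"] == success_flag
            for turn in rollout["turn_rewards"]
        ]
        return mean(values)

    return average(True) - average(False)


# Hypothetical rollout records with two made-up per-turn reward components.
rollouts = [
    {"success": True, "turn_rewards": [
        {"valid_call": 1.0, "verbosity": 0.2},
        {"valid_call": 1.0, "verbosity": 0.1},
    ]},
    {"success": False, "turn_rewards": [
        {"valid_call": 0.0, "verbosity": 0.9},
        {"valid_call": 1.0, "verbosity": 0.7},
    ]},
]

gaps = {c: discriminative_gap(rollouts, c) for c in ("valid_call", "verbosity")}

# A positive gap means the component pays more in successful rollouts.
# A negative gap means it pays more in failed ones, so optimizing it
# could push the policy in the wrong direction.
aligned = [c for c, gap in gaps.items() if gap > 0]
misaligned = [c for c, gap in gaps.items() if gap <= 0]
print(aligned, misaligned)
```

Under this reading, a component with a negative gap would be a candidate for removal or redesign before the next training iteration.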

Performance Improvements

On the Tau-Bench airline benchmark, our approach produced the following results:

  • Qwen3.5-4B improved from 63.8% to 66.7% (+2.9 percentage points).
  • Qwen3-30B-A3B increased from 58.0% to 69.5% (+11.5 percentage points).
  • The trained Qwen3.5-4B model outperformed GPT-4.1 (49.4%) and GPT-4o (42.8%) despite being 50 times smaller.
  • The trained Qwen3-30B-A3B model (a 30.5B-parameter MoE) came within 0.5 percentage points of Claude Sonnet 4.5 (70.0%).
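
The percentage-point figures above can be checked with a few lines of arithmetic. The scores are the ones reported in this post; the variable names are only for illustration.

```python
# (baseline %, trained %) on the Tau-Bench airline benchmark, as reported above.
results = {
    "Qwen3.5-4B": (63.8, 66.7),
    "Qwen3-30B-A3B": (58.0, 69.5),
}

# Reference scores quoted in the post.
references = {"GPT-4.1": 49.4, "GPT-4o": 42.8, "Claude Sonnet 4.5": 70.0}

# Gain in percentage points, rounded to one decimal to hide float noise.
gains = {
    model: round(after - before, 1)
    for model, (before, after) in results.items()
}

trained_4b = results["Qwen3.5-4B"][1]
trained_30b = results["Qwen3-30B-A3B"][1]
gap_to_sonnet = round(references["Claude Sonnet 4.5"] - trained_30b, 1)

print(gains)
print(trained_4b > references["GPT-4.1"], trained_4b > references["GPT-4o"])
print(gap_to_sonnet)
```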

Conclusion and Future Work

To our knowledge, these are the first published reinforcement learning training results on the Tau-Bench benchmark. We have released our code, reward calibration analysis, and training recipes to support further research and development.


Lazarus Omolua, https://richlyai.com/blog