Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
arXiv:2604.09813v1
Abstract
Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity.
Introduction
Synthetic data generation is widely used to train tool-using agents, but traditional approaches often rely on static datasets that do not account for the dynamic nature of real-world environments. Such datasets suit offline supervised fine-tuning, whereas reinforcement learning needs environments in which an agent can act online and have its outcome checked automatically. This gap calls for a data synthesis approach that provides both controllability and verifiability in tool-use scenarios.
The COVERT Approach
Our proposed method, COVERT, is a two-stage pipeline for generating tool-use data. The stages are as follows:
- Stage One: Base Trajectory Generation
This stage generates reliable base tool-use trajectories through self-evolving synthesis combined with multi-level validation. The validation is intended to keep the base data accurate enough to serve as ground truth in the second stage.
- Stage Two: Oracle-Preserving Augmentations
In this stage, we apply several augmentations to the base trajectories to systematically increase environmental complexity. These include:
- Distractor tools that challenge the agent’s decision-making.
- Indirect or ambiguous user queries that test the agent’s comprehension abilities.
- Noisy, multi-format, or erroneous tool outputs to simulate real-world unpredictability.
Importantly, these augmentations maintain the integrity of the oracle tool calls and the final answers, which serve as the ground truth for training purposes.
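To make the oracle-preserving idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' implementation: the `Episode` structure, the function names, and the example tools are all hypothetical. It covers two of the three augmentation types (distractor tools and noisy tool outputs) and checks the invariant that the oracle calls and final answer are unchanged. Rewriting a query into an indirect or ambiguous form would typically need a language model and is left out.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One synthetic tool-use task (hypothetical structure, not from the paper)."""
    query: str
    tools: list          # tool names exposed to the agent
    oracle_calls: list   # ground-truth calls as (tool_name, arguments) tuples
    final_answer: str    # ground-truth final answer
    tool_outputs: dict = field(default_factory=dict)  # tool_name -> raw output


def add_distractor_tools(ep, distractors, rng):
    """Expose extra, irrelevant tools; the oracle calls stay untouched."""
    out = copy.deepcopy(ep)
    extra = [t for t in distractors if t not in out.tools]
    out.tools = out.tools + extra
    rng.shuffle(out.tools)
    return out


def add_output_noise(ep, rng):
    """Prepend log-style noise to each tool output without altering its payload."""
    out = copy.deepcopy(ep)
    for name, text in out.tool_outputs.items():
        prefix = rng.choice(["[INFO] ", "DEBUG: ", "<!-- cached --> "])
        out.tool_outputs[name] = prefix + text
    return out


def oracle_preserved(original, augmented):
    """Invariant: oracle calls and final answer are identical after augmentation."""
    return (original.oracle_calls == augmented.oracle_calls
            and original.final_answer == augmented.final_answer)


# Example episode with made-up tools and values.
rng = random.Random(0)
base = Episode(
    query="What is the weather in Paris?",
    tools=["get_weather"],
    oracle_calls=[("get_weather", {"city": "Paris"})],
    final_answer="18 C and cloudy",
    tool_outputs={"get_weather": '{"temp_c": 18, "sky": "cloudy"}'},
)
aug = add_output_noise(
    add_distractor_tools(base, ["get_stock_price", "send_email"], rng), rng
)
assert oracle_preserved(base, aug)
```

Because every augmentation works on a deep copy and never touches `oracle_calls` or `final_answer`, the same reference can score rollouts in both the base and the augmented environment.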
Benefits of COVERT
Because the oracle is preserved, COVERT allows rewards to be computed automatically through reference matching in standard cases. For special behaviors that reference matching cannot score, such as error detection, it uses lightweight judge-assisted verification. Together, these make the synthetic environments usable for reinforcement learning optimization of tool-calling policies.
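The reward logic described here can be sketched as follows. This is a simplified illustration under stated assumptions: the exact matching rule (ordered, exact match of tool names and arguments), the `expects_error_report` flag, and the keyword-based `keyword_judge` stand-in are all hypothetical choices, and a real judge would presumably be a model call.

```python
import json


def normalize_call(call):
    """Canonicalize a (name, arguments) call so argument order does not matter."""
    name, args = call
    return name, json.dumps(args, sort_keys=True)


def reference_match_reward(predicted_calls, oracle_calls):
    """Return 1.0 if the predicted calls equal the oracle calls, else 0.0."""
    pred = [normalize_call(c) for c in predicted_calls]
    gold = [normalize_call(c) for c in oracle_calls]
    return 1.0 if pred == gold else 0.0


def compute_reward(predicted_calls, oracle_calls, response_text="",
                   expects_error_report=False, judge=None):
    """Use reference matching by default; defer to a judge for special cases."""
    if expects_error_report:
        if judge is None:
            raise ValueError("a judge is required for special-behavior episodes")
        return 1.0 if judge(response_text) else 0.0
    return reference_match_reward(predicted_calls, oracle_calls)


def keyword_judge(response_text):
    """Stand-in judge (assumption): checks whether the reply mentions an error."""
    return "error" in response_text.lower()


# Example with a made-up oracle call.
oracle = [("get_weather", {"city": "Paris", "unit": "C"})]
standard = compute_reward([("get_weather", {"unit": "C", "city": "Paris"})], oracle)
special = compute_reward([], oracle, "The tool returned an error.",
                         expects_error_report=True, judge=keyword_judge)
```

Keeping the judge on a separate branch means most episodes are scored by a cheap deterministic comparison, and only the special cases pay for a judge call.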
Results
In experiments with the Qwen2.5-Instruct-14B model, COVERT-RL improved accuracy on two tool-use benchmarks:
- BFCL v3 accuracy increased from 56.5 to 59.9.
- ACEBench accuracy improved from 53.0 to 59.3.
Moreover, when COVERT-RL was combined with supervised fine-tuning (SFT), accuracy rose further to 62.1 and 61.8 on the two benchmarks, respectively, indicating that the gains from RL and SFT are additive.
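For reference, the gains implied by these figures can be worked out directly. The short sketch below assumes that 62.1 and 61.8 correspond to BFCL v3 and ACEBench in that order, following the order of the list above. The text does not report an SFT-only baseline, so the combined gain is measured against the base model and cannot be split into separate SFT and RL contributions here.

```python
# Scores as reported in the text: (base model, COVERT-RL, SFT combined with COVERT-RL).
# The mapping of 62.1 / 61.8 to the two benchmarks is assumed from list order.
scores = {
    "BFCL v3": (56.5, 59.9, 62.1),
    "ACEBench": (53.0, 59.3, 61.8),
}

# Gain over the base model in accuracy points, rounded to one decimal place.
gains = {
    name: (round(rl - base, 1), round(combined - base, 1))
    for name, (base, rl, combined) in scores.items()
}

for name, (rl_gain, combined_gain) in gains.items():
    print(f"{name}: RL {rl_gain:+.1f}, SFT+RL {combined_gain:+.1f}")
```

Under that assumption, RL alone adds 3.4 points on BFCL v3 and 6.3 on ACEBench, while the combined setup sits 5.6 and 8.8 points above the base model.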
Conclusion
Our results suggest that oracle-preserving synthetic environments provide a practical refinement stage for reinforcement learning, complementing supervised fine-tuning. In these experiments, COVERT improved the robustness of agentic tool use under ambiguity and unreliable tool feedback.
