E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
arXiv:2604.09455v1
Abstract: While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training.
Introduction
Large Language Models have shown strong potential in Tool-Integrated Reasoning (TIR), where a model interleaves its own reasoning with calls to external tools. The two common ways of training such agents, Zero-RL and SFT-then-RL, each have shortcomings that limit performance and motivate a different approach.
Challenges in Existing Paradigms
Two primary challenges are evident in the existing training paradigms:
- Zero-RL: Without prior guidance, this approach suffers from inefficient exploration and mode degradation.
- SFT-then-RL: This method incurs high data costs and runs into capability plateaus caused by low-entropy collapse.
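To make "low-entropy collapse" concrete, the short sketch below compares the Shannon entropy of two toy action distributions. The distributions and variable names are illustrative assumptions, not values from the paper; the point is only that a policy concentrated on one action has little entropy left and therefore explores little.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative policy that still spreads probability over four actions.
exploratory = [0.25, 0.25, 0.25, 0.25]
# Illustrative policy that has collapsed onto a single action.
collapsed = [0.97, 0.01, 0.01, 0.01]

print("exploratory:", entropy(exploratory))
print("collapsed:  ", entropy(collapsed))
```

The uniform distribution reaches the maximum entropy for four actions, ln 4, while the collapsed one is far below it.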
Introducing E3-TIR
To address these limitations, we introduce E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. The approach dynamically integrates three distinct experience types:
- Expert Prefixes: Utilizing knowledge from experienced models to anchor learning.
- Expert Guided: Incorporating guidance from experts to refine decision-making processes.
- Self-Exploration: Encouraging the model to explore its own capabilities and limits.
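The text does not specify how the three experience types are combined. As a minimal sketch of one possible form of "dynamic integration", the code below samples an experience type per task, with weights driven by how often the model already solves that task on its own. The rule in `mixing_weights`, the 50/50 split between the two expert types, and all names are hypothetical assumptions for illustration, not the paper's method.

```python
import random
from enum import Enum

class Experience(Enum):
    EXPERT_PREFIX = "expert_prefix"
    EXPERT_GUIDED = "expert_guided"
    SELF_EXPLORATION = "self_exploration"

def mixing_weights(success_rate: float) -> dict:
    """Hypothetical rule: the less often the model succeeds alone,
    the larger the share of expert-derived experience."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    expert_share = 1.0 - success_rate
    return {
        Experience.EXPERT_PREFIX: 0.5 * expert_share,   # assumed even split
        Experience.EXPERT_GUIDED: 0.5 * expert_share,   # assumed even split
        Experience.SELF_EXPLORATION: success_rate,
    }

def sample_experience(success_rate: float, rng: random.Random) -> Experience:
    """Draw one experience type according to the hypothetical weights."""
    weights = mixing_weights(success_rate)
    kinds = list(weights)
    return rng.choices(kinds, weights=[weights[k] for k in kinds], k=1)[0]

rng = random.Random(0)
print(sample_experience(0.2, rng))  # a task the model rarely solves alone
print(sample_experience(0.9, rng))  # a task the model usually solves alone
```

Under this assumed rule, a task the model always solves receives only self-exploration, and a task it never solves receives only expert-derived experience.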
Methodology
E3-TIR performs diverse branching exploration around expert “anchors” and applies a mix policy optimization mechanism. Together these mitigate distribution shift and resolve the optimization conflicts that arise from shared prefixes, allowing for a more adaptable training process. The dynamic adjustment of the model’s knowledge boundaries ensures a balance between exploration diversity and training efficiency.
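Branching exploration around expert anchors can be sketched as follows: cut an expert trajectory at chosen anchor positions, keep the prefix, and let the policy produce several different continuations from each cut. Everything here is an illustrative assumption, including the `Branch` record, the `toy_policy` stand-in for a model rollout, and the choice to store `prefix_len` so that a trainer could treat shared-prefix steps differently from continuation steps. The source does not describe the actual mix policy optimization.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

Step = str
Policy = Callable[[Sequence[Step], random.Random], List[Step]]

@dataclass
class Branch:
    prefix_len: int    # number of leading steps copied from the expert
    steps: List[Step]  # expert prefix followed by the policy's continuation

def branch_from_anchors(expert_steps: Sequence[Step], anchors: Sequence[int],
                        branches_per_anchor: int, policy: Policy,
                        rng: random.Random) -> List[Branch]:
    """Roll out several continuations from each expert prefix (anchor)."""
    branches = []
    for anchor in anchors:
        if not 0 <= anchor <= len(expert_steps):
            raise ValueError(f"anchor {anchor} is outside the expert trajectory")
        prefix = list(expert_steps[:anchor])
        for _ in range(branches_per_anchor):
            continuation = policy(prefix, rng)
            branches.append(Branch(prefix_len=anchor, steps=prefix + continuation))
    return branches

def toy_policy(prefix: Sequence[Step], rng: random.Random) -> List[Step]:
    """Stand-in for a model rollout: one random tool call, then an answer."""
    tool = rng.choice(["search", "calculator", "python"])
    return [f"call:{tool}", "answer"]

expert = ["plan", "call:search", "read", "answer"]
branches = branch_from_anchors(expert, anchors=[1, 3], branches_per_anchor=2,
                               policy=toy_policy, rng=random.Random(0))
for b in branches:
    print(b.prefix_len, b.steps)
```

Branches that start from the same anchor share identical leading steps, which is the situation in which the shared-prefix conflicts mentioned above would arise.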
Experimental Results
Our experimental results highlight the effectiveness of E3-TIR in comparison to traditional paradigms. Key findings include:
- A 6% performance improvement over traditional training methodologies on tool-use tasks.
- Effective training with less than 10% of the synthetic data.
- A 1.46x gain in ROI, a comprehensive metric that integrates performance, data cost, and training efficiency.
Conclusion
E3-TIR addresses the main weaknesses of existing training methods for Tool-Integrated Reasoning. By combining expert knowledge with self-exploration, it improves performance while reducing data cost. The code is available at https://github.com/yuki-younai/E3-TIR.
