ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation
arXiv:2603.29902v1
Abstract
Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries.
Introduction
Advances in MLLMs have made it possible to generate responses that combine text and images. Existing approaches, however, usually treat image generation and image retrieval as separate paths, so a single response cannot draw on both the factuality of retrieval and the creativity of generation. ATP-Bench addresses this gap by evaluating how well a model plans an interleaved response: when to invoke a tool, where to place its output, and which tool to use.
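As an illustration of what such a plan might look like, the sketch below represents an interleaved response as an ordered list of text segments and tool calls. This is not the format used by ATP-Bench; the class names, the tool names `retrieve_image` and `generate_image`, and the placeholder rendering are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ToolCall:
    tool: str      # hypothetical tool name, e.g. "retrieve_image" or "generate_image"
    argument: str  # search query or generation prompt passed to the tool


# An interleaved plan is an ordered sequence: position encodes "where",
# presence of a ToolCall encodes "when", and ToolCall.tool encodes "which".
Plan = List[Union[TextSegment, ToolCall]]


def render_plan(plan: Plan) -> str:
    """Render a plan as text, with one placeholder line per tool call."""
    lines = []
    for segment in plan:
        if isinstance(segment, TextSegment):
            lines.append(segment.text)
        else:
            lines.append(f"[{segment.tool}: {segment.argument}]")
    return "\n".join(lines)


def tools_used(plan: Plan) -> List[str]:
    """List the tools a plan invokes, in order of appearance."""
    return [s.tool for s in plan if isinstance(s, ToolCall)]


# Hypothetical plan: retrieval for a factual image, generation for a creative one.
example_plan: Plan = [
    TextSegment("The Eiffel Tower was completed in 1889."),
    ToolCall("retrieve_image", "Eiffel Tower photograph"),
    TextSegment("A stylized poster of it could look like this:"),
    ToolCall("generate_image", "art deco poster of the Eiffel Tower at dusk"),
]

if __name__ == "__main__":
    print(render_plan(example_plan))
```

Keeping the plan separate from tool execution means the planning decisions can be inspected without running any image backend.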
ATP-Bench Overview
To systematically evaluate the Agentic Tool Planning paradigm, we introduce ATP-Bench, a novel benchmark comprising:
- 7,702 QA pairs
- 1,592 Visual Question Answering (VQA) pairs
- Eight categories
- 25 visual-critical intents
The queries and ground truths are human-verified, which supports reliable evaluation.
Multi-Agent MLLM-as-a-Judge (MAM) System
In addition to ATP-Bench, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system to evaluate agentic planning independent of end-to-end execution and varying tool backends. The MAM system allows for:
- Tool-call precision evaluation
- Identification of missed opportunities for tool use
- Assessment of overall response quality without requiring ground-truth references
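The first two criteria above can be summarized numerically once judge verdicts are available. The sketch below is a minimal illustration, not the MAM scoring procedure: the `JudgeReport` structure, the function names, and the definition of the missed-opportunity rate are assumptions, since the source does not give the formulas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeReport:
    # One entry per tool call in the response; True if a judge deemed it appropriate.
    call_verdicts: List[bool]
    # Number of places where a judge thinks a tool call was needed but absent.
    missed_opportunities: int


def tool_call_precision(report: JudgeReport) -> float:
    """Fraction of the model's tool calls that were judged appropriate."""
    if not report.call_verdicts:
        return 0.0
    return sum(report.call_verdicts) / len(report.call_verdicts)


def missed_opportunity_rate(report: JudgeReport) -> float:
    """Assumed definition: missed calls divided by all calls that should have been made."""
    appropriate = sum(report.call_verdicts)
    needed = appropriate + report.missed_opportunities
    if needed == 0:
        return 0.0
    return report.missed_opportunities / needed


# Hypothetical report: four tool calls, three judged appropriate, one opportunity missed.
report = JudgeReport(call_verdicts=[True, True, False, True], missed_opportunities=1)

if __name__ == "__main__":
    print(tool_call_precision(report))
    print(missed_opportunity_rate(report))
```

Both quantities depend only on the judges' verdicts about the plan, so they can be computed without executing any tool or comparing against a reference answer.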
Experimental Results
Experiments on 10 state-of-the-art MLLMs show that the models struggle with coherent interleaved planning and differ significantly in their tool-use behavior, indicating substantial room for improvement. These findings provide actionable insights for advancing interleaved generation.
Conclusion
ATP-Bench and the MAM system provide a way to evaluate agentic tool planning in MLLMs. Interleaving text with images effectively can improve response quality and make complex information easier to follow. The dataset and code are available at https://github.com/Qwen-Applications/ATP-Bench.
