ATP-Bench: Benchmark for Agentic Tool Planning in MLLMs

ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

arXiv:2603.29902v1

Abstract

Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries.

Introduction

Advances in MLLMs have made it possible to generate responses that mix text and images. However, existing approaches usually keep image generation and image retrieval as separate pipelines, so a single response cannot draw on each where it fits best. ATP-Bench aims to bridge this gap by evaluating how well a model plans interleaved generation: deciding when, where, and which tools to invoke.
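To make the idea of "when, where, and which tools" concrete, here is a minimal sketch of how an interleaved plan could be represented and checked. The Segment class, the two tool names, and the helper functions are all hypothetical and chosen for illustration; the source does not specify the benchmark's actual plan format or tool set.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tool names, chosen for illustration only.
TOOLS = {"generate_image", "retrieve_image"}


@dataclass
class Segment:
    """One step of an interleaved plan: text, optionally followed by a tool call."""
    text: str
    tool: Optional[str] = None        # which tool to invoke, or None for text only
    tool_input: Optional[str] = None  # prompt or search query passed to the tool


def validate_plan(plan: List[Segment]) -> List[str]:
    """Return a list of problems found in the plan (empty if it is well-formed)."""
    problems = []
    for i, seg in enumerate(plan):
        if seg.tool is None:
            continue
        if seg.tool not in TOOLS:
            problems.append(f"segment {i}: unknown tool {seg.tool!r}")
        if not seg.tool_input:
            problems.append(f"segment {i}: tool call has no input")
    return problems


def render_plan(plan: List[Segment]) -> str:
    """Render the plan as text with placeholders marking where tool output lands."""
    lines = []
    for seg in plan:
        lines.append(seg.text)
        if seg.tool is not None:
            lines.append(f"[{seg.tool}: {seg.tool_input}]")
    return "\n".join(lines)


plan = [
    Segment("The Eiffel Tower was completed in 1889.",
            tool="retrieve_image", tool_input="Eiffel Tower photo"),
    Segment("Here is how it might look as a watercolor sketch.",
            tool="generate_image", tool_input="watercolor sketch of the Eiffel Tower"),
    Segment("Each image is placed right after the sentence it supports."),
]

print(validate_plan(plan))
print(render_plan(plan))
```

In this sketch, retrieval is used where factual accuracy matters and generation where a creative rendering is wanted, which mirrors the factuality-versus-creativity split described in the abstract.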

ATP-Bench Overview

To systematically evaluate the Agentic Tool Planning paradigm, we introduce ATP-Bench, a novel benchmark comprising:

7,702 QA pairs
1,592 Visual Question Answering (VQA) pairs
Eight categories
25 visual-critical intents

This dataset features human-verified queries and ground truths, ensuring reliability and accuracy in evaluation.

Multi-Agent MLLM-as-a-Judge (MAM) System

In addition to ATP-Bench, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system that evaluates agentic planning independently of end-to-end execution and of the particular tool backends used. The MAM system allows for:

Tool-call precision evaluation
Identification of missed opportunities for tool use
Assessment of overall response quality without requiring ground-truth references
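The first two criteria above can be read as a precision-style score and a miss rate. The sketch below shows one plausible way to aggregate them from per-response judge verdicts. The JudgeVerdict fields and both formulas are assumptions made for illustration; the source does not give the exact definitions MAM uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeVerdict:
    """Hypothetical per-response output of a judge agent."""
    calls_made: int            # tool calls the model planned
    calls_justified: int       # of those, calls the judge deemed appropriate
    missed_opportunities: int  # spots where a tool call was warranted but absent


def tool_call_precision(verdicts: List[JudgeVerdict]) -> float:
    """Assumed definition: justified calls divided by all calls made."""
    made = sum(v.calls_made for v in verdicts)
    justified = sum(v.calls_justified for v in verdicts)
    return justified / made if made else 0.0


def missed_opportunity_rate(verdicts: List[JudgeVerdict]) -> float:
    """Assumed definition: missed calls divided by all warranted calls."""
    justified = sum(v.calls_justified for v in verdicts)
    missed = sum(v.missed_opportunities for v in verdicts)
    warranted = justified + missed
    return missed / warranted if warranted else 0.0


verdicts = [
    JudgeVerdict(calls_made=4, calls_justified=3, missed_opportunities=1),
    JudgeVerdict(calls_made=2, calls_justified=2, missed_opportunities=1),
]

print(f"precision: {tool_call_precision(verdicts):.3f}")
print(f"missed-opportunity rate: {missed_opportunity_rate(verdicts):.3f}")
```

Because both numbers depend only on the planned calls and the judge's verdicts, they can be computed without running any tool, which is consistent with evaluating planning separately from execution.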

Experimental Results

Experiments on 10 state-of-the-art MLLMs show that these models struggle with coherent interleaved planning. Tool-use behavior also varies significantly from model to model, which indicates substantial room for improvement. These findings offer actionable insights for advancing interleaved generation.

Conclusion

ATP-Bench and the MAM system are a step toward MLLMs capable of agentic tool planning. Interleaving text and images effectively can improve response quality and give users a more natural flow of information. The dataset and code are available at https://github.com/Qwen-Applications/ATP-Bench.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
