ATP-Bench: Benchmark for Agentic Tool Planning in MLLMs

ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation

Summary: arXiv:2603.29902v1

Abstract

Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries.

Introduction

The evolution of MLLMs has opened the way to content that combines text and imagery in a single response. Existing approaches, however, usually handle image generation and image retrieval as separate pipelines, so a response can be either factually grounded or creative but rarely both. ATP-Bench aims to bridge this gap by testing how well a model plans the tool calls needed for an interleaved response.
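To make the idea of planning "when, where, and which" tools to invoke more concrete, here is a minimal sketch of an interleaved plan and a loop that resolves it. This is an illustration only, not the paper's implementation: the class names, the tool names `retrieve_image` and `generate_image`, and the stub backends are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ToolCall:
    tool: str   # which tool to invoke (hypothetical names, see stub_tools)
    query: str  # argument passed to that tool


# The position of a ToolCall in the list encodes "where" it lands in the response.
Plan = List[Union[TextSegment, ToolCall]]


def execute_plan(plan: Plan, tools: Dict[str, Callable[[str], str]]) -> str:
    """Resolve each tool call in order and return the interleaved response."""
    parts = []
    for step in plan:
        if isinstance(step, TextSegment):
            parts.append(step.text)
        else:
            parts.append(tools[step.tool](step.query))
    return "\n".join(parts)


# Stub backends that return text placeholders instead of real images.
stub_tools = {
    "retrieve_image": lambda q: f"[retrieved image: {q}]",
    "generate_image": lambda q: f"[generated image: {q}]",
}

plan = [
    TextSegment("The Eiffel Tower was completed in 1889."),
    ToolCall("retrieve_image", "Eiffel Tower photograph"),
    TextSegment("A glass version might look like this:"),
    ToolCall("generate_image", "Eiffel Tower made of glass"),
]

response = execute_plan(plan, stub_tools)
print(response)
```

In this toy plan the factual claim is paired with retrieval and the imaginative one with generation, which is the kind of choice a planning model would be expected to make on its own.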

ATP-Bench Overview

To systematically evaluate the Agentic Tool Planning paradigm, we introduce ATP-Bench, a novel benchmark comprising:

  • 7,702 QA pairs, including 1,592 Visual Question Answering (VQA) pairs
  • Eight categories
  • 25 visual-critical intents

Queries and ground truths in the dataset are human-verified, which supports reliable evaluation.

Multi-Agent MLLM-as-a-Judge (MAM) System

In addition to ATP-Bench, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system that evaluates agentic planning independently of end-to-end execution and of the particular tool backend. MAM provides:

  • Tool-call precision evaluation
  • Identification of missed opportunities for tool use
  • Assessment of overall response quality without requiring ground-truth references
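One way to turn judge verdicts into the first two quantities listed above is sketched below. The formulas and the `JudgedResponse` record are assumptions made for illustration; the summary does not state how the paper defines or aggregates these scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedResponse:
    # Hypothetical record of judge output for one model response.
    call_verdicts: List[bool]   # one entry per tool call; True if judged appropriate
    missed_opportunities: int   # spots where a judge wanted a tool call but found none


def tool_call_precision(responses: List[JudgedResponse]) -> float:
    """Share of issued tool calls that the judges accepted."""
    total = sum(len(r.call_verdicts) for r in responses)
    accepted = sum(sum(r.call_verdicts) for r in responses)
    return accepted / total if total else 0.0


def missed_opportunity_rate(responses: List[JudgedResponse]) -> float:
    """Share of needed tool calls (accepted + missed) that were never issued."""
    accepted = sum(sum(r.call_verdicts) for r in responses)
    missed = sum(r.missed_opportunities for r in responses)
    needed = accepted + missed
    return missed / needed if needed else 0.0


judged = [
    JudgedResponse(call_verdicts=[True, True, False], missed_opportunities=1),
    JudgedResponse(call_verdicts=[True], missed_opportunities=0),
]

precision = tool_call_precision(judged)
missed_rate = missed_opportunity_rate(judged)
print(precision, missed_rate)
```

Neither function looks at the images a tool would return, which mirrors the stated goal of scoring the plan separately from execution and tool backends.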

Experimental Results

Experiments on 10 state-of-the-art MLLMs show that these models struggle with coherent interleaved planning. Tool-use behavior also varies significantly from model to model, which indicates substantial room for improvement. These findings offer actionable guidance for advancing interleaved generation.
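A simple way to look at differences in tool-use behavior is to compare, per model, what fraction of its calls went to each tool. The sketch below assumes call logs are available as lists of tool names; the model names, tool names, and logs are placeholders invented for the example, not measurements from the paper.

```python
from collections import Counter
from typing import Dict, List


def tool_share(calls: List[str]) -> Dict[str, float]:
    """Fraction of a model's tool calls that went to each tool."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


# Placeholder logs for two hypothetical models (not results from the paper).
logs = {
    "model_a": ["retrieve_image", "retrieve_image", "generate_image", "retrieve_image"],
    "model_b": ["generate_image", "generate_image"],
}

profiles = {model: tool_share(calls) for model, calls in logs.items()}
for model, profile in profiles.items():
    print(model, profile)
```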

Conclusion

ATP-Bench and the MAM system are a step toward MLLMs capable of agentic tool planning. Interleaving text with generated or retrieved images can improve response quality and make information flow more naturally for the user. The dataset and code are available at https://github.com/Qwen-Applications/ATP-Bench.


Author: Lazarus Omolua, https://richlyai.com/blog