A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
arXiv:2604.19689v1
Abstract
Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding.
Introduction to A-MAR
We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, so that evidence is selected for a specific reasoning step rather than for the query as a whole.
How A-MAR Works
The A-MAR pipeline can be broken down into the following parts:
- Task Decomposition: Given an artwork and a user query, A-MAR decomposes the task into a structured reasoning plan.
- Goal Specification: Each step of the plan states the goal that step must achieve.
- Evidence Requirements: Each step also states what kind of evidence it needs, so that retrieval can be targeted.
- Evidence Selection: The retrieval process is conditioned on the reasoning plan, allowing for focused evidence selection that supports step-wise, grounded explanations.
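The parts above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the A-MAR code: the `PlanStep` fields, the hand-written plan (a real system would presumably generate the plan with a model), the made-up toy passages, and the word-overlap scorer, which stands in for whatever retriever the framework actually uses. The point is only to show retrieval being issued once per plan step, keyed on that step's goal and evidence requirement.

```python
from dataclasses import dataclass
import re


@dataclass
class PlanStep:
    goal: str           # what this reasoning step should establish (hypothetical field)
    evidence_need: str  # what kind of evidence the step requires (hypothetical field)


def tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for a real text or image encoder."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, corpus: list, k: int = 1) -> list:
    """Rank passages by word overlap with the query and return the top k."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]


def plan_conditioned_retrieve(plan: list, corpus: list, k: int = 1) -> dict:
    """Issue one retrieval call per plan step, keyed on that step's needs."""
    return {
        step.goal: retrieve(step.goal + " " + step.evidence_need, corpus, k)
        for step in plan
    }


# Made-up toy passages, not real data.
corpus = [
    "The Impressionist movement is a style known for loose brushwork and natural light.",
    "The painter completed this harbour scene in the year 1872 at a port location.",
    "Oil paint is made by mixing pigment with a drying oil.",
]

# Hand-written plan standing in for the output of the decomposition step.
plan = [
    PlanStep(goal="identify the artistic style",
             evidence_need="style movement brushwork"),
    PlanStep(goal="place the work in historical context",
             evidence_need="painter year location"),
]

evidence = plan_conditioned_retrieve(plan, corpus)
for goal, passages in evidence.items():
    print(goal, "->", passages)
```

Each step retrieves a different passage, so the final explanation can cite evidence per step. A static retriever would instead run a single query for the whole question and return one undifferentiated ranking.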
Evaluation and Benchmarking
To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA, a diagnostic benchmark that features multi-step reasoning chains for diverse art-related queries. This benchmark enables a granular analysis that extends beyond simple final answer accuracy, providing insights into the reasoning processes involved in artwork understanding.
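To make the difference between final-answer accuracy and chain-level analysis concrete, here is a small sketch. The item schema (`ChainItem` and its fields) and the exact-match step metric are assumptions chosen for illustration; the source does not describe ArtCoT-QA's actual format or metrics, and the example item is invented.

```python
from dataclasses import dataclass


@dataclass
class ChainItem:
    question: str
    gold_steps: list   # reference intermediate conclusions (hypothetical field)
    gold_answer: str   # reference final answer (hypothetical field)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def score(item: ChainItem, predicted_steps: list, predicted_answer: str) -> dict:
    """Report step-level recall alongside final-answer correctness."""
    gold = {normalize(s) for s in item.gold_steps}
    pred = {normalize(s) for s in predicted_steps}
    step_recall = len(gold & pred) / len(gold) if gold else 0.0
    answer_correct = normalize(predicted_answer) == normalize(item.gold_answer)
    return {"step_recall": step_recall, "answer_correct": answer_correct}


# Invented example item, not drawn from the benchmark.
item = ChainItem(
    question="Which movement does the painting belong to?",
    gold_steps=["visible loose brushwork", "emphasis on natural light"],
    gold_answer="Impressionism",
)

result = score(item, ["Visible loose brushwork"], "impressionism")
print(result)
```

In this example the final answer is correct while only half of the reference steps are recovered, which is the kind of gap that answer accuracy alone would hide. Exact string matching is a deliberately crude proxy; a real evaluation would likely need a softer comparison of steps.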
Experimental Results
Experiments conducted on datasets such as SemArt and Artpedia demonstrate that A-MAR consistently outperforms static, non-planned retrieval methods and strong multimodal large language model (MLLM) baselines in final explanation quality. Additionally, evaluations on ArtCoT-QA further highlight A-MAR’s advantages in evidence grounding and multi-step reasoning ability.
Significance and Future Directions
These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding. A-MAR is a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. Because its explanations come with explicit reasoning steps and supporting evidence, the framework could be useful in applications such as art analysis and education.
Access to Code and Data
The code and data are publicly available at https://github.com/ShuaiWang97/A-MAR.
