CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs
Multimodal large language models (MLLMs) have advanced rapidly in recent years, yet existing evaluations do not adequately test analogical reasoning, a core component of human cognition. To address this gap, researchers have introduced CARV (Compositional Analogical Reasoning in Vision), a task designed to measure how well MLLMs handle compositional analogies in visual settings.
Understanding Analogical Reasoning
Analogical reasoning involves mapping the relationship observed in one pair of objects onto another, a skill fundamental to human thought. Existing evaluations of MLLMs often focus on single-pair analogies and neglect the harder task of composing rules drawn from multiple pairs. This omission matters, because combining rules inferred from several examples is a hallmark of higher-order intelligence.
Introducing CARV
The CARV benchmark introduces a dataset of 5,500 samples that extends analogies from a single pair to multiple pairs. Models must extract a symbolic rule from each pair and then compose those rules into a new transformation. The benchmark therefore tests whether an MLLM can integrate information from several examples, not merely whether it can copy one observed change.
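The extract-then-compose idea can be illustrated with a minimal Python sketch. It is not the benchmark's actual data format or scoring code: the attribute dictionaries and the helper names `extract_rule`, `compose_rules`, and `apply_rule` are hypothetical, and the sketch assumes each image has already been reduced to a symbolic description.

```python
from typing import Dict, Tuple

# Hypothetical symbolic description of an image: attribute name -> value.
Scene = Dict[str, str]
# Hypothetical rule: attribute name -> (value before, value after).
Rule = Dict[str, Tuple[str, str]]


def extract_rule(before: Scene, after: Scene) -> Rule:
    """Collect every shared attribute whose value differs between the two scenes."""
    return {
        attr: (before[attr], after[attr])
        for attr in before
        if attr in after and before[attr] != after[attr]
    }


def compose_rules(*rules: Rule) -> Rule:
    """Merge rules from several pairs, rejecting contradictory targets."""
    combined: Rule = {}
    for rule in rules:
        for attr, (old, new) in rule.items():
            if attr in combined and combined[attr][1] != new:
                raise ValueError(f"conflicting rules for attribute {attr!r}")
            combined[attr] = (old, new)
    return combined


def apply_rule(scene: Scene, rule: Rule) -> Scene:
    """Return a copy of the scene with each rule's target value applied."""
    result = dict(scene)
    for attr, (_, new) in rule.items():
        result[attr] = new
    return result


# Pair A changes only the color; pair B changes only the size.
rule_a = extract_rule(
    {"shape": "circle", "color": "red", "size": "small"},
    {"shape": "circle", "color": "blue", "size": "small"},
)
rule_b = extract_rule(
    {"shape": "square", "color": "green", "size": "small"},
    {"shape": "square", "color": "green", "size": "large"},
)

# The query must undergo both changes at once.
query = {"shape": "triangle", "color": "red", "size": "small"}
answer = apply_rule(query, compose_rules(rule_a, rule_b))
print(answer)
```

In this toy setting the symbolic step is trivial because the scenes are already dictionaries. For an MLLM, the hard part is producing such a description from pixels in the first place.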
Evaluation and Findings
When researchers evaluated state-of-the-art MLLMs, including Gemini-2.5 Pro, they observed a substantial performance gap. Gemini-2.5 Pro achieved only 40.4% accuracy on CARV, far below the human-level performance of 100%, a difference of 59.6 percentage points. This disparity highlights the current limitations of MLLMs in analogical reasoning.
Identifying Failure Modes
Diagnostic analyses revealed two consistent failure modes that limit model performance on CARV:
- Decomposing Visual Changes into Symbolic Rules: MLLMs struggled to effectively translate visual information into symbolic representations, which is essential for performing analogical reasoning.
- Maintaining Robustness Under Diverse or Complex Settings: The models often faltered when faced with varied or intricate scenarios, indicating a lack of adaptability in their reasoning processes.
Conclusion
CARV gives the field a more demanding test of analogical reasoning in multimodal large language models than single-pair benchmarks provide. Its results show that even the strongest evaluated model falls well short of human performance, and its diagnostics point to two concrete targets for improvement: turning visual changes into symbolic rules, and staying robust as settings become more diverse or complex. Progress on both will be important if MLLMs are to reason more like humans.
