Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Summary of arXiv:2604.19567v1 (announce type: new)

Abstract

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, that is, inferring relationships from images, remains underexplored. The classic text analogy “king” - “man” + “woman” = “queen” illustrates relational reasoning, yet replacing the text with images of a “king” and a “man” significantly reduces performance, because the task then requires commonsense knowledge and the extraction of concise concepts while ignoring irrelevant visual details.
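To make the text analogy concrete, here is a minimal sketch of embedding arithmetic. The three-dimensional vectors are hand-made toy values chosen for illustration (they are not taken from the paper or from any trained model), and the `analogy` helper is a hypothetical name.

```python
import math

# Hypothetical toy embeddings; the dimensions loosely mean (royalty, male, female).
# Real word or image embeddings are learned and have hundreds of dimensions.
EMBEDDINGS = {
    "king": (0.9, 0.9, 0.1),
    "queen": (0.9, 0.1, 0.9),
    "man": (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.2, 0.2),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word whose vector is closest to a - b + c, excluding the inputs."""
    target = tuple(
        x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


result = analogy("king", "man", "woman")
print(result)  # queen
```

With images, the inputs would be visual features rather than clean word vectors, which is where the extra difficulty described above comes from.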

This capability is particularly important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. For instance, in a kitchen setting, recognizing from images that “powder” and “cake” are related by “is made of” grounds symbolic relations in perception. This understanding enables tool substitution, task generalization, and improved semantic reasoning.

Challenges in Visual Semantic Arithmetic

Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, yet this method suffers from modality gaps and lacks systematic evaluation. The challenges faced are multifaceted:

  • Commonsense Knowledge: Models often fail to apply commonsense reasoning to visual inputs.
  • Concept Extraction: Extracting concise, relevant concepts from cluttered visual scenes is difficult.
  • Evaluation Metrics: Standardized metrics for visual semantic arithmetic are lacking, which makes systematic comparison hard.

Proposed Solutions

In this paper, we formulate two novel tasks: two-term subtraction and three-term operations. To support these tasks, we construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. Additionally, we propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO).
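The summary does not spell out the reward or the training loop, so the following is only a sketch of the general idea behind a verifiable reward combined with GRPO-style group-relative advantages. The exact-match reward, the example completions, and the function names are assumptions for illustration, not the SAri-RFT implementation. Using the population standard deviation is also an assumption, since implementations differ on this point.

```python
import statistics


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def exact_match_reward(prediction: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 for a normalized exact match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt (for example, images of "cake" and "powder") with four sampled answers.
reference = "is made of"
completions = ["is made of", "Is  made of", "is next to", "is used for"]

rewards = [exact_match_reward(c, reference) for c in completions]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat the group average receive positive advantages and the rest receive negative ones, so the policy update needs no separately learned value model. When every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal.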

Results and Implications

Our method achieves state-of-the-art results on both the IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning capabilities, this work significantly advances the ability of domestic robots to ground symbolic reasoning in perception. Potential implications for robotics include:

  • Enhanced Decision-Making: Robots can make better-informed decisions based on visual inputs.
  • Improved Tool Adaptability: Robots can learn to substitute tools based on contextual understanding.
  • Better Human-Robot Interaction: Improved understanding leads to more intuitive interactions with humans in complex environments.

Conclusion

This research highlights the importance of visual semantic arithmetic in the context of robotics and artificial intelligence. Datasets and source code are provided in the supplementary material, promoting further exploration and development in this critical area of AI research.


Lazarus Omolua