Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
arXiv:2604.19567v1
Abstract
Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, the capacity of these models for visual semantic arithmetic, that is, inferring relationships from images, remains underexplored. The classic text analogy “king” - “man” + “woman” = “queen” illustrates relational reasoning, yet replacing the text with images of “king” and “man” significantly reduces performance, because the task then requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details.
This capability is particularly important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. For instance, in a kitchen setting, recognizing from images that “powder” and “cake” are related by “is made of” grounds symbolic relations in perception. This understanding enables tool substitution, task generalization, and improved semantic reasoning.
Challenges in Visual Semantic Arithmetic
Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, yet this method suffers from modality gaps and lacks systematic evaluation. The challenges faced are multifaceted:
- Commonsense Knowledge: The ability to apply commonsense reasoning to visual data is typically lacking.
- Concept Extraction: Extracting concise and relevant concepts from complex visual scenes is difficult.
- Evaluation Metrics: There is a lack of standardized metrics for evaluating visual semantic arithmetic performance.
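To make the underlying idea concrete, the text analogy “king” - “man” + “woman” = “queen” can be sketched as plain vector arithmetic followed by a nearest-neighbor lookup. The example below is only an illustration: the two-dimensional embeddings are invented toy values (not taken from the paper or from any trained model), and the `analogy` helper is a hypothetical name. Prior image-based approaches apply the same arithmetic to image features, which is where the modality gap enters.

```python
import math

# Toy 2-D embeddings invented for illustration only:
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
EMBEDDINGS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "cake": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("king", "man", "woman"))  # prints "queen"
```

With real image features the target vector rarely lands this cleanly on a concept, which is one way to read the concept-extraction challenge listed above.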
Proposed Solutions
In this paper, we formulate two novel tasks: two-term subtraction and three-term operations. To support these tasks, we construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. Additionally, we propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO).
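As a rough sketch of how a verifiable function and GRPO fit together, the example below scores a group of sampled answers against a reference relation and converts the rewards into group-relative advantages (each reward minus the group mean, divided by the group standard deviation). The exact-match reward, the function names, and the sample strings are assumptions made for illustration; the abstract does not specify the actual reward used in SAri-RFT.

```python
from statistics import mean, pstdev


def relation_reward(prediction: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a normalized exact match, else 0.0."""

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial variants still match.
        return " ".join(text.lower().split())

    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO.

    Uses the population standard deviation (an assumption of this sketch);
    eps avoids division by zero when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of sampled answers for one "cake" / "powder" prompt.
samples = ["is made of", "is next to", "Is  made of", "is used for"]
rewards = [relation_reward(s, "is made of") for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Because the advantages are computed relative to the group, no separate value model is needed; answers that beat the group average are reinforced and the rest are discouraged.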
Results and Implications
Our method achieves state-of-the-art results on both the IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning capabilities, this work significantly advances the ability of domestic robots to ground symbolic reasoning in perception. This has several practical implications:
- Enhanced Decision-Making: Robots can make better-informed decisions based on visual inputs.
- Improved Tool Adaptability: Robots can learn to substitute tools based on contextual understanding.
- Better Human-Robot Interaction: Improved understanding leads to more intuitive interactions with humans in complex environments.
Conclusion
This research highlights the importance of visual semantic arithmetic in the context of robotics and artificial intelligence. Datasets and source code are provided in the supplementary material, promoting further exploration and development in this critical area of AI research.
