Asking like Socrates: Socrates helps VLMs understand remote sensing images
arXiv:2511.22396v2
Abstract
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence.
Introduction
Combining visual and linguistic modalities has driven significant progress in artificial intelligence, but large-scale remote sensing imagery poses challenges that current models do not handle well. The Glance Effect is a fundamental limitation in this setting: having perceived the image only once and coarsely, models draw conclusions from superficial analysis rather than from close inspection of the visual evidence.
Proposed Solution: RS-EoT
To address this limitation, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. This approach emphasizes the necessity of thorough visual exploration to support accurate reasoning in remote sensing tasks.
SocraticAgent: A Multi-Agent System
Central to our proposal is the SocraticAgent, a self-play multi-agent system. This system synthesizes reasoning traces through alternating cycles of reasoning and visual inspection. The iterative nature of SocraticAgent fosters a deeper engagement with visual data, moving beyond mere narrative to authentic reasoning.
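The alternating cycle described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `socratic_loop`, `Trace`, `toy_reasoner`, and `toy_perceiver` are hypothetical, and the two agents are stubbed with plain functions where a real system would presumably call a language model and a vision-language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical types: the reasoner returns ("ask", sub_question) or ("answer", text).
Reasoner = Callable[[str, List[Tuple[str, str]]], Tuple[str, str]]
Perceiver = Callable[[str], str]


@dataclass
class Trace:
    """A synthesized reasoning trace: sub-questions paired with visual evidence."""
    question: str
    steps: List[Tuple[str, str]] = field(default_factory=list)
    answer: Optional[str] = None


def socratic_loop(question: str, reasoner: Reasoner, perceiver: Perceiver,
                  max_rounds: int = 4) -> Trace:
    """Alternate reasoning and visual inspection until the reasoner answers."""
    trace = Trace(question)
    for _ in range(max_rounds):
        kind, text = reasoner(question, trace.steps)
        if kind == "answer":
            trace.answer = text
            break
        # The reasoner asked for more evidence: inspect the image again.
        trace.steps.append((text, perceiver(text)))
    return trace


# Toy stand-ins (assumptions for illustration only).
def toy_reasoner(question, steps):
    asked = {sub_question for sub_question, _ in steps}
    for topic in ("runway", "aircraft"):
        if topic not in asked:
            return ("ask", topic)
    return ("answer", "; ".join(evidence for _, evidence in steps))


def toy_perceiver(sub_question):
    scene = {"runway": "two runways visible", "aircraft": "three aircraft parked"}
    return scene.get(sub_question, "no evidence found")


trace = socratic_loop("Is this an airport?", toy_reasoner, toy_perceiver)
```

The `max_rounds` cap is a design choice of this sketch: it bounds the number of inspections so that a reasoner that never commits to an answer still terminates, leaving `answer` as `None`.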
Progressive Reinforcement Learning Strategy
To further enhance the capabilities of RS-EoT, we introduce a two-stage progressive reinforcement learning (RL) strategy:
- Stage One: Reinforcement Learning on fine-grained Grounding tasks to enhance RS-EoT capabilities.
- Stage Two: Reinforcement Learning on RS Visual Question Answering (VQA) tasks to generalize understanding across broader scenarios.
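The two stages can be sketched as a reward schedule. The text above does not specify the reward functions, so the choices here are assumptions made purely for illustration: box IoU as the grounding reward, case-insensitive exact match as the VQA reward, and a hypothetical `stage_for_step` helper that switches stages after a fixed number of steps.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Box, target: Box) -> float:
    """Stage One reward (assumed): overlap between predicted and target box."""
    return iou(pred, target)


def vqa_reward(pred: str, target: str) -> float:
    """Stage Two reward (assumed): 1.0 on a case-insensitive exact match."""
    return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0


def stage_for_step(step: int, stage_one_steps: int) -> str:
    """Hypothetical schedule: grounding first, then VQA."""
    return "grounding" if step < stage_one_steps else "vqa"
```

For example, `stage_for_step(0, 100)` selects the grounding stage and `stage_for_step(100, 100)` selects the VQA stage; a training loop would pick the matching reward function at each step.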
Experimental Results
Our experiments demonstrate that RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. The results indicate that the proposed methodology effectively mitigates the Glance Effect, enabling genuine evidence-grounded reasoning.
Conclusion
Integrating Socratic, question-driven reasoning into vision-language models is a promising way to improve the interpretation of remote sensing images. By tying each reasoning step to visual evidence, our approach points toward more accurate and reliable AI systems in this domain.
Availability
Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates.
