GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation
Geometric Problem Solving (GPS) remains a significant challenge for Multimodal Large Language Models (MLLMs), which must combine text interpretation with diagram analysis while reasoning iteratively about spatial relationships. Existing models have mostly treated diagrams as static images, missing a key aspect of human reasoning: the ability to manipulate geometric elements dynamically through auxiliary line construction and affine transformations.
To address this gap, researchers have introduced GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop, with the aim of improving how MLLMs solve complex geometric problems.
Key Components of GeoSketch
- Perception Module: This component abstracts diagrams into structured logic forms, facilitating a deeper understanding of geometric relationships.
- Symbolic Reasoning Module: Utilizing geometric theorems, this module determines the next deductive step required to solve the problem, enhancing logical progression.
- Sketch Action Module: This dynamic element executes operations such as drawing auxiliary lines or applying transformations, allowing the model to update diagrams interactively.
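The interplay of the three modules can be illustrated with a toy loop. The sketch below is not the GeoSketch implementation: the `Diagram` class, the two action functions, and the hard-coded `next_step` planner are all hypothetical names chosen for illustration, and the planner merely stands in for theorem-driven reasoning.

```python
from dataclasses import dataclass, field

Point = tuple[float, float]

@dataclass
class Diagram:
    """Toy logic form (assumed): named points plus segments between them."""
    points: dict[str, Point]
    segments: set[frozenset[str]] = field(default_factory=set)

def draw_auxiliary_line(d: Diagram, a: str, b: str) -> None:
    """Sketch action: connect two existing points with a new segment."""
    d.segments.add(frozenset((a, b)))

def apply_affine(d: Diagram, names, matrix, offset, suffix="'") -> None:
    """Sketch action: add the images of the chosen points under p -> M p + t."""
    (m00, m01), (m10, m11) = matrix
    tx, ty = offset
    for n in names:
        x, y = d.points[n]
        d.points[n + suffix] = (m00 * x + m01 * y + tx, m10 * x + m11 * y + ty)

def next_step(d: Diagram):
    """Stand-in for the symbolic reasoner: a fixed plan, not theorem search."""
    if frozenset(("A", "C")) not in d.segments:
        return ("line", "A", "C")  # draw the diagonal AC
    if "B'" not in d.points:
        # Half-turn about the rectangle's center (2, 1.5): p -> -p + (4, 3)
        return ("affine", ["B"], ((-1, 0), (0, -1)), (4, 3))
    return None  # nothing left to do

def run_loop(d: Diagram, max_steps: int = 10) -> list[str]:
    """Perceive the current state, pick a step, act on the diagram, repeat."""
    trace = []
    for _ in range(max_steps):
        step = next_step(d)
        if step is None:
            break
        if step[0] == "line":
            draw_auxiliary_line(d, step[1], step[2])
        else:
            apply_affine(d, *step[1:])
        trace.append(step[0])
    return trace

# Rectangle ABCD; the loop draws AC, then maps B through the center onto D.
rect = Diagram(points={"A": (0, 0), "B": (4, 0), "C": (4, 3), "D": (0, 3)})
trace = run_loop(rect)
```

In this toy run the diagram is updated in place after each action, so the next call to `next_step` sees the new segment and the new point, which is the essential property of an interactive loop as opposed to one-shot static perception.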
Training Methodology
The GeoSketch agent is trained with a two-stage pipeline. First, the model undergoes supervised fine-tuning on 2,000 symbolically curated trajectories. It is then trained with reinforcement learning, where dense symbolic rewards are used to improve the model's robustness and strategic exploration.
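The text above does not specify how the dense symbolic rewards are computed. One plausible reading, offered purely as an assumption, is that every step in a trajectory is scored by a symbolic checker rather than only the final answer. In the sketch below, `dense_symbolic_reward`, the `is_valid_step` callback, and the reward magnitudes are all hypothetical.

```python
def dense_symbolic_reward(steps, is_valid_step, solved,
                          step_bonus=0.25, step_penalty=0.25, final_bonus=1.0):
    """Return one reward per step (assumed scheme, not the paper's).

    steps:         sequence of reasoning or sketch steps in one trajectory
    is_valid_step: hypothetical symbolic checker, step -> bool
    solved:        whether the trajectory reached the correct answer
    """
    rewards = [step_bonus if is_valid_step(s) else -step_penalty for s in steps]
    if rewards and solved:
        rewards[-1] += final_bonus  # outcome reward lands on the last step
    return rewards

# Toy usage: the checker here rejects a single placeholder step.
rewards = dense_symbolic_reward(
    ["draw AC", "bogus", "conclude"],
    is_valid_step=lambda s: s != "bogus",
    solved=True,
)
```

Compared with a sparse reward given only at the end, a per-step signal of this kind tells the policy which intermediate actions were sound, which is the usual motivation for dense rewards.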
Evaluation and Benchmarking
To assess the GeoSketch framework, the authors introduce a dedicated evaluation set, the GeoSketch Benchmark. It comprises 390 high-quality geometry problems that require auxiliary constructions or affine transformations. Experiments against strong MLLM baselines show that GeoSketch significantly improves both stepwise reasoning accuracy and problem-solving success rate over static perception methods.
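The two reported quantities can be computed from per-problem records roughly as follows. This is a minimal sketch under assumed definitions (stepwise accuracy as the fraction of correct steps over all steps, success rate as the fraction of solved problems); the record layout and the `evaluate` function are hypothetical, and the numbers in the usage example are made up for illustration, not benchmark results.

```python
def evaluate(results):
    """Aggregate per-problem records into the two metrics (assumed definitions).

    Each record is a dict with:
      "steps_ok": list of bools, one per reasoning step
      "solved":   bool, whether the final answer was correct
    """
    total_steps = sum(len(r["steps_ok"]) for r in results)
    correct_steps = sum(sum(r["steps_ok"]) for r in results)
    solved = sum(r["solved"] for r in results)
    return {
        "step_accuracy": correct_steps / total_steps if total_steps else 0.0,
        "success_rate": solved / len(results) if results else 0.0,
    }

# Illustrative records only.
metrics = evaluate([
    {"steps_ok": [True, True, False], "solved": False},
    {"steps_ok": [True], "solved": True},
])
```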
Conclusion
By unifying hierarchical decision-making with executable visual actions and symbolic verification, GeoSketch marks a significant advancement in multimodal reasoning. This framework transitions geometric problem-solving from a static interpretation model to a dynamic, verifiable interaction model. As a result, GeoSketch establishes a new foundation for tackling complex visuospatial problems, paving the way for future research and applications in AI-driven geometric reasoning.
