Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models
arXiv:2604.00890v1
Abstract
Geometric Problem Solving (GPS) remains central to improving mathematical reasoning in large language models because it combines diagrammatic understanding, symbolic manipulation, and logical inference. Existing work has chiefly focused on aligning diagram descriptions with text literals and then solving the problem, using neural, symbolic, or neuro-symbolic approaches. This addresses only the first two requirements, diagrammatic understanding and symbolic manipulation, and leaves logical inference underdeveloped.
In existing models, logical inference is often limited to a single chain-of-thought (CoT). To address this weakness, the paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification. It ranks the rollouts using token-level entropy as a confidence signal and aggregates their answers through a multi-stage voting and self-verification pipeline.
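The numerical verification step can be illustrated with a minimal sketch. It assumes each rollout emits a short Python snippet that stores its result in a variable named `answer`; that convention, the helper names `run_generated_code` and `numerically_verified`, and the tolerance are hypothetical choices for illustration, not details taken from the paper.

```python
import math


def run_generated_code(code: str) -> float:
    """Execute a rollout's Python snippet and return its `answer` variable.

    Assumption: the snippet assigns its result to `answer`.
    Caution: exec on untrusted model output is unsafe; a real system
    would run it in a sandboxed process with time and memory limits.
    """
    namespace = {"math": math}
    exec(code, namespace)
    return float(namespace["answer"])


def numerically_verified(claimed: float, code: str, tol: float = 1e-6) -> bool:
    """Return True if the executed code reproduces the claimed answer."""
    try:
        computed = run_generated_code(code)
    except Exception:
        # A snippet that crashes or sets no `answer` counts as unverified.
        return False
    return math.isclose(claimed, computed, rel_tol=tol, abs_tol=tol)


# Toy rollout: hypotenuse of a right triangle with legs 6 and 8.
snippet = "answer = math.hypot(6, 8)"
print(numerically_verified(10.0, snippet))  # True
print(numerically_verified(12.0, snippet))  # False
```

A rollout whose stated answer disagrees with its own executed computation can then be discarded or down-weighted before voting.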
Key Contributions of MARS-GPS
- Multiple Reasoning Rollouts: The model generates up to 16 parallel reasoning paths, which enhances the depth and breadth of logical inference.
- Numerical Verification: Each reasoning rollout is augmented with Python code execution that serves to verify numerical solutions, increasing reliability.
- Confidence Ranking: Token-level entropy serves as the confidence signal used to rank the rollouts.
- Multi-Stage Voting: Answers are aggregated through a multi-stage voting and self-verification pipeline, which is intended to improve the accuracy and consistency of results.
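The ranking and voting ideas above can be sketched as follows. This is a simplified single-stage illustration, not the paper's pipeline: it assumes each rollout exposes its per-step next-token probability distributions, scores a rollout by mean Shannon entropy (lower means more confident), and weights each vote by `exp(-entropy)`. The function names, the `keep` parameter, and the weighting scheme are assumptions made for this example.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average token-level entropy over all generation steps of a rollout."""
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)


def vote(rollouts, keep=None):
    """Pick an answer from (answer, step_distributions) pairs.

    Rollouts are ranked by mean entropy, optionally truncated to the
    `keep` most confident ones, and each vote is weighted by
    exp(-entropy) (an illustrative weighting, not the paper's).
    """
    scored = sorted(
        ((mean_entropy(dists), answer) for answer, dists in rollouts),
        key=lambda pair: pair[0],
    )
    if keep is not None:
        scored = scored[:keep]
    weights = defaultdict(float)
    for entropy, answer in scored:
        weights[answer] += math.exp(-entropy)
    return max(weights, key=weights.get)


# Toy data: two fairly confident rollouts agree, one uncertain rollout differs.
rollouts = [
    ("12.5", [[0.9, 0.1], [0.8, 0.2]]),
    ("12.5", [[0.7, 0.3], [0.6, 0.4]]),
    ("10.0", [[0.5, 0.5], [0.5, 0.5]]),
]
print(vote(rollouts))          # 12.5
print(vote(rollouts, keep=1))  # 12.5
```

Weighting by confidence means a single low-entropy rollout can outvote several high-entropy ones, which is the main behavioral difference from plain majority voting.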
Empirical Results
MARS-GPS with eight parallel rollouts achieves 88.8% accuracy on the Geometry3K benchmark, a nearly 11% improvement over the previous state of the art. Accuracy also scales consistently with the number of rollouts: going from one to sixteen rollouts yields a 6.0% improvement on the ablation subset.
Conclusion
MARS-GPS targets the logical inference gap in geometric problem solving with large language models. It does so by combining multiple reasoning rollouts with code-based numerical verification, entropy-based ranking, and multi-stage voting. The code and data are available in an anonymous repository: MARS-GPS Repository.
Future Work
Looking ahead, the authors suggest that further research should focus on:
- Expanding the range of geometric problems tackled by MARS-GPS.
- Investigating the application of the multi-chain-of-thought approach in other domains of mathematical reasoning.
- Enhancing the efficiency of the multi-stage voting mechanism to accommodate even larger datasets.
