FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Summary: arXiv:2603.26996v1
Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker.
Introduction
FormalProofBench is a new benchmark for assessing whether AI models can generate graduate-level mathematical proofs that pass formal verification. The verification is done by the Lean 4 checker, not by the model itself, so the question the benchmark poses is direct: can AI models produce graduate-level mathematical proofs that are formally verified?
Overview of FormalProofBench
FormalProofBench is a private benchmark for testing AI models on formal theorem proving. Each task has three parts:
- Natural-Language Problems: Each task begins with a problem presented in natural language, making it accessible to both AI models and human evaluators.
- Lean 4 Formal Statements: Each problem is paired with a formal statement in Lean 4, a proof assistant designed for formal verification.
- Output Requirements: The AI model’s task is to generate a Lean proof that passes the Lean 4 checker.
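To make the task format concrete, here is a toy illustration of a natural-language problem, its Lean 4 formal statement, and a proof the checker accepts. It is not taken from the benchmark, which is private, and it is far simpler than a graduate-level problem. The theorem name `add_comm_toy` is hypothetical, and the remark about rejecting `sorry` is an assumption about how such a checker would typically be configured.

```lean
-- Toy illustration only; much simpler than a graduate-level task.
-- Natural-language problem:
--   "Show that addition of natural numbers is commutative."

-- Formal statement as a task might present it, with the proof left open:
--   theorem add_comm_toy (a b : Nat) : a + b = b + a := by sorry
-- (`sorry` compiles with a warning but proves nothing, so a checker
--  would presumably reject any submission that still contains it.)

-- A submission that the Lean 4 checker accepts:
theorem add_comm_toy (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

The model never gets to restate the theorem: the formal statement is fixed, and only the proof term or tactic block after `:=` is the model's output.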
Problem Sources and Topics
The problems are pitched at the advanced undergraduate and graduate level. They are drawn from:
- Qualifying exams
- Standard textbooks
They cover topics such as analysis, algebra, probability, and logic, so a model must work across several distinct areas of mathematics instead of specializing in one.
Performance Evaluation
The authors evaluated a range of frontier models using an agentic harness. The headline results:
- The best-performing foundation model achieved an accuracy of 33.5%.
- Accuracy dropped off rapidly for the models ranked below it, which underlines how difficult the task remains.
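The accuracy figure above is the fraction of tasks for which a model's proof is accepted by the checker. The following is a minimal sketch of how an agentic harness might compute it, assuming the model may retry after seeing checker errors. The names `Task`, `solve`, `accuracy`, the `propose` and `check` callables, and the `max_attempts` parameter are all hypothetical; the source does not describe the actual harness in this detail.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Task:
    problem: str    # natural-language problem text
    statement: str  # Lean 4 formal statement to be proved


# Hypothetical interfaces (assumptions, not the paper's API):
# propose(task, feedback) returns a candidate Lean proof as text;
# check(task, proof) returns None if accepted, else an error message.
ProposeFn = Callable[[Task, Optional[str]], str]
CheckFn = Callable[[Task, str], Optional[str]]


def solve(task: Task, propose: ProposeFn, check: CheckFn,
          max_attempts: int = 3) -> bool:
    """Let the model retry, feeding back the last checker error each time."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        proof = propose(task, feedback)
        feedback = check(task, proof)
        if feedback is None:  # checker accepted the proof
            return True
    return False


def accuracy(tasks: Sequence[Task], propose: ProposeFn, check: CheckFn,
             max_attempts: int = 3) -> float:
    """Fraction of tasks solved within the attempt budget."""
    if not tasks:
        return 0.0
    solved = sum(solve(t, propose, check, max_attempts) for t in tasks)
    return solved / len(tasks)
```

Under this definition, a score of 33.5% means roughly one task in three ends with a proof the checker accepts; partial or nearly correct proofs earn no credit.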
Analysis of Results
Beyond the accuracy metrics, the researchers provided a detailed empirical analysis encompassing:
- Tool Use: Evaluating how effectively models utilize available tools to construct proofs.
- Failure Modes: Identifying common pitfalls and areas where models struggle.
- Cost and Latency: Assessing the computational resources required and the time taken to generate proofs.
Together, these analyses describe the formal theorem-proving capabilities of frontier AI models in more detail than accuracy alone and point to where improvement is needed.
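The analysis dimensions listed above can be illustrated with a small aggregation over per-task run records. This is a sketch under stated assumptions: the `RunRecord` fields, the `summarize` function, and the failure-mode labels are hypothetical and are not the categories or logging format used by the authors.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class RunRecord:
    solved: bool                        # did the checker accept a proof?
    tool_calls: int                     # number of tool invocations in the run
    cost_usd: float                     # assumed cost unit: US dollars
    latency_s: float                    # wall-clock time in seconds
    failure_mode: Optional[str] = None  # hypothetical label; None if solved


def summarize(records: Sequence[RunRecord]) -> dict:
    """Aggregate accuracy, tool use, cost, latency, and failure counts."""
    if not records:
        raise ValueError("no records to summarize")
    failures = Counter(
        r.failure_mode for r in records
        if not r.solved and r.failure_mode is not None
    )
    return {
        "accuracy": sum(r.solved for r in records) / len(records),
        "mean_tool_calls": mean(r.tool_calls for r in records),
        "mean_cost_usd": mean(r.cost_usd for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "failure_modes": dict(failures),
    }
```

Reporting these figures side by side makes trade-offs visible, for example whether a model with higher accuracy also needs more tool calls or more time per task.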
Conclusion
FormalProofBench evaluates whether AI models can generate formally verified proofs at the graduate level. With the best-performing model reaching 33.5% accuracy, the benchmark documents the current limitations of AI models in formal theorem proving and provides a baseline against which future progress in AI mathematical reasoning can be measured.
