FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Researchers have introduced FormalRewardBench, a benchmark for evaluating the reward models used in neural theorem provers. It was announced in a recent arXiv paper (arXiv:2605.10141v1) and targets a known weakness of reinforcement learning with verifiable rewards (RLVR).

Current theorem-proving methods often rely on binary correctness signals from proof assistants. These verifiable rewards are cheap and scalable, but they suffer from sparse credit assignment: on harder problems, a model receives no feedback for partial progress, which limits what it can learn. Learned reward models have emerged as a promising alternative because they can evaluate proof quality beyond binary verification.
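The sparsity problem can be made concrete with a minimal sketch. Everything here is illustrative: the `Attempt` record, its `goals_closed` and `goals_total` fields, and the `partial_progress_reward` function are hypothetical stand-ins, not anything defined in the paper. The point is only that a binary verifier signal cannot tell a near-miss from a complete failure.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One proof attempt. The goal counters are hypothetical, for illustration only."""
    name: str
    goals_closed: int
    goals_total: int


def binary_reward(attempt: Attempt) -> float:
    """Verifier-style reward: 1.0 only when every goal is closed, else 0.0."""
    return 1.0 if attempt.goals_closed == attempt.goals_total else 0.0


def partial_progress_reward(attempt: Attempt) -> float:
    """Illustrative dense signal: fraction of goals closed.

    This is an assumed stand-in for a learned reward model, not the paper's method.
    """
    return attempt.goals_closed / attempt.goals_total


attempts = [
    Attempt("far_off", 0, 4),
    Attempt("almost", 3, 4),
    Attempt("complete", 4, 4),
]

# The binary signal gives "far_off" and "almost" the same reward of 0.0.
binary = [binary_reward(a) for a in attempts]

# The dense signal separates them, which is the gap learned reward models aim to fill.
dense = [partial_progress_reward(a) for a in attempts]
```

Under the binary reward, the first two attempts are indistinguishable, so a policy gets no gradient toward the better one.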

Introducing FormalRewardBench

Comparing reward models is difficult because it often requires costly RL training ablations. FormalRewardBench is meant to remove that obstacle. According to the researchers, it is the first benchmark designed specifically for assessing reward models in formal theorem proving with Lean 4.

Benchmark Composition

FormalRewardBench comprises 250 preference pairs, each pairing a correct proof with an incorrect variant. The incorrect variants are generated through five expert-curated error injection strategies:

Forced Mistakes: Deliberately introduced errors in the proof.
Minimal Single-Point Variations: Slight alterations to the correct proof.
Verbose Incorrect Proofs: Lengthy but incorrect representations of the proof.
Natural Language Justification: Incorrect interpretations explained in natural language.
Python Code Injection: Incorporating erroneous Python code into the proof context.

Together, these strategies test how robust a reward model is across different kinds of flawed proofs, from subtle single-point errors to distracting off-topic content.
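One way to picture a single benchmark entry is as a small record holding the theorem statement, the correct proof, the incorrect variant, and the strategy that produced it. The sketch below is an assumption about shape only: the `PreferencePair` and `ErrorStrategy` names and all field names are hypothetical, and the released dataset may be organized differently. The Lean snippet is a hand-written example of a single-point variation, not an item from the benchmark.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorStrategy(Enum):
    """The five error injection strategies named in the article."""
    FORCED_MISTAKE = "forced_mistake"
    MINIMAL_SINGLE_POINT = "minimal_single_point_variation"
    VERBOSE_INCORRECT = "verbose_incorrect_proof"
    NATURAL_LANGUAGE_JUSTIFICATION = "natural_language_justification"
    PYTHON_CODE_INJECTION = "python_code_injection"


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record layout for one preference pair."""
    statement: str           # Lean 4 theorem statement
    chosen: str              # proof body expected to verify
    rejected: str            # injected-error variant expected to fail
    strategy: ErrorStrategy  # which strategy produced the rejected variant


# Hand-written illustration: swapping the arguments yields a term of type
# `b + a = a + b`, which should not match the stated goal `a + b = b + a`.
pair = PreferencePair(
    statement="theorem my_add_comm (a b : Nat) : a + b = b + a",
    chosen=":= Nat.add_comm a b",
    rejected=":= Nat.add_comm b a",
    strategy=ErrorStrategy.MINIMAL_SINGLE_POINT,
)
```

Tagging each pair with its strategy would allow results to be broken down per error type, assuming the dataset exposes that information.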

Evaluation of Models

The benchmark was used to evaluate several categories of large language models (LLMs):

Frontier LLMs: Such as Claude Opus 4.5.
Judge LLMs: Exemplified by CompassJudger-1-14B.
General-Purpose LLMs: Including Qwen2.5-72B-Instruct.
Specialized Theorem Proving Models: Notably DeepSeek-Prover-V2-7B.

Initial results indicate that frontier LLMs performed best at 59.8%, while specialized theorem provers lagged far behind at only 24.4%. This gap suggests that proficiency in theorem proving does not necessarily translate into effective proof evaluation, an area that warrants further study.
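The article does not define how these percentages are computed. A common choice for preference benchmarks is pairwise accuracy: the fraction of pairs where the model scores the correct proof above the incorrect one. The sketch below assumes that metric purely for illustration; `pairwise_accuracy`, `toy_score`, and the sample pairs are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

# (correct proof, incorrect proof); an assumed, simplified pair format.
Pair = Tuple[str, str]


def pairwise_accuracy(pairs: Sequence[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the correct proof gets a strictly higher score.

    Ties count as failures, since the model did not express a preference.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    wins = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return wins / len(pairs)


def toy_score(proof: str) -> float:
    """Placeholder scorer: penalizes `sorry` and, slightly, proof length.

    A real reward model would be an LLM or trained scorer; this is a stand-in.
    """
    return -float("sorry" in proof) - 0.001 * len(proof)


toy_pairs = [
    ("by simp", "by sorry"),                  # scorer prefers the correct proof
    ("by omega", "by simp; sorry"),           # scorer prefers the correct proof
    ("by induction n <;> simp", "by rfl"),    # scorer wrongly prefers the shorter one
]

accuracy = pairwise_accuracy(toy_pairs, toy_score)
```

The third pair shows how a superficial heuristic such as brevity can mislead a scorer, which is the kind of weakness a benchmark like this is designed to surface.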

Encouraging Future Research

By publicly releasing FormalRewardBench, the researchers aim to stimulate research on more capable reward models for formal mathematics.

In conclusion, FormalRewardBench gives the field a direct way to compare reward models without running a costly RL training ablation for each comparison, and its early results point to proof evaluation as a distinct skill that current theorem provers have yet to master.
