Draft-and-Prune: Boost Auto-Formalization Accuracy in Logic

Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Summary: arXiv:2603.17233v2

Abstract

Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are currently brittle: programs may fail to execute or execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantic failures remains a major bottleneck. We propose Draft-and-Prune (D&P), an inference-time framework that improves AF-based logical reasoning via diversity and verification. D&P first drafts multiple natural-language plans and conditions program generation on them. It further prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from surviving paths via majority voting.

Key Innovations of Draft-and-Prune

The Draft-and-Prune approach introduces several key innovations to enhance the reliability of auto-formalization:

Diversity in Drafting: D&P drafts multiple natural-language plans for each problem and conditions program generation on them, so that several distinct formalizations are explored rather than a single one.
Pruning Mechanism: Among the programs that execute, the framework discards formalizations that are contradictory or ambiguous, so that only the remaining paths contribute to the final prediction.
Majority Voting: Predictions from the surviving paths are aggregated by majority vote, which makes the final answer less sensitive to any single faulty formalization.
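
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (draft_plan, generate_program, run_solver) are hypothetical stand-ins for an LLM and a symbolic solver, and the pruning rule assumes that "contradictory" means the program accepts no answer option and "ambiguous" means it accepts more than one.

```python
from collections import Counter
from typing import Callable, FrozenSet, Optional

# Hypothetical aliases for illustration; these names do not come from the paper.
Plan = str
Program = str
# A solver run yields the set of answer options the program accepts,
# or None when the program fails to execute.
SolverResult = Optional[FrozenSet[str]]


def draft_and_prune(
    problem: str,
    draft_plan: Callable[[str, int], Plan],
    generate_program: Callable[[str, Plan], Program],
    run_solver: Callable[[Program], SolverResult],
    num_paths: int = 5,
) -> Optional[str]:
    """Return the majority answer over surviving paths, or None if every path is pruned."""
    votes: Counter = Counter()
    for i in range(num_paths):
        plan = draft_plan(problem, i)              # draft one natural-language plan per path
        program = generate_program(problem, plan)  # condition program generation on the plan
        accepted = run_solver(program)
        if accepted is None:
            continue  # program did not execute
        if len(accepted) == 0:
            continue  # assumed reading of "contradictory": no option is accepted
        if len(accepted) > 1:
            continue  # assumed reading of "ambiguous": several options are accepted
        votes[next(iter(accepted))] += 1           # surviving path casts one vote
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy stand-ins: each "program" is just the plan name, looked up in a fixed table.
def toy_plan(problem: str, i: int) -> Plan:
    return f"plan-{i}"


def toy_program(problem: str, plan: Plan) -> Program:
    return plan


TOY_RESULTS = {
    "plan-0": frozenset({"B"}),       # survives, votes for B
    "plan-1": None,                   # fails to execute
    "plan-2": frozenset(),            # pruned as contradictory
    "plan-3": frozenset({"A", "B"}),  # pruned as ambiguous
    "plan-4": frozenset({"B"}),       # survives, votes for B
}

answer = draft_and_prune("toy problem", toy_plan, toy_program, TOY_RESULTS.get)
```

In this toy run only two of the five paths survive pruning, and both vote for the same option. Passing the model and solver in as callables keeps the sketch independent of any particular LLM API or solver.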

Performance Evaluation

To assess the effectiveness of D&P, the framework was tested across four representative benchmarks: AR-LSAT, ProofWriter, PrOntoQA, and LogicalDeduction. The results indicated substantial improvements in AF-based reasoning without requiring additional supervision:

AR-LSAT: In the AF-only setting, D&P achieved 78.43% accuracy with GPT-4 and 78.00% accuracy with GPT-4o, significantly outperforming the strongest AF baselines, MAD-LOGIC and CLOVER.
PrOntoQA: D&P attained 100% accuracy.
LogicalDeduction: D&P also reached 100% accuracy.

Conclusion

Draft-and-Prune targets the main remaining bottleneck in auto-formalization for logical reasoning: programs that execute but encode incorrect semantics. Whereas prior work largely repairs syntactic failures using solver feedback, D&P works at inference time, combining diverse natural-language plans, pruning of contradictory or ambiguous formalizations, and majority voting, without requiring additional supervision. The reported benchmark results suggest that this combination makes the translation of natural-language reasoning problems into solver-executable programs more reliable, which matters as more AI applications depend on sound logical deduction.
