Draft-and-Prune: Boost Auto-Formalization Accuracy in Logic

Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Summary: arXiv:2603.17233v2

Abstract

Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are currently brittle: programs may fail to execute or execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantic failures remains a major bottleneck. We propose Draft-and-Prune (D&P), an inference-time framework that improves AF-based logical reasoning via diversity and verification. D&P first drafts multiple natural-language plans and conditions program generation on them. It further prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from surviving paths via majority voting.
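To make "executable but contradictory or ambiguous" concrete, here is a minimal Python sketch on a toy ordering puzzle. It is an illustration only, not the paper's solver setup: the helper names `before` and `classify`, the brute-force enumeration, and the rule that a formalization is kept only when exactly one answer option holds in every satisfying ordering are all assumptions made for this example.

```python
from itertools import permutations


def before(x, y):
    """Hypothetical helper: constraint that x appears before y in an ordering."""
    return lambda order: order.index(x) < order.index(y)


def classify(constraints, options, items):
    """Label a toy formalization as contradictory, ambiguous, or valid.

    Assumption for this sketch: an option is 'supported' when it holds in
    every ordering that satisfies all constraints.
    """
    # Brute-force every ordering and keep those satisfying all constraints.
    models = [p for p in permutations(items) if all(c(p) for c in constraints)]
    if not models:
        return "contradictory", []
    supported = [name for name, check in options.items()
                 if all(check(m) for m in models)]
    if len(supported) != 1:
        return "ambiguous", supported
    return "valid", supported


options = {
    "A first": lambda order: order[0] == "A",
    "C first": lambda order: order[0] == "C",
}

# A runs fine and pins down a single answer.
label, supported = classify([before("A", "B"), before("B", "C")], options, "ABC")
print(label, supported)

# Also executes, but the constraints cannot all hold at once.
print(classify([before("A", "B"), before("B", "A")], options, "ABC"))

# Also executes, but is too weak to single out one option.
print(classify([before("A", "B")], options, "ABC"))
```

All three toy programs run without error, which is why execution alone cannot separate a usable formalization from a semantically faulty one.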

Key Innovations of Draft-and-Prune

The Draft-and-Prune approach introduces several key innovations to enhance the reliability of auto-formalization:

  • Diversity in Drafting: D&P drafts multiple natural-language plans for the same problem and conditions program generation on each plan, so that several candidate formalizations are explored rather than one.
  • Pruning Mechanism: The framework discards formalizations that execute but are contradictory or ambiguous, so that only paths passing this check contribute to the final answer.
  • Majority Voting: Predictions from the surviving paths are aggregated by majority voting, which reduces the influence of any single faulty formalization.
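The three steps above can be sketched in a few lines of Python. This is a hedged outline under stated assumptions, not the authors' implementation: `draft_plans`, `formalize`, and `solve` are hypothetical callables standing in for the language model and the symbolic solver, and the pruning rule (keep a path only when the solver accepts exactly one answer) is one plausible reading of "contradictory or ambiguous".

```python
from collections import Counter
from typing import Callable, List, Optional


def draft_and_prune(
    problem: str,
    draft_plans: Callable[[str], List[str]],   # hypothetical: LLM drafts plans
    formalize: Callable[[str, str], str],      # hypothetical: plan -> program
    solve: Callable[[str], List[str]],         # hypothetical: accepted answers
) -> Optional[str]:
    """Draft several plans, prune unusable paths, and vote over the rest."""
    surviving = []
    for plan in draft_plans(problem):
        program = formalize(problem, plan)
        try:
            accepted = solve(program)
        except Exception:
            continue  # program failed to execute: drop this path
        # Assumed pruning rule: no answer = contradictory, several = ambiguous.
        if len(accepted) == 1:
            surviving.append(accepted[0])
    if not surviving:
        return None  # every path was pruned
    # Majority vote over the surviving paths.
    return Counter(surviving).most_common(1)[0][0]


# Stub components so the sketch runs without a model or a solver.
DEMO_SOLVER_OUTPUT = {
    "p1": ["B"],
    "p2": ["B"],
    "p3": [],          # contradictory
    "p4": ["A", "B"],  # ambiguous
    "p5": ["C"],
}


def demo_plans(problem):
    return ["p1", "p2", "p3", "p4", "p5", "p6"]


def demo_formalize(problem, plan):
    return plan  # the stub treats the plan name as the program


def demo_solve(program):
    if program not in DEMO_SOLVER_OUTPUT:
        raise RuntimeError("program does not execute")
    return DEMO_SOLVER_OUTPUT[program]


prediction = draft_and_prune("toy problem", demo_plans, demo_formalize, demo_solve)
print(prediction)
```

In this stubbed run, three of the six paths survive pruning and two of them agree, so the vote returns their shared answer. How ties are broken is left unspecified here; a real pipeline would need an explicit rule.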

Performance Evaluation

To assess the effectiveness of D&P, the framework was tested across four representative benchmarks: AR-LSAT, ProofWriter, PrOntoQA, and LogicalDeduction. The results indicated substantial improvements in AF-based reasoning without requiring additional supervision:

  • AR-LSAT: In the AF-only setting, D&P reached 78.43% accuracy with GPT-4 and 78.00% with GPT-4o, outperforming the strongest AF baselines, MAD-LOGIC and CLOVER.
  • PrOntoQA: D&P reached 100% accuracy.
  • LogicalDeduction: D&P reached 100% accuracy.

Conclusion

Draft-and-Prune targets the part of auto-formalization that solver-feedback repair leaves largely unaddressed: programs that execute but encode the wrong semantics. By drafting diverse plans, pruning contradictory or ambiguous formalizations, and voting over the surviving paths, D&P makes the translation of natural-language reasoning problems into executable programs more reliable, and it does so at inference time without additional supervision. The reported results suggest that diversity combined with verification is a practical route to more dependable solver-backed logical reasoning.


Lazarus Omolua