Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning
Summary: arXiv:2603.17233v2
Abstract
Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are currently brittle: programs may fail to execute or execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantic failures remains a major bottleneck. We propose Draft-and-Prune (D&P), an inference-time framework that improves AF-based logical reasoning via diversity and verification. D&P first drafts multiple natural-language plans and conditions program generation on them. It further prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from surviving paths via majority voting.
Key Innovations of Draft-and-Prune
Draft-and-Prune combines three inference-time components to make auto-formalization more reliable:
- Diversity in Drafting: D&P first drafts multiple natural-language plans and conditions program generation on each one, so that a single flawed reading of the problem does not determine the final formalization.
- Pruning Mechanism: Among the programs that execute, the framework discards formalizations that are contradictory or ambiguous, so that only paths passing this check contribute to the answer.
- Majority Voting: Predictions from the surviving paths are aggregated by majority vote, so that an isolated incorrect formalization is outvoted by the others.
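The prune-and-vote steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the brute-force `solve` function stands in for a real symbolic solver, the hand-written candidate lists stand in for programs an LLM would generate from different plans, and the helper names (`left_of`, `rightmost`, `prune_and_vote`) are hypothetical. It also assumes one plausible reading of the pruning criteria: a formalization is "contradictory" if its constraints admit no model, and "ambiguous" if it does not entail exactly one answer option.

```python
from collections import Counter
from itertools import permutations

def left_of(x, y):
    # Constraint: x sits to the left of y (smaller position index).
    return lambda pos: pos[x] < pos[y]

def rightmost(x):
    # Answer option: x occupies the last position.
    return lambda pos: pos[x] == len(pos) - 1

def solve(items, constraints, options):
    """Toy stand-in for a symbolic solver (assumption, not a real solver API).

    Enumerates every ordering of `items`, keeps those satisfying all
    constraints, and returns the indices of options true in every model.
    Returns None when no model exists.
    """
    models = []
    for perm in permutations(items):
        pos = {name: i for i, name in enumerate(perm)}
        if all(c(pos) for c in constraints):
            models.append(pos)
    if not models:
        return None
    return [i for i, opt in enumerate(options) if all(opt(m) for m in models)]

def prune_and_vote(items, candidates, options):
    """Drop contradictory or ambiguous candidates, then majority-vote."""
    votes = Counter()
    for constraints in candidates:
        entailed = solve(items, constraints, options)
        if entailed is None:      # contradictory: no model satisfies it
            continue
        if len(entailed) != 1:    # ambiguous: zero or several options entailed
            continue
        votes[entailed[0]] += 1
    if not votes:
        return None               # every candidate was pruned
    return votes.most_common(1)[0][0]

# Toy problem: "A is left of B. B is left of C. Which item is rightmost?"
items = ["A", "B", "C"]
options = [rightmost("A"), rightmost("B"), rightmost("C")]

# Hand-written stand-ins for formalizations produced along different plans.
candidates = [
    [left_of("A", "B"), left_of("B", "C")],                     # faithful
    [left_of("A", "B"), left_of("B", "C"), left_of("A", "C")],  # faithful, redundant
    [left_of("A", "B"), left_of("B", "A")],                     # contradictory
    [left_of("A", "B")],                                        # ambiguous (missing premise)
    [left_of("B", "A"), left_of("C", "B")],                     # executable but wrong
]

answer = prune_and_vote(items, candidates, options)
print(answer)  # prints 2, the index of the option "C is rightmost"
```

In this toy run the contradictory and ambiguous candidates are pruned, the wrong-but-consistent candidate casts one vote for option 0, and the two faithful candidates cast two votes for option 2. Pruning alone cannot catch a formalization that is wrong yet consistent and unambiguous, which is why the vote over several paths is still needed.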
Performance Evaluation
D&P was evaluated on four representative benchmarks: AR-LSAT, ProofWriter, PrOntoQA, and LogicalDeduction. The reported results show substantial improvements in AF-based reasoning without additional supervision, including:
- AR-LSAT: In the AF-only setting, D&P achieved 78.43% accuracy with GPT-4 and 78.00% accuracy with GPT-4o, outperforming the strongest AF baselines, MAD-LOGIC and CLOVER.
- PrOntoQA: D&P reached 100% accuracy.
- LogicalDeduction: D&P also reached 100% accuracy.
Conclusion
Draft-and-Prune targets the semantic failures that solver-feedback repair, which mainly fixes syntactic errors, leaves largely unaddressed. By drafting diverse plans, pruning contradictory or ambiguous formalizations, and voting over the surviving paths, D&P makes the translation of natural-language reasoning problems into solver-executable programs more reliable, and it does so at inference time without additional supervision. These results suggest that diversity and verification are a practical route to more dependable AF-based logical reasoning.
