Evaluating LLM-Generated ACSL Annotations for Formal Verification
arXiv:2602.13851v2
Abstract: Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance.
Introduction
The demand for high-quality software systems has never been greater, particularly in safety-critical domains such as healthcare, finance, and transportation. As software complexity continues to rise, so does the need for rigorous methods to ensure its reliability. Formal specifications serve as a foundation for building verifiable software systems, yet the task of generating these specifications, especially for real-world C programs, presents significant challenges.
Methodology
This paper presents a controlled study evaluating how well five systems generate ACSL (ANSI/ISO C Specification Language) annotations. We used a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting.
Five ACSL generation systems were analyzed:
- A rule-based Python script
- Frama-C’s RTE plugin
- DeepSeek-V3.2, a large language model
- GPT-5.2, a large language model
- OLMo 3.1 32B Instruct, a large language model
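To make the target of these systems concrete, the following is a minimal sketch of the kind of ACSL annotation they are asked to produce. It is not taken from the dataset: the function `max_of` and its contract are hypothetical, and the `rte:` assertion only illustrates the general form of the guards that Frama-C's RTE plugin emits for memory accesses.

```c
#include <stddef.h>

/* Hypothetical example, not from the study's dataset.
 * The contract states what callers must guarantee (requires) and
 * what the function promises in return (assigns, ensures). */
/*@ requires n > 0;
  @ requires \valid_read(a + (0 .. n - 1));
  @ assigns \nothing;
  @ ensures \forall integer k; 0 <= k < n ==> \result >= a[k];
  @ ensures \exists integer k; 0 <= k < n && \result == a[k];
  @*/
int max_of(const int *a, size_t n)
{
    int best = a[0];

    /* Loop annotations: a deductive verifier needs an invariant that
     * summarizes the state at every iteration, plus a variant that
     * decreases so termination can be shown. */
    /*@ loop invariant 1 <= i <= n;
      @ loop invariant \forall integer k; 0 <= k < i ==> best >= a[k];
      @ loop invariant \exists integer k; 0 <= k < i && best == a[k];
      @ loop assigns i, best;
      @ loop variant n - i;
      @*/
    for (size_t i = 1; i < n; i++) {
        /* Illustrative runtime-error guard in the style of the RTE plugin. */
        //@ assert rte: mem_access: \valid_read(a + i);
        if (a[i] > best)
            best = a[i];
    }
    return best;
}
```

Because ACSL lives in specially marked comments, the file still compiles with an ordinary C compiler; only the verifier reads the annotations. A generator that omits the loop invariants, or writes ones that are too weak, typically leaves the postconditions unprovable even when the code itself is correct.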
Evaluation Process
All generated ACSL specifications were verified under controlled conditions with the Frama-C WP plugin, which discharges its proof obligations through multiple SMT (Satisfiability Modulo Theories) solvers. This setup allowed a direct comparison of three factors:
- Annotation Quality: Assessing the correctness and completeness of generated specifications.
- Solver Sensitivity: Evaluating how different solvers reacted to the generated annotations.
- Proof Stability: Analyzing the consistency of verification results across multiple runs.
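The last two factors can be read as simple checks over recorded verification outcomes. The sketch below is one possible formulation, not the paper's metric definitions: the `outcome` codes and the functions `is_stable` and `count_proved` are hypothetical names introduced here for illustration. The invocation shown in the leading comment uses real WP options, but the solver list, timeout, and file name are assumptions rather than the study's configuration.

```c
#include <stddef.h>

/* Assumed invocation (not the paper's stated setup):
 *   frama-c -wp -wp-rte -wp-prover alt-ergo,z3 -wp-timeout 10 file.c
 */

/* Hypothetical outcome codes for one proof goal in one WP run. */
enum outcome { PROVED = 0, UNKNOWN = 1, TIMEOUT = 2 };

/* Proof stability: 1 if every repeated run gave the same outcome. */
/*@ requires runs > 0;
  @ requires \valid_read(results + (0 .. runs - 1));
  @ assigns \nothing;
  @ ensures \result == 0 || \result == 1;
  @ ensures \result == 1 <==>
  @   (\forall integer k; 0 <= k < runs ==> results[k] == results[0]);
  @*/
int is_stable(const enum outcome *results, size_t runs)
{
    /*@ loop invariant 1 <= i <= runs;
      @ loop invariant \forall integer k; 0 <= k < i ==> results[k] == results[0];
      @ loop assigns i;
      @ loop variant runs - i;
      @*/
    for (size_t i = 1; i < runs; i++) {
        if (results[i] != results[0])
            return 0;
    }
    return 1;
}

/* Solver sensitivity: how many of n solvers proved the same goal. */
/*@ requires \valid_read(results + (0 .. n - 1));
  @ assigns \nothing;
  @ ensures 0 <= \result <= n;
  @*/
size_t count_proved(const enum outcome *results, size_t n)
{
    size_t proved = 0;
    /*@ loop invariant 0 <= i <= n;
      @ loop invariant 0 <= proved <= i;
      @ loop assigns i, proved;
      @ loop variant n - i;
      @*/
    for (size_t i = 0; i < n; i++) {
        if (results[i] == PROVED)
            proved++;
    }
    return proved;
}
```

A goal for which `count_proved` is neither zero nor the number of solvers is solver-sensitive under this reading, and a goal for which `is_stable` returns 0 is unstable. Whether every annotation above discharges automatically depends on the Frama-C version and the solvers installed.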
Results
The findings from this study provide new empirical evidence on the capabilities and limitations of automated ACSL generation systems. While some models demonstrated promising results, others struggled to produce accurate annotations. The study highlights the importance of understanding the trade-offs between automated generation methods and human expertise in software verification.
Conclusion
This research contributes to the growing body of literature on formal verification and automated specification generation. By empirically evaluating five generation systems under the same verification setup, we aim to clarify their effectiveness and limitations. The findings are intended to help researchers and practitioners who want to improve software reliability through better specification generation.
Future Work
Further research is needed to optimize the performance of automated ACSL generation systems. Exploring hybrid approaches that combine human expertise with machine-generated annotations could yield better results and pave the way for more robust software verification processes.
