Automated ACSL Annotation Evaluation for Formal Verification

Evaluating LLM-Generated ACSL Annotations for Formal Verification

Source: arXiv:2602.13851v2

Abstract: Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance.

Introduction

The demand for high-quality software systems has never been greater, particularly in safety-critical domains such as healthcare, finance, and transportation. As software complexity continues to rise, so does the need for rigorous methods to ensure its reliability. Formal specifications serve as a foundation for building verifiable software systems, yet the task of generating these specifications, especially for real-world C programs, presents significant challenges.

Methodology

This paper presents a controlled study that evaluates how well several tools generate ACSL (ANSI/ISO C Specification Language) annotations. We used a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting.
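To make concrete what an ACSL annotation looks like, here is a minimal sketch of a contract on a small C function. The function `max_of` and its contract are illustrative assumptions, not taken from the dataset. ACSL lives in `/*@ ... */` comments, so the file still compiles as ordinary C.

```c
#include <stddef.h>

/* Hypothetical example: returns the largest element of a non-empty array.
   The contract below is the kind of specification a generator is asked to
   produce; it is not drawn from the paper's benchmark. */

/*@ requires n > 0;
  @ requires \valid_read(a + (0 .. n-1));
  @ assigns \nothing;
  @ ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
  @ ensures \exists integer i; 0 <= i < n && \result == a[i];
  @*/
int max_of(const int *a, size_t n) {
    int best = a[0];
    /*@ loop invariant 1 <= i <= n;
      @ loop invariant \forall integer k; 0 <= k < i ==> best >= a[k];
      @ loop invariant \exists integer k; 0 <= k < i && best == a[k];
      @ loop assigns i, best;
      @ loop variant n - i;
      @*/
    for (size_t i = 1; i < n; i++) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    return best;
}
```

With Frama-C installed, a command along the lines of `frama-c -wp -wp-rte max_of.c` asks WP to try to prove the contract together with the runtime-error guards. Whether every goal is discharged depends on the solvers and timeouts available.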

Five ACSL generation systems were analyzed:

A rule-based Python script
Frama-C’s RTE plugin
DeepSeek-V3.2, a large language model
GPT-5.2, a second large language model
OLMo 3.1 32B Instruct, a third language model
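The systems above differ in what kind of annotation they produce. As a rough sketch, Frama-C's RTE plugin inserts assertions that guard against runtime errors such as invalid memory accesses and signed overflow, rather than writing functional contracts. The function `add_at` is a hypothetical example, the assertions approximate the shape of RTE output rather than reproducing it exactly, and the literal bounds assume a 32-bit `int`.

```c
/* Hypothetical function, annotated in the style of RTE-generated guards.
   Each assertion states that the next operation has defined behavior;
   none of them says what the function is supposed to compute. */
int add_at(const int *p, int i, int delta) {
    /*@ assert rte: mem_access: \valid_read(p + i); */
    /*@ assert rte: signed_overflow: -2147483648 <= *(p + i) + delta; */
    /*@ assert rte: signed_overflow: *(p + i) + delta <= 2147483647; */
    return p[i] + delta;
}
```

Without a `requires` clause constraining `p`, `i`, and `delta`, a prover has no basis for discharging these guards, which is one reason guard-only annotations and full contracts can behave differently under verification.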

Evaluation Process

All generated ACSL specifications were verified under controlled conditions using the Frama-C WP plugin, which is powered by multiple SMT (Satisfiability Modulo Theories) solvers. This setup allowed for a direct comparison of several factors:

Annotation Quality: Assessing the correctness and completeness of generated specifications.
Solver Sensitivity: Evaluating how different solvers reacted to the generated annotations.
Proof Stability: Analyzing the consistency of verification results across multiple runs.
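One way to picture the three factors is as simple counts over per-goal verification outcomes. The sketch below is an assumption about how such counts could be computed. The `GoalOutcome` layout, the solver and run counts, and the three measures are all hypothetical and are not the paper's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_SOLVERS 2   /* hypothetical: number of SMT solvers compared */
#define N_RUNS    3   /* hypothetical: repeated identical WP runs */

/* Outcome of one proof goal: proved[s][r] is true when solver s
   discharged the goal in run r. */
typedef struct {
    bool proved[N_SOLVERS][N_RUNS];
} GoalOutcome;

/* True when solver s proves the goal in every run. */
static bool always_proved(const GoalOutcome *g, size_t s) {
    for (size_t r = 0; r < N_RUNS; r++)
        if (!g->proved[s][r]) return false;
    return true;
}

/* True when any solver's first-run verdict differs from solver 0's. */
static bool solvers_disagree(const GoalOutcome *g) {
    for (size_t s = 1; s < N_SOLVERS; s++)
        if (g->proved[s][0] != g->proved[0][0]) return true;
    return false;
}

/* True when some solver's verdict changes between runs. */
static bool is_unstable(const GoalOutcome *g) {
    for (size_t s = 0; s < N_SOLVERS; s++)
        for (size_t r = 1; r < N_RUNS; r++)
            if (g->proved[s][r] != g->proved[s][0]) return true;
    return false;
}

/* Quality proxy: share of goals that some solver proves in every run. */
double proved_fraction(const GoalOutcome *goals, size_t n) {
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t s = 0; s < N_SOLVERS; s++) {
            if (always_proved(&goals[i], s)) { ok++; break; }
        }
    }
    return n ? (double)ok / (double)n : 0.0;
}

/* Solver sensitivity proxy: number of goals on which solvers disagree. */
size_t count_solver_disagreements(const GoalOutcome *goals, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (solvers_disagree(&goals[i])) count++;
    return count;
}

/* Proof stability proxy: number of goals whose verdict varies across runs. */
size_t count_unstable(const GoalOutcome *goals, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (is_unstable(&goals[i])) count++;
    return count;
}
```

Keeping the three measures as separate functions over the same outcome records mirrors the idea of comparing generators under one shared verification setup, so differences in the counts can be attributed to the annotations rather than to the harness.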

Results

The findings from this study provide new empirical evidence on the capabilities and limitations of automated ACSL generation systems. While some models demonstrated promising results, others struggled to produce accurate annotations. The study highlights the importance of understanding the trade-offs between automated generation methods and human expertise in software verification.

Conclusion

This research contributes to the growing body of literature on formal verification and automated specification generation. By empirically evaluating the performance of various tools, we aim to clarify their effectiveness and limitations. The insights from this study should be useful to researchers and practitioners seeking to improve the reliability of software systems through effective specification generation.

Future Work

Further research is needed to optimize the performance of automated ACSL generation systems. Exploring hybrid approaches that combine human expertise with machine-generated annotations could yield better results and pave the way for more robust software verification processes.
