Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code
Summary: arXiv:2604.05292v1 Announce Type: cross
Abstract
AI coding assistants are increasingly utilized to generate production code, particularly in security-sensitive domains. However, the extent to which the code generated by these assistants is vulnerable to exploitation remains largely unquantified. This study, titled Broken by Default, seeks to fill this critical gap by conducting a formal verification analysis of 3,500 code artifacts produced by seven advanced large language models (LLMs) across 500 security-critical prompts. These prompts span five Common Weakness Enumeration (CWE) categories, with 100 prompts allocated to each category.
Key Findings
Each generated artifact is rigorously analyzed using the Z3 Satisfiability Modulo Theories (SMT) solver through the COBALT analysis pipeline. This method produces mathematical satisfiability witnesses instead of relying on pattern-based heuristics. The findings reveal significant vulnerabilities across all examined models:
- 55.8% of the generated artifacts contain at least one vulnerability identified by COBALT.
- Out of these, 1,055 vulnerabilities are formally proven through Z3 satisfiability witnesses.
- Among the models analyzed, GPT-4o exhibits the highest vulnerability rate at 62.4%, resulting in a grade of F.
- Gemini 2.5 Flash performs relatively better, with a vulnerability rate of 48.4%, earning a grade of D.
- No model achieved a grade higher than D.
Runtime Crashes and Additional Experiments
Six out of seven representative findings were confirmed by runtime crashes detected under the GCC AddressSanitizer. The study also includes three auxiliary experiments that provide further insights:
- Explicit Security Instructions: When specific security instructions were provided, the mean vulnerability rate decreased by only 4 percentage points.
- Industry Tool Performance: A combination of six industry-standard tools failed to detect 97.8% of the vulnerabilities proven by Z3.
- Model Self-Identification: In review mode, the models recognized 78.7% of their own vulnerable outputs, yet they generated these vulnerabilities at a rate of 55.8% by default.
Conclusion
The findings of this study underscore the pressing need for enhanced verification processes and security measures when utilizing AI-generated code in sensitive applications. As AI coding assistants continue to evolve, understanding their limitations and the inherent vulnerabilities in their outputs is crucial for mitigating potential security risks.
