Formal Verification Reveals AI Code Security Flaws

Date:

Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

Summary: arXiv:2604.05292v1 Announce Type: cross

Abstract

AI coding assistants are increasingly utilized to generate production code, particularly in security-sensitive domains. However, the extent to which the code generated by these assistants is vulnerable to exploitation remains largely unquantified. This study, titled Broken by Default, seeks to fill this critical gap by conducting a formal verification analysis of 3,500 code artifacts produced by seven advanced large language models (LLMs) across 500 security-critical prompts. These prompts span five Common Weakness Enumeration (CWE) categories, with 100 prompts allocated to each category.

Key Findings

Each generated artifact is rigorously analyzed using the Z3 Satisfiability Modulo Theories (SMT) solver through the COBALT analysis pipeline. This method produces mathematical satisfiability witnesses instead of relying on pattern-based heuristics. The findings reveal significant vulnerabilities across all examined models:

  • 55.8% of the generated artifacts contain at least one vulnerability identified by COBALT.
  • Out of these, 1,055 vulnerabilities are formally proven through Z3 satisfiability witnesses.
  • Among the models analyzed, GPT-4o exhibits the highest vulnerability rate at 62.4%, resulting in a grade of F.
  • Gemini 2.5 Flash performs relatively better, with a vulnerability rate of 48.4%, earning a grade of D.
  • No model achieved a grade higher than D.

Runtime Crashes and Additional Experiments

Six out of seven representative findings were confirmed by runtime crashes detected under the GCC AddressSanitizer. The study also includes three auxiliary experiments that provide further insights:

  • Explicit Security Instructions: When specific security instructions were provided, the mean vulnerability rate decreased by only 4 percentage points.
  • Industry Tool Performance: A combination of six industry-standard tools failed to detect 97.8% of the vulnerabilities proven by Z3.
  • Model Self-Identification: In review mode, the models recognized 78.7% of their own vulnerable outputs, yet they generated these vulnerabilities at a rate of 55.8% by default.

Conclusion

The findings of this study underscore the pressing need for enhanced verification processes and security measures when utilizing AI-generated code in sensitive applications. As AI coding assistants continue to evolve, understanding their limitations and the inherent vulnerabilities in their outputs is crucial for mitigating potential security risks.


Related AI Insights

Lazarus Omolua
Lazarus Omoluahttps://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.

Subscribe

Popular

More like this
Related

How Business Ops Teams Boost Productivity with Codex

Discover how business operations teams use Codex to streamline documentation, enhance collaboration, and improve decision-making with AI-powered automation...

OpenAI Partners with Malta to Offer ChatGPT Plus Nationwide

OpenAI and Malta team up to provide free ChatGPT Plus access and AI training to all citizens, promoting digital literacy and responsible AI use.

Critical Linux Kernel Flaw Risks SSH Host Key Theft

A critical Linux kernel flaw risks stolen SSH host keys. Learn how to protect your systems and stay secure until patches are widely available.

Top External Hard Drives 2026: Expert Reviews & Buying Guide

Discover the best external hard drives of 2026 with expert reviews. Find top picks for speed, durability, and security to suit all storage needs.