Delulu: Multi-Lingual Benchmark for Detecting Code Hallucinations

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Large Language Models (LLMs) used for code generation can hallucinate: during Fill-in-the-Middle (FIM) tasks they produce completions that look plausible but are incorrect. Such errors can include invented API methods, invalid parameters, undefined variables, or non-existent imports, all of which lead to runtime errors that are costly in software development.
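To make these error kinds concrete, the short Python sketch below runs one hand-written snippet per kind and reports the runtime error it raises. The snippets, the dictionary name, and the helper `runtime_error_name` are illustrative assumptions, not samples or code from Delulu, and the four labels simply follow the error kinds listed above rather than the benchmark's own category names.

```python
# Illustrative snippets only; none of these come from the Delulu benchmark.
HALLUCINATION_EXAMPLES = {
    # str has no method called split_by (the real method is split)
    "invented API method": '"a,b".split_by(",")',
    # sorted() takes reverse=..., not descending=...
    "invalid parameter": "sorted([3, 1, 2], descending=True)",
    # count is never defined before it is used
    "undefined variable": "total = count + 1",
    # hypothetical module name that does not exist
    "non-existent import": "import no_such_module_for_demo",
}


def runtime_error_name(snippet: str) -> str:
    """Run a snippet in a fresh namespace and return the exception class name."""
    try:
        exec(snippet, {})
    except Exception as exc:
        return type(exc).__name__
    return "no error"


for kind, snippet in HALLUCINATION_EXAMPLES.items():
    print(f"{kind}: {runtime_error_name(snippet)}")
```

Each snippet parses as valid Python and only fails when it runs, which is why a runtime check is needed to catch this class of error.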

To address this, researchers have introduced Delulu, a verified multi-lingual benchmark for detecting code hallucinations in FIM tasks. It comprises 1,951 FIM samples drawn from seven programming languages and categorized into four hallucination types. The samples were curated through an adversarial pipeline, outlined in the next section.

The Curation Process Behind Delulu

The development of Delulu involved a multi-step process aimed at enhancing the quality and reliability of the benchmark:

  • Generation of Plausible Hallucinations: A frontier LLM was tasked with generating plausible but incorrect code completions.
  • Evaluation by Diverse Judge Models: Four different judge models assessed the generated hallucinations, ensuring a varied perspective on the validity of the outputs.
  • Mining Harder Examples: Embedding-based clustering techniques were utilized to identify progressively challenging examples, further refining the benchmark.
  • Verification through Docker Containers: Self-contained Docker containers were employed to verify that the golden completions compiled correctly while the hallucinated variants consistently produced the expected runtime errors.
  • Final Human-Expert Review: A thorough review by human experts was conducted to eliminate any biased or trivially decidable samples, ensuring the integrity of the benchmark.
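The verification step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the benchmark's implementation: it uses an in-process `exec` as a stand-in for the Docker containers, and the `FimSample` fields, the helper names, and the sample itself are hypothetical. A sample counts as verified only if the golden completion runs cleanly and the hallucinated variant raises the expected error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FimSample:
    """Hypothetical record layout for one fill-in-the-middle sample."""
    prefix: str          # code before the hole
    suffix: str          # code after the hole
    golden: str          # correct completion
    hallucinated: str    # plausible but wrong completion
    expected_error: str  # exception class name the wrong completion should raise


def run_completion(sample: FimSample, middle: str) -> Optional[str]:
    """Fill the hole with `middle`, run the program, and return the error name or None."""
    program = sample.prefix + middle + sample.suffix
    try:
        exec(program, {})  # stand-in for running inside an isolated container
    except Exception as exc:
        return type(exc).__name__
    return None


def is_verified(sample: FimSample) -> bool:
    """True if the golden completion passes and the hallucination fails as expected."""
    golden_ok = run_completion(sample, sample.golden) is None
    hallucination_fails = (
        run_completion(sample, sample.hallucinated) == sample.expected_error
    )
    return golden_ok and hallucination_fails


# Made-up sample: json.parse does not exist in Python's json module.
sample = FimSample(
    prefix="import json\npayload = ",
    suffix="\nassert payload == {'a': 1}\n",
    golden="json.loads('{\"a\": 1}')",
    hallucinated="json.parse('{\"a\": 1}')",
    expected_error="AttributeError",
)
print(is_verified(sample))
```

Checking both directions matters: a sample whose golden completion fails, or whose hallucinated variant happens to run, would not measure what the benchmark intends to measure.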

Evaluation of Open-Weight FIM Models

Delulu has been used to evaluate 11 open-weight FIM models from five families, with parameter counts ranging from 0.5 billion to 32 billion. The evaluation covers a six-point scaling slate of Qwen2.5-Coder models and a cross-family slate that includes models such as CodeLlama, DeepSeek-Coder-V2, and StarCoder2.

The findings from this evaluation revealed some concerning trends:

  • The strongest model reached a pass@1 (pass rate on the first attempt) of only 84.5%.
  • No model family exceeded an Edit Similarity score of 0.77.
  • Every model family generated hallucination-aligned completions for a significant portion of the samples, indicating that the difficulties identified by Delulu are inherent to the tasks themselves rather than specific to any one model family.
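For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It is an assumption-laden illustration: the article does not give Delulu's exact formulas, so Edit Similarity is taken here to be 1 minus the character-level Levenshtein distance divided by the longer string's length, and pass@1 is taken to be the fraction of samples whose single completion passes. The function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(prediction, reference) / longest


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of samples whose single generated completion passed."""
    if not passed:
        raise ValueError("need at least one result")
    return sum(passed) / len(passed)


# Made-up results for four samples, one completion each.
print(pass_at_1([True, False, True, True]))
print(edit_similarity("json.parse(text)", "json.loads(text)"))
```

Under this definition a hallucinated completion can score a high Edit Similarity while still failing at runtime, so the two metrics are best read together.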

Conclusion and Future Directions

With Delulu released, the research community has a tool for assessing and improving the reliability of LLM code generation. The benchmark, along with the accompanying containers and evaluation framework, is available on GitHub at https://github.com/microsoft/delulu. The release supports a better understanding of hallucinations in coding tasks and the development of models that generate code more reliably.

Lazarus Omolua (https://richlyai.com/blog)