Delulu: Multi-Lingual Benchmark for Detecting Code Hallucinations

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Large Language Models (LLMs) used for code generation are prone to hallucinations: completions that look plausible but are wrong. In Fill-in-the-Middle (FIM) tasks, where a model fills a gap between a given prefix and suffix, these errors include invented API methods, invalid parameters, undefined variables, and non-existent imports, each of which leads to a runtime error.
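To make the four kinds of error listed above concrete, here is a minimal Python sketch that runs one tiny snippet per kind and records the exception it raises. The snippets, the `failure_type` helper, and the module name `delulu_nonexistent_module_xyz` are hypothetical illustrations and are not taken from the benchmark.

```python
# One illustrative snippet per error kind (all hypothetical, not benchmark samples).
SNIPPETS = {
    "invented API method": "'delulu'.to_upper()",
    "invalid parameter": "sorted([3, 1, 2], descending=True)",
    "undefined variable": "total = count + 1",
    "non-existent import": "import delulu_nonexistent_module_xyz",
}


def failure_type(source):
    """Run source in a fresh namespace; return the exception class name, or None."""
    try:
        exec(source, {})
    except Exception as exc:
        return type(exc).__name__
    return None


observed = {kind: failure_type(src) for kind, src in SNIPPETS.items()}
for kind, error in observed.items():
    print(f"{kind}: {error}")
```

Each snippet is syntactically valid and looks reasonable at a glance, which is what makes this class of error hard to catch without actually running the code.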

To address this, researchers introduced Delulu, a verified multi-lingual benchmark for detecting code hallucinations in FIM tasks. It comprises 1,951 FIM samples spanning seven programming languages and four hallucination types. The samples were curated through an adversarial pipeline.

The Curation Process Behind Delulu

Delulu was built through a five-step pipeline designed to keep the benchmark both difficult and reliable:

Generation of Plausible Hallucinations: A frontier LLM was tasked with generating plausible but incorrect code completions.
Evaluation by Diverse Judge Models: Four different judge models assessed the generated hallucinations, so that no single model's view determined which outputs were kept.
Mining Harder Examples: Embedding-based clustering was used to identify progressively harder examples, further refining the benchmark.
Verification through Docker Containers: Self-contained Docker containers were employed to verify that the golden completions compiled correctly while the hallucinated variants consistently produced the expected runtime errors.
Final Human-Expert Review: A thorough review by human experts was conducted to eliminate any biased or trivially decidable samples, ensuring the integrity of the benchmark.
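The container verification step can be approximated in a few lines. The sketch below is not the benchmark's harness: it swaps the Docker containers for a fresh Python subprocess per program, and the `run_isolated` and `verify_pair` helpers as well as the example programs are assumptions made for illustration. It accepts a pair only if the golden program exits cleanly and the hallucinated one fails with the expected error.

```python
import subprocess
import sys


def run_isolated(source, timeout=10.0):
    """Run source in a separate interpreter process (a stand-in for a container)."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def verify_pair(golden_src, hallucinated_src, expected_error):
    """Accept a pair only if golden succeeds and the hallucination fails as expected."""
    golden = run_isolated(golden_src)
    hallucinated = run_isolated(hallucinated_src)
    return (
        golden.returncode == 0
        and hallucinated.returncode != 0
        and expected_error in hallucinated.stderr
    )


# Hypothetical pair: math.sqrt exists, math.square_root does not.
golden_program = "import math\nprint(math.sqrt(16))\n"
hallucinated_program = "import math\nprint(math.square_root(16))\n"

pair_is_valid = verify_pair(golden_program, hallucinated_program, "AttributeError")
print(pair_is_valid)
```

Checking for a specific error name, rather than any non-zero exit code, guards against pairs where the hallucinated variant fails for an unrelated reason such as a missing dependency.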

Evaluation of Open-Weight FIM Models

Delulu was used to evaluate 11 open-weight FIM models from five families, ranging from 0.5 billion to 32 billion parameters. The evaluation covered a six-point scaling slate of Qwen2.5-Coder models and a cross-family slate that includes CodeLlama, DeepSeek-Coder-V2, and StarCoder2.

The evaluation produced three main findings:

The strongest model reached only 84.5% pass@1, meaning it solved 84.5% of the samples on its first attempt.
No model family exceeded an Edit Similarity score of 0.77.
Every model family produced hallucination-aligned completions for a substantial share of the samples, which suggests that the difficulty Delulu captures is inherent to the tasks rather than specific to any one model family.
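For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It assumes Edit Similarity is one minus the Levenshtein distance divided by the length of the longer string, and that pass@1 is the fraction of samples whose single completion passes; the benchmark's exact definitions may differ, and the function names and example strings are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def edit_similarity(prediction, reference):
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


def pass_at_1(passed):
    """Fraction of samples whose single completion passed verification."""
    return sum(passed) / len(passed)


print(edit_similarity("to_upper", "upper"))      # 0.625
print(pass_at_1([True, True, False, True]))      # 0.75
```

The two metrics answer different questions: pass@1 checks whether a completion works, while Edit Similarity measures how closely its text matches a reference, so a completion can score well on one and poorly on the other.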

Conclusion and Future Directions

With the release of Delulu, the research community has a tool for measuring and improving the reliability of LLM code generation. The benchmark, its containers, and the evaluation framework are available on GitHub at https://github.com/microsoft/delulu. The benchmark improves understanding of hallucinations in coding tasks and supports work on models that generate code more reliably.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
