Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Large Language Models (LLMs) used for code generation are prone to hallucinations: completions that look plausible but are incorrect. In Fill-in-the-Middle (FIM) tasks, where the model fills a gap between a given prefix and suffix, such errors can include invented API methods, invalid parameters, undefined variables, or non-existent imports, all of which lead to runtime errors.
To address this, researchers have introduced Delulu, a verified multi-lingual benchmark designed specifically to detect code hallucinations in FIM tasks. The benchmark comprises 1,951 FIM samples drawn from seven programming languages and categorized into four hallucination types. The samples were curated through an adversarial pipeline that ends with execution-based verification and human-expert review.
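To make the task concrete, here is a minimal sketch of what a single FIM sample and its prompt might look like. The `FimSample` fields, the type label, and the example code are hypothetical illustrations, not Delulu's actual schema. The sentinel tokens are the ones documented for Qwen2.5-Coder; other model families use different tokens.

```python
from dataclasses import dataclass

# FIM sentinel tokens as documented for Qwen2.5-Coder.
# Other families (e.g. StarCoder2, CodeLlama) use different sentinels.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


@dataclass(frozen=True)
class FimSample:
    """Hypothetical record layout; the real benchmark schema may differ."""
    language: str
    hallucination_type: str  # illustrative label, not an official category name
    prefix: str              # code before the gap
    suffix: str              # code after the gap
    golden: str              # correct completion for the gap
    hallucinated: str        # plausible but incorrect completion


def build_fim_prompt(sample: FimSample) -> str:
    """Arrange prefix and suffix so the model generates the missing middle."""
    return f"{FIM_PREFIX}{sample.prefix}{FIM_SUFFIX}{sample.suffix}{FIM_MIDDLE}"


# Illustrative sample: Python's json module has loads(), but no parse().
SAMPLE = FimSample(
    language="python",
    hallucination_type="invented_api_method",
    prefix="import json\n\ndef load_config(raw: str) -> dict:\n    return ",
    suffix="\n",
    golden="json.loads(raw)",
    hallucinated="json.parse(raw)",
)

if __name__ == "__main__":
    print(build_fim_prompt(SAMPLE))
```

The hallucinated variant here reads naturally to anyone familiar with JavaScript's `JSON.parse`, which is what makes this kind of error easy to miss without running the code.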
The Curation Process Behind Delulu
Delulu was built through a multi-step pipeline designed to make the benchmark both difficult and reliable:
- Generation of Plausible Hallucinations: A frontier LLM was tasked with generating plausible but incorrect code completions.
- Evaluation by Diverse Judge Models: Four different judge models assessed the generated hallucinations, ensuring a varied perspective on the validity of the outputs.
- Mining Harder Examples: Embedding-based clustering techniques were utilized to identify progressively challenging examples, further refining the benchmark.
- Verification through Docker Containers: Self-contained Docker containers were employed to verify that the golden completions compiled correctly while the hallucinated variants consistently produced the expected runtime errors.
- Final Human-Expert Review: A thorough review by human experts was conducted to eliminate any biased or trivially decidable samples, ensuring the integrity of the benchmark.
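The acceptance criterion in the verification step can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's actual harness: it runs Python snippets in a local subprocess as a stand-in for the self-contained Docker containers, and the helper names and the sample are hypothetical.

```python
import subprocess
import sys


def runs_cleanly(source: str, timeout: float = 10.0) -> bool:
    """Return True if the program exits with status 0.

    Assumption: a local Python subprocess stands in for a Docker container.
    A timeout is treated as a failure.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def assemble(prefix: str, middle: str, suffix: str) -> str:
    """Place a candidate completion into the gap between prefix and suffix."""
    return prefix + middle + suffix


def is_verified(prefix: str, golden: str, hallucinated: str, suffix: str) -> bool:
    """Keep a sample only if the golden completion runs and the hallucinated one fails."""
    golden_ok = runs_cleanly(assemble(prefix, golden, suffix))
    hallucinated_ok = runs_cleanly(assemble(prefix, hallucinated, suffix))
    return golden_ok and not hallucinated_ok


# Hypothetical sample: str has upper(), but no to_uppercase() method.
PREFIX = "text = 'delulu'\nresult = "
GOLDEN = "text.upper()"
HALLUCINATED = "text.to_uppercase()"
SUFFIX = "\nassert result == 'DELULU'\n"

if __name__ == "__main__":
    print(is_verified(PREFIX, GOLDEN, HALLUCINATED, SUFFIX))
```

Requiring both conditions is the point of the check: a sample whose hallucinated variant happens to run is not a usable hallucination example, and one whose golden completion fails is simply broken.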
Evaluation of Open-Weight FIM Models
Delulu was used to evaluate 11 open-weight FIM models from five families, ranging from 0.5 billion to 32 billion parameters. The evaluation included a six-point scaling slate of Qwen2.5-Coder models, as well as a cross-family slate featuring models such as CodeLlama, DeepSeek-Coder-V2, and StarCoder2.
The findings from this evaluation revealed some concerning trends:
- The strongest model reached only 84.5% pass@1, the share of samples it completed correctly on the first attempt.
- No model family exceeded an Edit Similarity score of 0.77 against the golden completions.
- Every model family generated hallucination-aligned completions for a significant portion of the samples, indicating that the difficulties identified by Delulu are inherent to the tasks themselves rather than specific to any one model family.
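The first two metrics above can be illustrated with a short sketch. The definitions here are assumptions: pass@1 is computed as the fraction of samples whose single generated completion passes, and Edit Similarity uses one common definition, 1 minus the Levenshtein distance divided by the longer string's length. The benchmark's exact formulas may differ.

```python
from typing import Sequence


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of samples whose single generated completion passed."""
    if not passed:
        raise ValueError("at least one result is required")
    return sum(passed) / len(passed)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Assumed definition: 1 - distance / max length, in the range [0, 1]."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    # Hypothetical results for four samples, one completion each.
    print(pass_at_1([True, False, True, True]))
    print(edit_similarity("json.parse(raw)", "json.loads(raw)"))
```

The two metrics capture different things: a completion can be textually close to the golden one and still fail at runtime, which is exactly the situation a hallucinated API call creates.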
Conclusion and Future Directions
With the release of Delulu, the research community has a benchmark for measuring and improving the reliability of LLM code generation. The benchmark, along with the accompanying containers and evaluation framework, is available on GitHub at https://github.com/microsoft/delulu. The release should help researchers better understand hallucinations in coding tasks and build models that generate code more reliably.
