LLM-as-Judge Framework to Detect Hallucination in VLMs


LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed in applications where reliable visual grounding is critical, yet how they behave under varying degrees of prompt coercion is poorly understood. A recent study (arXiv:2604.18803v1) addresses this gap by examining how the tone of a prompt influences hallucination in these models.

Understanding Hallucination in VLMs

Hallucination refers to instances where a model generates incorrect or fabricated information. Current hallucination benchmarks rely mainly on neutral prompts and binary detection, so they do not capture how VLMs respond to different levels of linguistic pressure. To fill this gap, the researchers introduce Ghost-100, a benchmark of 800 synthetically generated images spanning eight categories.

Introducing Ghost-100

Ghost-100 is designed to assess the impact of prompt intensity on model behavior in three task families: text-illegibility, time-reading, and object-absence. Each image is constructed under a negative-ground-truth principle, so the queried target is absent, illegible, or indeterminate by construction, and any specific positive answer is therefore unsupported. Each image is paired with five prompts that vary in directive force, which isolates tone as the primary independent variable.
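The pairing of one image with five prompts of increasing directive force can be sketched as a small data structure. This is only an illustration under stated assumptions: the article does not give the benchmark's prompt wording, so the `TONE_TEMPLATES` texts, the `Trial` class, and the `build_trials` helper are hypothetical names invented here.

```python
from dataclasses import dataclass

# Task families named in the article.
TASK_FAMILIES = ("text-illegibility", "time-reading", "object-absence")

# Hypothetical tone ladder. The article states that there are five prompts
# of varying directive force but does not quote them, so these are invented.
TONE_TEMPLATES = {
    1: "{question} It is fine to say you cannot tell.",
    2: "{question}",
    3: "{question} Please give a specific answer.",
    4: "{question} You must give a specific answer.",
    5: "{question} Answer now. Refusing is not acceptable.",
}


@dataclass(frozen=True)
class Trial:
    """One (image, prompt) pair; tone_level is the only thing that varies."""
    image_id: str
    task_family: str
    tone_level: int
    prompt: str


def build_trials(image_id: str, task_family: str, question: str) -> list[Trial]:
    """Pair a single image with all five tone levels of the same question."""
    if task_family not in TASK_FAMILIES:
        raise ValueError(f"unknown task family: {task_family!r}")
    return [
        Trial(image_id, task_family, level, template.format(question=question))
        for level, template in sorted(TONE_TEMPLATES.items())
    ]
```

Because the image and the underlying question are held fixed across the five trials, any change in the model's answer can be attributed to tone alone, which is the point of the design described above.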

Evaluation Methodology

The evaluation uses a dual-track protocol built on two metrics:

  • H-Rate: A rule-based measurement that quantifies the proportion of responses where a model shifts from a grounded refusal to an unsupported positive assertion.
  • H-Score: A score from 1 to 5, assigned by GPT-4o-mini acting as judge, that rates how confident and specific a fabricated response is once the model has hallucinated.
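The two metrics above can be sketched as follows. This is a minimal sketch, not the study's implementation: the refusal patterns are hypothetical (the article does not list the benchmark's rules), and judge scores are passed in as plain integers rather than obtained from GPT-4o-mini.

```python
import re
from statistics import mean

# Hypothetical refusal cues; the actual rule set is not given in the article.
REFUSAL_PATTERNS = [
    r"\bcan(?:not|'t)\b",
    r"\bunable to\b",
    r"\bthere is no\b",
    r"\bnot (?:visible|readable|legible|present|possible)\b",
    r"\billegible\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_hallucination(response: str) -> bool:
    """Treat any response without a refusal cue as an unsupported assertion.

    This relies on the negative-ground-truth design: the queried target is
    never actually present, so a non-refusal cannot be grounded.
    """
    return _REFUSAL_RE.search(response) is None


def h_rate(responses: list[str]) -> float:
    """Fraction of responses that cross from refusal into positive assertion."""
    if not responses:
        raise ValueError("h_rate needs at least one response")
    return sum(is_hallucination(r) for r in responses) / len(responses)


def h_score(responses: list[str], judge_scores: list[int]):
    """Mean 1-5 judge score over hallucinated responses only (None if none)."""
    if len(responses) != len(judge_scores):
        raise ValueError("responses and judge_scores must have equal length")
    if any(not 1 <= s <= 5 for s in judge_scores):
        raise ValueError("judge scores must lie in the range 1 to 5")
    fabricated = [s for r, s in zip(responses, judge_scores) if is_hallucination(r)]
    return mean(fabricated) if fabricated else None
```

Keeping the two functions separate mirrors the dual-track idea: `h_rate` answers whether the model fabricated at all, while `h_score` describes the fabrications only. A keyword rule like this one is crude; for instance, it would count "I cannot be sure, but it says 3:45" as a refusal, which is one reason a real rule set would need more care.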

Findings and Insights

The study evaluates nine open-weight VLMs and finds that H-Rate and H-Score differ notably across model families. The reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways. Several models also show non-monotonic sensitivity: their hallucination peaks at intermediate tone levels rather than at the most forceful prompt. The relationship between prompt intensity and model output is therefore not a simple increasing one, and aggregate metrics that average over tone levels may obscure these patterns.
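To see why an aggregate number can hide a non-monotonic pattern, the sketch below computes H-Rate per tone level and checks for a peak at an intermediate level. The function names are hypothetical, and the two rate profiles are toy inputs invented for illustration, not results from the study.

```python
from statistics import mean


def h_rate_by_level(records):
    """Group (tone_level, hallucinated) pairs and return H-Rate per level."""
    counts = {}
    for level, hallucinated in records:
        total, hits = counts.get(level, (0, 0))
        counts[level] = (total + 1, hits + int(hallucinated))
    return {level: hits / total for level, (total, hits) in sorted(counts.items())}


def peaks_at_intermediate_level(rates):
    """True if some interior tone level exceeds both the lowest and highest level."""
    levels = sorted(rates)
    if len(levels) < 3:
        return False
    endpoints = max(rates[levels[0]], rates[levels[-1]])
    return any(rates[level] > endpoints for level in levels[1:-1])


# Toy profiles (invented numbers): same average, different shape.
steadily_rising = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5}
mid_level_peak = {1: 0.1, 2: 0.3, 3: 0.6, 4: 0.3, 5: 0.2}

aggregate_rising = mean(steadily_rising.values())
aggregate_peaked = mean(mid_level_peak.values())
```

Both toy profiles average to the same aggregate rate, yet only the second one returns `True` from `peaks_at_intermediate_level`. Reporting the per-level breakdown alongside the aggregate keeps that difference visible.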

Conclusion

Together, the Ghost-100 benchmark and the LLM-as-Judge evaluation offer a way to measure how prompt tone interacts with hallucination in VLMs. As these models are integrated into real-world applications, measurements at this level of detail can help show where their visual grounding breaks down under pressure.



Lazarus Omolua (https://richlyai.com/blog)