LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Vision-Language Models (VLMs) are increasingly used in applications where reliable visual grounding is critical, yet little is known about how they behave under varying degrees of prompt coercion. A recent study, arXiv:2604.18803v1, addresses this gap by examining how the tone of a prompt influences hallucination in these models.
Understanding Hallucination in VLMs
Hallucination refers to instances where a model generates incorrect or fabricated information. Current hallucination benchmarks rely mainly on neutral prompts and binary detection, so they do not capture how VLMs respond to different levels of linguistic pressure. To address this, the researchers introduce Ghost-100, a benchmark of 800 synthetically generated images across eight categories.
Introducing Ghost-100
Ghost-100 is designed to assess the impact of prompt intensity on model performance in three task families: text-illegibility, time-reading, and object-absence. Each image is constructed under a negative-ground-truth principle, so that the queried target is absent, illegible, or indeterminate. This design isolates tone as the primary independent variable: each image is paired with five prompts that vary in directive force.
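The pairing of each image with five tone levels can be sketched as a simple cross product. This is a minimal illustration, not the benchmark's actual code: the template wording, the `Trial` type, the `build_trials` helper, and the image identifiers are all hypothetical, since this summary does not reproduce the prompts used in Ghost-100.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical tone templates, ordered from least (1) to most (5) directive.
# The real Ghost-100 prompt wording is not given in this summary.
TONE_TEMPLATES = {
    1: "If you can, tell me {query}. It is fine to say you cannot tell.",
    2: "Please tell me {query}.",
    3: "Tell me {query}.",
    4: "You must tell me {query}. Give a specific answer.",
    5: "Answer now: {query}. Refusing is not acceptable.",
}


@dataclass(frozen=True)
class Trial:
    image_id: str
    tone_level: int
    prompt: str


def build_trials(images: dict[str, str]) -> list[Trial]:
    """Pair every image with every tone level.

    `images` maps an image identifier to the query asked about it.
    Only the tone template varies across the trials for one image,
    so tone is the single variable that changes.
    """
    return [
        Trial(image_id, level, template.format(query=query))
        for (image_id, query), (level, template) in product(
            images.items(), TONE_TEMPLATES.items()
        )
    ]
```

Because every image has a negative ground truth, any specific answer produced in one of these trials is unsupported regardless of tone level, which is what makes the per-tone comparison meaningful.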
Evaluation Methodology
The evaluation uses a dual-track protocol built on two metrics:
- H-Rate: A rule-based measurement that quantifies the proportion of responses where a model shifts from a grounded refusal to an unsupported positive assertion.
- H-Score: A score from 1 to 5, assigned by a GPT-4o-mini judge, that rates the confidence and specificity of a fabricated response once one occurs.
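The two metrics can be sketched as follows. This is one plausible reading under stated assumptions, not the paper's implementation: the refusal patterns, the function names, and the choice to report H-Rate per tone level are all hypothetical, and the judge scores are taken as given inputs rather than obtained from a model call.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical refusal markers; the paper's actual rule set is not given here.
REFUSAL_PATTERNS = [
    r"\bcan(?:not|'t)\b",
    r"\bunable to\b",
    r"\bthere is no\b",
    r"\bno (?:visible|legible|readable)\b",
    r"\bnot (?:visible|present|legible|readable|possible)\b",
    r"\b(?:illegible|unreadable|indeterminate)\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_hallucination(response: str) -> bool:
    """Under negative ground truth, any answer that is not a refusal
    counts as an unsupported positive assertion."""
    return _REFUSAL_RE.search(response) is None


def h_rate_by_tone(responses):
    """responses: iterable of (tone_level, response_text).
    Returns {tone_level: fraction of responses that are hallucinations}."""
    counts = defaultdict(lambda: [0, 0])  # tone -> [hallucinated, total]
    for tone, text in responses:
        counts[tone][0] += is_hallucination(text)
        counts[tone][1] += 1
    return {tone: h / n for tone, (h, n) in sorted(counts.items())}


def mean_h_score(judged):
    """judged: iterable of (response_text, judge_score), scores in 1..5.
    Averages the judge score over hallucinated responses only;
    returns None if there are none."""
    scores = []
    for text, score in judged:
        if not 1 <= score <= 5:
            raise ValueError(f"judge score out of range: {score}")
        if is_hallucination(text):
            scores.append(score)
    return mean(scores) if scores else None
```

A keyword rule like this is crude: a response that fabricates an answer and then hedges ("It says OPEN, though I can't be sure") would be counted as a refusal. Keeping the two tracks separate reflects the split described above, where H-Rate records whether fabrication happens and H-Score records how confident and specific it is.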
Findings and Insights
The study evaluates nine open-weight VLMs and finds notable differences in H-Rate and H-Score across model families. The reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways. Several models also show non-monotonic sensitivity, with hallucination peaking at intermediate tone levels rather than at the most forceful prompt. The relationship between prompt intensity and model output is therefore more complex than a simple increase with pressure, and existing aggregate metrics may obscure these patterns.
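To make the non-monotonicity point concrete, the sketch below checks whether a per-tone H-Rate curve both rises and falls, and where it peaks. The function names and the two example curves are hypothetical; the numbers are made up for illustration and are not results from the paper.

```python
def peak_tone(rates: dict[int, float]) -> int:
    """Return the tone level with the highest H-Rate
    (the lowest such level if there is a tie)."""
    return max(sorted(rates), key=lambda level: rates[level])


def is_non_monotonic(rates: dict[int, float], tol: float = 0.0) -> bool:
    """True if the curve both rises and falls by more than `tol`
    somewhere between consecutive tone levels."""
    values = [rates[level] for level in sorted(rates)]
    diffs = [b - a for a, b in zip(values, values[1:])]
    return any(d > tol for d in diffs) and any(d < -tol for d in diffs)


# Made-up curves for illustration only (tone level -> H-Rate).
PEAKED = {1: 0.10, 2: 0.25, 3: 0.40, 4: 0.30, 5: 0.20}
RISING = {1: 0.05, 2: 0.15, 3: 0.25, 4: 0.35, 5: 0.45}


def average(rates: dict[int, float]) -> float:
    """Aggregate H-Rate across all tone levels."""
    return sum(rates.values()) / len(rates)
```

Both example curves average to roughly 0.25, yet one peaks at tone level 3 and the other rises steadily to level 5. A single aggregate number cannot distinguish them, which is the kind of pattern the study warns that aggregate metrics may hide.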
Conclusion
The LLM-as-Judge framework and the Ghost-100 benchmark offer a way to measure how tone interacts with hallucination in VLMs: how often a model fabricates under pressure, and how confidently it does so. As these models are integrated into real-world applications, tone-aware evaluation of this kind can help show where their visual grounding breaks down.
