CDH-Bench: Evaluating Visual Fidelity in Vision-Language Models


CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Vision-Language Models (VLMs) have advanced rapidly and now score well on many benchmarks, yet one aspect of their reliability remains underexplored: how they handle the interaction between visual evidence and commonsense reasoning. When the two sources conflict, do models report what the image actually shows, or do they answer according to what commonsense says it should show? This article discusses a newly introduced benchmark, CDH-Bench, which addresses this question by evaluating a phenomenon termed commonsense-driven hallucination (CDH).

Understanding Commonsense-Driven Hallucination (CDH)

Commonsense-driven hallucination refers to instances where VLMs disregard visual cues in favor of commonsense knowledge. This can lead to erroneous conclusions that may not align with the actual visual context, highlighting a significant gap in the reliability of these models. CDH-Bench has been created to systematically evaluate VLMs under conditions where visual evidence conflicts with commonsense understanding.

Overview of CDH-Bench

CDH-Bench is designed to generate explicit conflicts between visual evidence and commonsense reasoning across three primary dimensions:

  • Counting Anomalies: Situations where the number of objects or entities depicted contradicts what is typically expected based on commonsense norms.
  • Relational Anomalies: Scenarios where the relationships between objects in a visual context do not conform to common relational understandings.
  • Attribute Anomalies: Instances where the attributes of objects do not match the commonsense expectations associated with them.
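
To make the three dimensions concrete, here is a minimal sketch of how a single conflict item might be represented. The class names, field names, and the three example items are all hypothetical and invented for illustration; the article does not describe the benchmark's actual data format or contents.

```python
from dataclasses import dataclass
from enum import Enum


class AnomalyType(Enum):
    """The three conflict dimensions named in the article."""
    COUNTING = "counting"
    RELATIONAL = "relational"
    ATTRIBUTE = "attribute"


@dataclass(frozen=True)
class ConflictItem:
    """Hypothetical schema for one benchmark item (not the official format)."""
    anomaly_type: AnomalyType
    question: str
    visual_answer: str        # what the image actually shows
    commonsense_answer: str   # what a commonsense prior would suggest

    def is_conflict(self) -> bool:
        # An item tests CDH only if the two answers disagree.
        return self.visual_answer != self.commonsense_answer


# Invented examples, one per dimension; none are taken from CDH-Bench.
EXAMPLES = [
    ConflictItem(AnomalyType.COUNTING,
                 "How many legs does the dog have?", "5", "4"),
    ConflictItem(AnomalyType.RELATIONAL,
                 "Where is the plate relative to the table?",
                 "under the table", "on the table"),
    ConflictItem(AnomalyType.ATTRIBUTE,
                 "What color is the banana?", "blue", "yellow"),
]
```

A model that answers "4", "on the table", or "yellow" here would be following its prior rather than the image, which is the failure mode the benchmark is meant to expose.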

Evaluation Metrics

To effectively assess the performance of VLMs on CDH-Bench, several metrics have been introduced:

  • Counterfactual Accuracy (CF-Acc): Accuracy on counterfactual items, where the image contradicts commonsense expectations and the correct answer is the one the image supports.
  • Commonsense Accuracy (CS-Acc): Accuracy on items where the visual evidence agrees with commonsense expectations, which serves as the baseline.
  • Counterfactual Accuracy Drop (CFAD): Indicates how much a model's accuracy decreases when the image conflicts with commonsense, relative to that baseline.
  • Commonsense Collapse Rate (CCR): Reflects how often a model falls back on the commonsense answer even though the image shows otherwise.
  • Relative Prior Dependency (RPD): Assesses the degree to which models rely on prior knowledge over visual evidence.
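
The article gives no formulas for these metrics, so the following is only a sketch under stated assumptions: CF-Acc and CS-Acc are treated as plain accuracy on conflicting and non-conflicting items, CFAD as their difference, and CCR as the share of conflicting items answered with the commonsense prior. The `Result` record and its field names are hypothetical. RPD is left out because the description above is not specific enough to reconstruct it.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical record of one model response (not the official format)."""
    condition: str      # "commonsense" or "counterfactual"
    prediction: str     # the model's answer
    visual_truth: str   # the answer the image supports
    prior_answer: str   # the answer commonsense would suggest


def accuracy(results, condition):
    """Fraction of items in one condition answered as the image shows."""
    subset = [r for r in results if r.condition == condition]
    if not subset:
        return 0.0
    return sum(r.prediction == r.visual_truth for r in subset) / len(subset)


def summarize(results):
    """Compute the assumed metric definitions over a list of results."""
    cs_acc = accuracy(results, "commonsense")
    cf_acc = accuracy(results, "counterfactual")
    cf = [r for r in results if r.condition == "counterfactual"]
    # Assumed CCR: share of conflicting items where the model gave the prior.
    ccr = sum(r.prediction == r.prior_answer for r in cf) / len(cf) if cf else 0.0
    return {
        "CS-Acc": cs_acc,
        "CF-Acc": cf_acc,
        "CFAD": cs_acc - cf_acc,  # assumed: baseline minus conflict accuracy
        "CCR": ccr,
    }
```

Under these assumptions, a large CFAD combined with a high CCR would suggest that a model's errors on conflicting images come mostly from reverting to the prior, not from random mistakes.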

Results and Insights

Preliminary evaluations of frontier VLMs indicate that even the most advanced models are vulnerable when visual evidence contradicts commonsense knowledge. The results underline the need for robust mechanisms that keep VLMs faithful to the image, particularly when what the image shows departs from what commonsense would predict.

Conclusion

CDH-Bench serves as a valuable diagnostic tool for understanding the complexities of commonsense-driven hallucinations in VLMs. As researchers continue to refine these models, CDH-Bench provides a structured approach to evaluate and improve visual fidelity in the face of commonsense conflicts, paving the way for more reliable applications of AI in real-world scenarios.


Lazarus Omolua