Distorted or Fabricated? A Survey on Hallucination in Video LLMs
arXiv:2604.12944v1
Abstract
Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs). Hallucinations refer to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases.
Understanding Hallucinations in Vid-LLMs
Hallucinations undermine the reliability of any system built on Vid-LLMs. The survey consolidates scattered advances in this area into a systematic account, and its taxonomy serves as a framework for categorizing hallucinations and reasoning about their implications.
Taxonomy of Hallucinations
The identified types of hallucinations in Vid-LLMs are as follows:
- Dynamic Distortion:
  - Subtype A: Distortions that affect the temporal coherence of the output.
  - Subtype B: Errors that arise from misinterpretation of motion dynamics in the video.
- Content Fabrication:
  - Subtype A: Inventions of characters or objects that do not exist in the original video.
  - Subtype B: Creation of misleading narratives that contradict the video content.
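The two-level taxonomy above can be written down as a small data structure, which is handy when tagging model outputs during error analysis. The following is a minimal sketch, not code from the survey: the names `HallucinationType`, `Subtype`, `TAXONOMY`, and `subtypes_of` are hypothetical, and the descriptions are paraphrased from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    """The two core hallucination types named in the survey's taxonomy."""
    DYNAMIC_DISTORTION = "dynamic distortion"
    CONTENT_FABRICATION = "content fabrication"


@dataclass(frozen=True)
class Subtype:
    """One subtype entry (hypothetical structure, for illustration only)."""
    parent: HallucinationType
    label: str          # "A" or "B", following the list above
    description: str    # paraphrased from the list above


TAXONOMY = (
    Subtype(HallucinationType.DYNAMIC_DISTORTION, "A",
            "temporal coherence of the output is distorted"),
    Subtype(HallucinationType.DYNAMIC_DISTORTION, "B",
            "motion dynamics in the video are misinterpreted"),
    Subtype(HallucinationType.CONTENT_FABRICATION, "A",
            "characters or objects absent from the video are invented"),
    Subtype(HallucinationType.CONTENT_FABRICATION, "B",
            "a misleading narrative contradicts the video content"),
)


def subtypes_of(parent: HallucinationType) -> list[Subtype]:
    """Return the subtypes that belong to one core type, in listed order."""
    return [s for s in TAXONOMY if s.parent is parent]
```

For example, `subtypes_of(HallucinationType.CONTENT_FABRICATION)` returns the two fabrication subtypes, which an annotator could use as the label set for a single error category.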
Recent Advances in Evaluation and Mitigation
In addition to introducing a taxonomy, the survey reviews recent advances in the evaluation and mitigation of hallucinations. Some key areas of focus include:
- Benchmarks: Evaluation frameworks that provide metrics for assessing the performance of Vid-LLMs in terms of hallucination occurrences.
- Intervention Strategies: Techniques aimed at minimizing the incidence of hallucinations through improved model architectures and training methodologies.
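As a rough illustration of what a hallucination-occurrence metric might look like, the sketch below computes per-type and overall hallucination rates from annotated model responses. This is an assumption for illustration, not the metric of any specific benchmark covered by the survey; the function name `hallucination_rates` and the label strings are hypothetical.

```python
from collections import Counter
from typing import Optional


def hallucination_rates(labels: list[Optional[str]]) -> dict[str, float]:
    """Compute the fraction of responses annotated with each hallucination type.

    `labels` holds one annotation per model response: None for a faithful
    response, otherwise a type name such as "dynamic_distortion".
    The "overall" key is the fraction of responses with any hallucination.
    """
    total = len(labels)
    if total == 0:
        raise ValueError("labels must contain at least one annotation")

    # Count only the responses that were flagged as hallucinated.
    counts = Counter(label for label in labels if label is not None)

    rates = {name: count / total for name, count in counts.items()}
    rates["overall"] = sum(counts.values()) / total
    return rates


# Usage with four hand-annotated responses (made-up annotations).
example = [None, "dynamic_distortion", "content_fabrication", None]
rates = hallucination_rates(example)
```

Reporting rates per type rather than a single score keeps the two core categories separable, so a mitigation that reduces fabrication but not distortion remains visible in the results.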
Root Causes of Hallucinations
The survey analyzes the root causes of the two major types of hallucinations:
- Limited Capacity for Temporal Representation: Many models struggle to accurately capture the temporal dynamics of video content, leading to distortions.
- Insufficient Visual Grounding: A lack of robust visual context often results in fabricated content that does not align with the actual video input.
Future Directions
To address the challenges posed by hallucinations in Vid-LLMs, the survey proposes several promising directions for future research, including:
- Development of motion-aware visual encoders that improve temporal understanding.
- Integration of counterfactual learning techniques to enhance model reliability and grounding.
The survey aims to lay the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is available at https://github.com/hukcc/Awesome-Video-Hallucination.
