Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
arXiv:2604.12424v1
Abstract
Multimodal Large Language Models (MLLMs) frequently hallucinate at inference time, largely because language priors overshadow visual evidence. Existing training-free mitigation methods either perturb the visual input beyond the natural image distribution, which compromises the visual representation, or apply intrusive manipulations that undermine the model’s generative fluency. This article takes a different perspective: multimodal hallucinations arise from the hypersensitivity of visual grounding to textual phrasing during decoding.
Introduction
MLLMs integrate visual and textual data, which has opened new avenues for natural language processing and computer vision. However, hallucination, where the model generates information that is not present in the input data, remains a significant hurdle. Existing mitigation methods tend either to over-perturb the visual input or to introduce unnatural artifacts, both of which can degrade the model’s overall performance.
Proposed Framework: Decoding by Perturbation (DeP)
To address these challenges, we introduce Decoding by Perturbation (DeP), a training-free framework that mitigates prior-induced hallucinations through controlled textual interventions. The approach rests on the observation that hallucinations depend on how sensitive visual grounding is to the specific phrasing of the text input.
Key Features of DeP
- Dynamic Probing: DeP uses a dynamic probe that applies multi-level textual perturbations to elicit latent language priors without significantly altering the visual input.
- Attention Variance: DeP uses attention variance to enhance stable evidence regions and suppress noise in the feature space, which improves model reliability.
- Interpretable Prior Drift Direction: DeP constructs a prior drift direction from logit statistics and uses it to counteract probability biases that stem from textual co-occurrences.
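
The last two features can be illustrated with a minimal sketch. It is not the paper's implementation: the source gives no formulas, so the function names (`stability_weights`, `drift_corrected_logits`), the `exp(-variance / tau)` weighting, the mean-difference drift estimate, and the parameters `tau` and `alpha` are all assumptions chosen for illustration. The sketch assumes that attention maps and next-token logits have already been collected for the original prompt and for several perturbed phrasings of it.

```python
import math
from statistics import mean, pvariance


def stability_weights(attn_runs, tau=0.01):
    """Weight each visual token by how stable its attention is.

    attn_runs[k][i] is the attention on visual token i under the k-th
    prompt variant. Tokens whose attention barely changes across variants
    get a weight near 1; tokens with high variance get a smaller weight.
    The exp(-variance / tau) form is an assumption of this sketch.
    """
    per_token = zip(*attn_runs)
    return [math.exp(-pvariance(vals) / tau) for vals in per_token]


def drift_corrected_logits(base_logits, perturbed_logits, alpha=1.0):
    """Subtract an estimated prior drift from the base logits.

    The drift for each vocabulary entry is estimated (as an assumption)
    as the mean logit under the perturbed prompts minus the logit under
    the original prompt. Entries that rise under perturbation are treated
    as prior-driven and pushed down by alpha * drift.
    """
    mean_perturbed = [mean(col) for col in zip(*perturbed_logits)]
    drift = [m - b for m, b in zip(mean_perturbed, base_logits)]
    corrected = [b - alpha * d for b, d in zip(base_logits, drift)]
    return corrected, drift


# Toy values, invented for illustration only.
base = [2.0, 1.0, 0.5]                          # logits for 3 candidate tokens
perturbed = [[2.5, 0.8, 0.4], [2.7, 0.6, 0.6]]  # logits under 2 perturbed prompts
corrected, drift = drift_corrected_logits(base, perturbed)

attn = [[0.5, 0.3, 0.2], [0.5, 0.1, 0.4]]       # attention over 3 visual tokens
weights = stability_weights(attn)
```

In this toy setup the first candidate token gains logit mass under perturbation, so the correction lowers it relative to the others, and the first visual token, whose attention is identical across both variants, keeps the full weight. How DeP actually combines these two signals during decoding is not specified in the summary above.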
Experimental Results
DeP was evaluated in extensive experiments across multiple benchmarks. The results indicate that the framework significantly reduces hallucinations and improves the coherence and contextual relevance of generated outputs. It does so while maintaining generative fluency and mitigating the influence of erroneous textual biases.
Conclusion
In summary, Decoding by Perturbation offers a promising way to reduce hallucinations in MLLMs. By targeting the interplay between textual phrasing and visual grounding, it balances generative fluency against the accuracy of multimodal outputs. Future research may refine the perturbation techniques and explore their application to other multimodal tasks.
