URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection
Summary: arXiv:2604.06728v1
Abstract
Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning.
Introduction
To address these challenges, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. It treats both the textual and the visual input as potentially unreliable, so that an ambiguous caption or a weakly relevant image does not carry the same weight as reliable evidence.
Key Features of URMF
- Multi-Head Cross-Attention: URMF first employs multi-head cross-attention to inject visual evidence into textual representations, enhancing the interaction between different modalities.
- Incongruity-Aware Reasoning: The framework then applies multi-head self-attention in the fused semantic space to strengthen incongruity-aware reasoning.
- Aleatoric Uncertainty Modeling: URMF performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations. Each modality is parameterized as a learnable Gaussian posterior, whose variance serves as an estimate of the noise inherent in that input.
- Dynamic Modality Regulation: The estimated uncertainty is utilized to dynamically regulate modality contributions during fusion, effectively suppressing unreliable modalities and yielding a more robust joint representation.
- Joint Training Objective: A joint training objective combines task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning.
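The uncertainty modeling and dynamic regulation steps above can be illustrated with a small sketch. This is not the paper's implementation: the summary does not give the exact weighting rule or prior, so the softmax over negative mean log-variance and the standard normal prior used here are assumptions, and the function names (kl_to_standard_normal, sample, fusion_weights, fuse) and toy values are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions.

    Assumption: the modality prior is a standard normal.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample(mu, log_var, rng):
    """Reparameterized draw: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def fusion_weights(log_vars):
    """Softmax over negative mean log-variance per modality.

    Assumption: a lower predicted variance should yield a larger fusion weight.
    """
    scores = [-sum(lv) / len(lv) for lv in log_vars]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(mus, log_vars):
    """Weighted sum of the posterior means using the uncertainty-derived weights."""
    weights = fusion_weights(log_vars)
    dim = len(mus[0])
    fused = [sum(w * mu[i] for w, mu in zip(weights, mus)) for i in range(dim)]
    return fused, weights


# Toy Gaussian posteriors (mean, log-variance) for three representations.
text_mu, text_log_var = [1.0, 0.0], [-2.0, -2.0]    # confident text
image_mu, image_log_var = [-1.0, 4.0], [2.0, 2.0]   # noisy, weakly relevant image
inter_mu, inter_log_var = [0.5, 0.5], [-1.0, -1.0]  # interaction-aware latent

fused, weights = fuse(
    [text_mu, image_mu, inter_mu],
    [text_log_var, image_log_var, inter_log_var],
)
print("fusion weights (text, image, interaction):", weights)
print("fused representation:", fused)
print("KL of image posterior to prior:", kl_to_standard_normal(image_mu, image_log_var))
print("one sample from text posterior:", sample(text_mu, text_log_var, random.Random(0)))
```

In this toy setup the high-variance image posterior receives the smallest weight, so its mean contributes little to the fused vector. In a trained model the means and log-variances would come from learned encoder heads rather than fixed values.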
Experimental Results
Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, improving both accuracy and robustness.
Conclusion
URMF addresses modality reliability, which earlier MSD methods largely sidestep by assuming that all modalities are equally reliable. By regulating each modality's contribution according to its estimated uncertainty during fusion, it improves detection accuracy and stays robust to the ambiguous text and weakly relevant images common in real-world social media content.
