Good Scores, Bad Data: A Metric for Multimodal Coherence
arXiv:2603.25924v1
Abstract
Multimodal AI systems are increasingly evaluated based on their performance in downstream tasks, such as accuracy in Visual Question Answering (VQA). However, achieving high accuracy does not necessarily imply that the underlying data used by these models is coherent. In many cases, a model can perform well on VQA while still utilizing inputs that contradict one another. To address this issue, we introduce the Multimodal Coherence Score (MCS), a novel metric designed to evaluate the quality of data fusion independently of any downstream model performance.
Introducing the Multimodal Coherence Score (MCS)
The MCS breaks down coherence into four distinct dimensions:
- Identity: Ensures that entities in the input data maintain consistent representation throughout the fusion process.
- Spatial: Assesses the spatial relationships between elements within the data.
- Semantic: Evaluates the meaningfulness and relevance of the information presented.
- Decision: Analyzes how decisions are made based on the fused data.
The weights for these dimensions are learned with the Nelder-Mead method, a derivative-free simplex optimizer.
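As a rough illustration of how four dimension scores might be combined and weighted, the sketch below treats MCS as a weighted sum and fits the weights with a small pure-Python Nelder-Mead loop. The weighted-sum form, the softmax parameterization, the squared-error objective, and the toy samples are all assumptions made for this example; the source states only that the weights are learned with Nelder-Mead.

```python
import math

DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def mcs(scores, weights):
    """Combine per-dimension scores into one value (assumed: weighted sum)."""
    return sum(w * s for w, s in zip(weights, scores))


def to_weights(raw):
    """Softmax: map free parameters to non-negative weights that sum to 1."""
    peak = max(raw)
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def loss(raw, samples):
    """Hypothetical objective: mean squared error against a quality label."""
    weights = to_weights(raw)
    return sum((mcs(s, weights) - q) ** 2 for s, q in samples) / len(samples)


def nelder_mead(f, x0, step=0.5, iters=300):
    """Minimal Nelder-Mead: reflect, expand, contract, or shrink the simplex."""
    n = len(x0)
    simplex = [list(x0)] + [
        [x + (step if i == j else 0.0) for j, x in enumerate(x0)] for i in range(n)
    ]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_reflected = f(reflected)
        if f_reflected < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f_reflected else reflected
        elif f_reflected < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex halfway toward the best one
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


# Made-up data: quality labels are generated from invented "true" weights.
TRUE_WEIGHTS = [0.4, 0.3, 0.2, 0.1]
toy_scores = [
    (1.0, 1.0, 1.0, 1.0),
    (0.0, 1.0, 1.0, 1.0),
    (1.0, 0.0, 1.0, 1.0),
    (1.0, 1.0, 0.0, 1.0),
    (1.0, 1.0, 1.0, 0.0),
    (0.5, 0.5, 0.5, 0.5),
]
samples = [(s, mcs(s, TRUE_WEIGHTS)) for s in toy_scores]


def objective(raw):
    return loss(raw, samples)


start = [0.0, 0.0, 0.0, 0.0]  # softmax of zeros gives uniform weights
learned_raw = nelder_mead(objective, start)
learned = to_weights(learned_raw)
print(dict(zip(DIMENSIONS, (round(w, 3) for w in learned))))
```

Nelder-Mead needs only function evaluations, which suits objectives that are not differentiable, such as rank-based ones. In practice a library routine such as SciPy's `minimize` with `method="Nelder-Mead"` would replace the hand-written loop.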
Evaluation and Results
To validate the MCS, we evaluated it on 1,000 Visual Genome images using DETR, CLIP, and ViLT. We also validated it on 150 COCO images without any retraining, to check that the approach transfers across datasets.
MCS discriminated data quality with higher sensitivity than task accuracy: we observed a Spearman correlation coefficient of 0.093 for MCS versus 0.071 for task accuracy.
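For readers unfamiliar with the statistic, the sketch below shows one way a Spearman rank correlation between a score and a data-quality level could be computed. The quality levels and metric values are invented for illustration and are unrelated to the reported 0.093 and 0.071; the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented example: five data-quality levels and two candidate metrics.
quality = [0, 1, 2, 3, 4]
metric_a = [0.20, 0.35, 0.30, 0.60, 0.90]  # tracks quality closely
metric_b = [0.50, 0.50, 0.50, 0.50, 0.60]  # barely reacts to quality

rho_a = spearman(quality, metric_a)
rho_b = spearman(quality, metric_b)
print(round(rho_a, 3), round(rho_b, 3))
```

A metric whose ranking follows the quality ranking more closely gets the higher coefficient, which is the sense in which one score can be more sensitive to data quality than another.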
Perturbation Experiments
To further substantiate our findings, we ran perturbation experiments, which confirmed that each dimension of the MCS responds independently to its own failure mode. We observed zero cross-talk between the dimensions, so a drop in one dimension points to the specific kind of coherence failure.
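One way to picture a cross-talk check is sketched below: perturb one dimension at a time, compare the per-dimension scores against a baseline, and report any dimension that moved without being targeted. The score values, the tolerance, and the function names are hypothetical; the source does not describe how the perturbations were applied or measured.

```python
DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def score_deltas(baseline, perturbed):
    """Per-dimension change in score after a perturbation."""
    return {d: perturbed[d] - baseline[d] for d in DIMENSIONS}


def cross_talk(baseline, perturbed_by_target, tol=1e-9):
    """List (target, other_dimension, delta) for every off-target change."""
    leaks = []
    for target, perturbed in perturbed_by_target.items():
        for dim, delta in score_deltas(baseline, perturbed).items():
            if dim != target and abs(delta) > tol:
                leaks.append((target, dim, delta))
    return leaks


# Invented scores: every dimension starts at 0.9.
baseline = {d: 0.9 for d in DIMENSIONS}

# Clean case: each perturbation lowers only the dimension it targets.
clean = {d: {**baseline, d: 0.4} for d in DIMENSIONS}

# Leaky case: the spatial perturbation also moves the semantic score.
leaky = dict(clean)
leaky["spatial"] = {**baseline, "spatial": 0.4, "semantic": 0.7}

clean_leaks = cross_talk(baseline, clean)
leaky_leaks = cross_talk(baseline, leaky)
print(len(clean_leaks), len(leaky_leaks))
```

An empty result corresponds to the zero cross-talk case: every change in score is attributable to the dimension that was deliberately perturbed.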
Conclusion
The Multimodal Coherence Score (MCS) evaluates multimodal data fusion independently of downstream task performance. It is lightweight, requires no human annotation, and reports not only that coherence has failed but also which dimension failed. With MCS, researchers and practitioners can better understand and improve the quality of multimodal data, leading to more reliable AI systems.
