Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption
The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? A recent study posted to arXiv (arXiv:2603.26769v1) addresses this question by comparing how two VLMs behave under visual corruption.
The study contrasts Qwen2.5-VL-7B, a 7-billion-parameter VLM quantized to 4-bit NF4, with SmolVLM2-500M, a 500-million-parameter model run in FP16. Both models are evaluated on 4,000 samples drawn from VQAv2 and COCO Captions.
Methodology
The researchers used a three-category error taxonomy to diagnose the failure modes of these models. The categories are:
- Object Blindness: The model fails to recognize objects present in the visual input.
- Semantic Drift: The model fails to maintain the intended meaning of the input text in relation to the visual content.
- Prior Bias: The model’s responses are influenced by preconceived notions rather than the actual content.
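To make the taxonomy concrete, the sketch below shows one way to tally labelled errors into a per-model failure profile. It is a minimal illustration, not the study's pipeline: the names `FailureMode`, `failure_profile`, and `dominant_mode` are hypothetical, and the labels are assumed to come from whatever judge assigns one category to each error.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # The three categories of the taxonomy described above.
    OBJECT_BLINDNESS = "object_blindness"
    SEMANTIC_DRIFT = "semantic_drift"
    PRIOR_BIAS = "prior_bias"


def failure_profile(labels):
    """Return the share of each failure mode among the labelled errors."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {mode: 0.0 for mode in FailureMode}
    return {mode: counts.get(mode, 0) / total for mode in FailureMode}


def dominant_mode(labels):
    """Return the most frequent failure mode, or None if there are no labels."""
    if not labels:
        return None
    profile = failure_profile(labels)
    return max(profile, key=profile.get)


# Hypothetical judge output for four erroneous responses.
example_labels = [
    FailureMode.SEMANTIC_DRIFT,
    FailureMode.SEMANTIC_DRIFT,
    FailureMode.OBJECT_BLINDNESS,
    FailureMode.PRIOR_BIAS,
]
example_profile = failure_profile(example_labels)
```

Comparing such profiles per model and per dataset is what allows a statement like "one mode predominates" to be checked directly against the counts.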
Findings
Using a text-only GPT-4o judge, the study identified Semantic Drift as the predominant failure mode for Qwen on both VQAv2 and COCO Captions. SmolVLM2, in contrast, exhibited a mixed profile of Object Blindness and Semantic Drift on COCO. For both models, Prior Bias appeared on VQAv2 but was absent on COCO.
The researchers also assessed confidence calibration via Expected Calibration Error (ECE), using the geometric mean token probability as the confidence score. In addition, they probed compositional reasoning with structured negation probes across four templates and ran a blur robustness experiment.
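The calibration measurement can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's code: the confidence of a response is taken as the geometric mean of its token probabilities (the exponential of the mean token log-probability), and ECE uses equal-width bins. The choice of 10 bins and the function names are assumptions.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric mean of token probabilities, i.e. exp(mean log-probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

Using the geometric mean rather than the product of token probabilities keeps the score comparable across responses of different lengths, since a plain product shrinks with every additional token.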
Results
The compact model, SmolVLM2-500M, displayed a qualitatively distinct failure signature. Its negation collapse was 12.5 percentage points larger than that of Qwen2.5-VL-7B (-33.2pp vs. -20.8pp), a gap driven largely by COCO trials. On VQAv2 alone, the gap did not reach statistical significance (4.5pp, p=0.19).
The template that most clearly separated the two models was false_yn. SmolVLM2-500M responded “Yes” on 100% of COCO trials, incorrectly claiming that a depicted object was absent, whereas Qwen2.5-VL-7B did so on only 14% of the same trials.
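The percentage-point arithmetic can be sketched as below. The summary does not define negation collapse precisely, so the sketch assumes it is the change in accuracy when moving from baseline prompts to negated probes; the function names and the toy data are hypothetical.

```python
def accuracy_pct(correct_flags):
    """Accuracy in percent from a list of per-trial correctness flags."""
    if not correct_flags:
        raise ValueError("need at least one trial")
    return 100.0 * sum(correct_flags) / len(correct_flags)


def negation_collapse_pp(baseline_flags, negated_flags):
    """Accuracy change in percentage points (negative means a drop).

    Assumption: collapse = accuracy on negated probes minus baseline accuracy.
    """
    return accuracy_pct(negated_flags) - accuracy_pct(baseline_flags)


# Toy data: 80% baseline accuracy falling to 50% under negation.
toy_collapse = negation_collapse_pp(
    [True] * 8 + [False] * 2,
    [True] * 5 + [False] * 5,
)

# Gap between the two reported (rounded) collapses, in percentage points.
reported_gap = round(-33.2 - (-20.8), 1)
```

With the rounded figures, the gap comes out at -12.4pp; the 12.5pp quoted in the text presumably reflects unrounded values, which is an assumption rather than something the summary states.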
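A per-template breakdown like the one above can be computed from raw responses roughly as follows. This is a sketch under assumptions: responses are free text that begins with "yes" or "no", records are (template, response) pairs, and apart from false_yn the template name used here is made up for illustration.

```python
import re


def parse_yes_no(response):
    """Map a free-text answer to 'yes' or 'no'; return None if it starts with neither."""
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def yes_rate_by_template(records):
    """Return the percentage of 'yes' answers for each template."""
    totals, yes_counts = {}, {}
    for template, response in records:
        totals[template] = totals.get(template, 0) + 1
        if parse_yes_no(response) == "yes":
            yes_counts[template] = yes_counts.get(template, 0) + 1
    return {t: 100.0 * yes_counts.get(t, 0) / n for t, n in totals.items()}


# Hypothetical responses; "other_template" is a placeholder name.
toy_records = [
    ("false_yn", "Yes."),
    ("false_yn", "yes, it is"),
    ("false_yn", "No"),
    ("other_template", "Nope"),
]
toy_rates = yes_rate_by_template(toy_records)
```

Unparseable answers such as "Nope" count toward the total but not toward the "yes" tally, so a high rate cannot be produced by dropping ambiguous responses.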
Conclusion
This study highlights the importance of understanding the distinct failure modes of compressed vision-language models as they move toward edge deployment. The findings suggest that smaller models may not only fail more frequently but also fail in qualitatively different ways. The researchers have also released a fully reproducible pipeline to support systematic safety auditing of compressed VLMs before deployment in real-world applications.
