Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning
Summary: arXiv:2510.18034v2
Abstract: Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While vision-language models (VLMs) have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models – limiting reliability, reproducibility, and deployment feasibility.
To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT’s two-phase pipeline – structured scene description extraction and multi-modal evaluation – existing VLMs achieve significantly higher scores in detecting anomalous driving scenarios from input images.
Introduction to SAVANT
SAVANT turns VLM-based detection into a principled decomposition across four semantic domains. It replaces ad hoc prompting with semantic-aware reasoning, giving a structured way to analyze and verify the semantic consistency of driving scenarios. This improves detection capability and makes anomaly detection in autonomous systems more reliable.
Two-Phase Pipeline
The SAVANT framework operates through a comprehensive two-phase pipeline:
- Structured Scene Description Extraction: This phase involves the extraction of detailed and structured descriptions of scenes from input images, allowing for a clearer understanding of the context and elements present.
- Multi-modal Evaluation: Following the extraction, this phase evaluates the semantic consistency of the detected elements, facilitating the identification of anomalies and inconsistencies.
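The two phases above can be sketched as a pair of functions wrapped around any model callable. This is a minimal illustration under stated assumptions, not the authors' implementation: the domain names, the prompt wording, the `VLM` callable type, and the `stub_vlm` stand-in are all hypothetical, since the text only says that there are four semantic domains and two phases.

```python
from typing import Callable, Dict, Tuple

# Placeholder domain names: the text says there are four semantic domains
# but does not name them, so these are illustrative assumptions.
DOMAINS: Tuple[str, ...] = (
    "street", "infrastructure", "movable_objects", "environment",
)

# A "VLM" is modelled as any callable taking (prompt, image) and returning text.
VLM = Callable[[str, bytes], str]


def extract_scene_description(vlm: VLM, image: bytes) -> Dict[str, str]:
    """Phase 1: request one structured description per semantic domain."""
    return {
        domain: vlm(f"Describe the {domain} elements of this driving scene.", image)
        for domain in DOMAINS
    }


def evaluate_scene(vlm: VLM, image: bytes, description: Dict[str, str]) -> bool:
    """Phase 2: judge consistency using both the image and the extracted text."""
    summary = "\n".join(f"{domain}: {text}" for domain, text in description.items())
    prompt = (
        "Given this description, is the scene anomalous? Answer yes or no.\n"
        + summary
    )
    return vlm(prompt, image).strip().lower().startswith("yes")


def detect_anomaly(vlm: VLM, image: bytes) -> bool:
    """Run both phases in sequence and return the final verdict."""
    return evaluate_scene(vlm, image, extract_scene_description(vlm, image))


def stub_vlm(prompt: str, image: bytes) -> str:
    """Stand-in model so the sketch runs without a real VLM (hypothetical)."""
    if "is the scene anomalous" in prompt:
        return "Yes" if "couch" in prompt else "No"
    if "street" in prompt and image == b"odd":
        return "a couch on the lane"
    return "nothing unusual"


print(detect_anomaly(stub_vlm, b"odd"))
print(detect_anomaly(stub_vlm, b"normal"))
```

Because the model is passed in as a plain callable, the same two functions could in principle wrap a proprietary API or a local open-source model, which is one way to read the "model-agnostic" claim.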
Results and Improvements
SAVANT yields significant improvements over unstructured prompting. Across a balanced set of real-world driving scenarios, applying SAVANT raises the recall of VLMs by approximately 18.5% in absolute terms compared to prompting baselines. This gain makes anomaly detection more effective and opens the way to reliable large-scale annotation.
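An absolute recall gain is a difference between two recall values, not a ratio, and the two are easy to confuse. The short sketch below shows the distinction. The counts are invented purely so that the absolute gain comes out at 18.5 points; they are not results from the paper.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual anomalies that were detected."""
    return true_positives / (true_positives + false_negatives)


# Invented counts for illustration only (not results from the paper):
# 200 anomalous scenes, 130 found by a baseline, 167 found with the framework.
baseline = recall(130, 70)      # 0.65
structured = recall(167, 33)    # 0.835

absolute_gain = structured - baseline       # difference between the two recalls
relative_gain = absolute_gain / baseline    # gain as a fraction of the baseline

print(f"absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"relative gain: {relative_gain * 100:.1f}%")
```

With these made-up counts the same change reads as 18.5 percentage points in absolute terms but a noticeably larger figure in relative terms, which is why the qualifier "absolute" matters.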
Large-Scale Annotation and Model Fine-Tuning
Using the best proprietary model within the SAVANT framework, we automatically labeled around 10,000 real-world images with high confidence. This dataset was then used to fine-tune a 7B open-source model (Qwen2.5-VL) for single-shot anomaly detection. The fine-tuned model achieved 90.8% recall and 93.8% accuracy, surpassing all models evaluated while enabling local deployment at near-zero cost.
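One way to picture the step from automatic labels to a fine-tuning set is a confidence filter followed by serialization into training records. The text does not say how "high confidence" was determined or what the training format looked like, so the `AutoLabel` fields, the 0.9 threshold, and the JSONL record schema below are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AutoLabel:
    image_path: str
    is_anomaly: bool
    confidence: float  # hypothetical score in [0, 1]


def select_confident(labels: Iterable[AutoLabel], threshold: float = 0.9) -> List[AutoLabel]:
    """Keep only labels whose confidence meets the (assumed) threshold."""
    return [label for label in labels if label.confidence >= threshold]


def to_jsonl(labels: Iterable[AutoLabel]) -> str:
    """Serialize labels as single-shot question/answer records, one per line."""
    records = (
        {
            "image": label.image_path,
            "prompt": "Is this driving scene anomalous?",
            "answer": "yes" if label.is_anomaly else "no",
        }
        for label in labels
    )
    return "\n".join(json.dumps(record) for record in records)


# Made-up examples; the file names are placeholders.
labels = [
    AutoLabel("scene_001.jpg", True, 0.97),
    AutoLabel("scene_002.jpg", False, 0.55),
    AutoLabel("scene_003.jpg", False, 0.93),
]
training_set = to_jsonl(select_confident(labels))
print(training_set)
```

Each record pairs an image with a single question and a yes/no answer, which matches the single-shot setting described above: the fine-tuned model answers directly instead of running the two-phase pipeline at inference time.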
Conclusion
The SAVANT framework addresses the data scarcity problem in semantic anomaly detection for autonomous systems. By coupling structured semantic reasoning with scalable data curation, it offers a practical and efficient way to improve the reliability of anomaly detection in autonomous driving.
Supplementary material: https://SAV4N7.github.io
