TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios
arXiv:2603.29759v1
Vision-language models (VLMs) are increasingly applied to safety hazard assessment, particularly in indoor environments. Existing benchmarks for this task, however, have several limitations that undermine their practical value. This article introduces TSHA, a new benchmark designed to address these limitations and make VLM-based safety hazard assessment more reliable.
Limitations of Existing Benchmarks
Current benchmarks face three primary challenges:
- Reliance on Synthetic Datasets: Many benchmarks depend heavily on synthetic datasets generated through simulation software. This creates a substantial domain gap between simulated environments and real-world scenarios, so performance measured in simulation may not carry over to real scenes.
- Oversimplified Safety Tasks: Existing benchmarks often present safety tasks that are overly simplified, imposing artificial constraints on hazard types and scene configurations. This limits the generalization capabilities of models, rendering them less effective in diverse real-world situations.
- Lack of Rigorous Evaluation Protocols: There is a notable absence of stringent evaluation protocols to thoroughly assess the capabilities of VLMs in complex home safety contexts. This gap makes it challenging to gauge the true effectiveness of these models in practical applications.
Introducing TSHA
To overcome these challenges, we present TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark consisting of 81,809 meticulously curated training samples. These samples are sourced from four complementary origins:
- Existing indoor datasets that provide a foundational understanding of indoor safety.
- Internet images that capture a wide variety of real-world scenarios.
- AIGC (AI-generated content) images that simulate complex safety environments.
- Newly captured images that reflect current safety conditions and hazards.
In addition to the training set, TSHA includes a rigorously designed test set of 1,707 samples. The test set combines a carefully selected subset from the training distribution with newly added videos and panoramic images that each show multiple safety hazards. The design is meant to test model robustness in complex, multi-hazard scenes.
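To make the composition described above concrete, the following is a minimal sketch of how a sample record might be represented and tallied. The text does not describe TSHA's actual file format or schema, so every class, field, and value below is a hypothetical illustration and the toy records are invented.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Source(Enum):
    # The four sample origins named in the text.
    EXISTING_DATASET = "existing_indoor_dataset"
    INTERNET = "internet"
    AIGC = "aigc"
    NEWLY_CAPTURED = "newly_captured"


class Modality(Enum):
    # Still images, plus the videos and panoramas added to the test set.
    IMAGE = "image"
    VIDEO = "video"
    PANORAMA = "panorama"


@dataclass(frozen=True)
class Sample:
    # Hypothetical schema: the real TSHA fields are not given in the text.
    sample_id: str
    source: Source
    modality: Modality
    hazards: Tuple[str, ...]


def composition(samples: Iterable[Sample]) -> Counter:
    """Count samples per (source, modality) pair."""
    return Counter((s.source, s.modality) for s in samples)


if __name__ == "__main__":
    # Invented toy records, not taken from the benchmark.
    toy = [
        Sample("s1", Source.INTERNET, Modality.IMAGE, ("exposed wiring",)),
        Sample("s2", Source.AIGC, Modality.IMAGE, ("blocked exit",)),
        Sample("s3", Source.NEWLY_CAPTURED, Modality.PANORAMA,
               ("wet floor", "unstable shelf")),
    ]
    for (source, modality), count in composition(toy).items():
        print(source.value, modality.value, count)
```

A tally like this is one simple way to check that a split covers every source and modality before running an evaluation.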
Experimental Validation
Extensive experiments on 23 popular VLMs show that current models are inadequate for safety hazard assessment. Models trained on the TSHA training set improve notably, gaining up to 18.3 points on the TSHA test set.
The TSHA-trained models also generalize better to other benchmarks, indicating that the gains are not confined to the TSHA test set.
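The text reports the gain in "points" without naming the underlying metric. The sketch below assumes the score is accuracy expressed as a percentage, so a gain in points is a plain difference between two percentages. That assumption, the function names, and the toy labels are illustrative only and are not taken from the paper.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Share of exact matches between predictions and labels, in percent."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def point_gain(base_score: float, tuned_score: float) -> float:
    """Absolute difference in percentage points (not a relative change)."""
    return tuned_score - base_score


if __name__ == "__main__":
    # Invented toy answers: the scoring rule is an assumption, not TSHA's protocol.
    labels = ["hazard", "safe", "hazard", "hazard"]
    base = ["safe", "safe", "hazard", "safe"]
    tuned = ["hazard", "safe", "hazard", "safe"]
    gain = point_gain(accuracy_percent(base, labels),
                      accuracy_percent(tuned, labels))
    print(f"gain: {gain:.1f} points")
```

Reporting absolute points rather than a relative percentage keeps gains comparable across models that start from different baselines.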
Conclusion
TSHA represents a significant step forward in the development of robust visual language models for safety hazard assessment. By addressing the limitations of existing benchmarks and providing a comprehensive evaluation framework, TSHA aims to foster advancements in the reliability and effectiveness of VLMs in real-world safety applications.
