SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
arXiv:2604.19638v1
Abstract: Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards.
Deployment of Multimodal Large Language Models (MLLMs) has grown in recent years, particularly as autonomous agents that interact with users in complex environments. A major concern remains: whether these systems can recognize and mitigate safety hazards. To address this, we present SafetyALFRED, a benchmark that adds safety evaluation to the existing ALFRED embodied agent benchmark and assesses the safety-conscious planning of MLLMs.
Key Features of SafetyALFRED
- Integration of Real-World Hazards: SafetyALFRED is designed with six distinct categories of real-world kitchen hazards, enhancing the relevance and applicability of the safety assessments.
- Evaluation Beyond Recognition: Unlike traditional assessments that focus solely on hazard recognition through disembodied question answering (QA), SafetyALFRED evaluates models on their ability to actively mitigate risks through embodied planning.
- Comprehensive Model Testing: The framework includes rigorous testing of eleven state-of-the-art models from the Qwen, Gemma, and Gemini families, providing insights into their safety capabilities.
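The difference between the two evaluation settings described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SafetyALFRED implementation: the `HazardEpisode` class, its fields, the action strings, and the stove scenario are all hypothetical and are not taken from the benchmark or its six hazard categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardEpisode:
    """One hypothetical evaluation item; every field name is an assumption."""
    question: str          # disembodied QA prompt about the scene
    hazard_present: bool   # ground-truth answer for the QA setting
    mitigation_step: str   # action that would neutralize the hazard


def recognizes_hazard(episode: HazardEpisode, answer: str) -> bool:
    """QA setting: correct if the yes/no answer matches the ground truth."""
    said_yes = answer.strip().lower().startswith("yes")
    return said_yes == episode.hazard_present


def mitigates_hazard(episode: HazardEpisode, plan: list[str]) -> bool:
    """Embodied setting: correct only if the plan contains the mitigation step."""
    return episode.mitigation_step in plan


# Invented example: the model names the hazard but plans the task without fixing it.
episode = HazardEpisode(
    question="Is there a safety hazard in this kitchen scene?",
    hazard_present=True,
    mitigation_step="turn_off stove",
)
model_answer = "Yes, the stove was left on."
model_plan = ["goto counter", "pickup mug", "goto sink", "put mug sink"]

qa_correct = recognizes_hazard(episode, model_answer)
plan_safe = mitigates_hazard(episode, model_plan)
print(f"recognized={qa_correct}, mitigated={plan_safe}")
```

In this toy case the model passes the QA check and fails the planning check, which is the kind of mismatch an embodied evaluation is meant to surface.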
Findings and Implications
Our experiments reveal a significant alignment gap between hazard recognition and risk mitigation. The models recognize hazards accurately in QA settings, yet their average success rates at mitigating those same hazards are low. This discrepancy exposes a weakness in current evaluation paradigms: static assessments do not capture the dynamic nature of physical safety.
These findings call for a shift in how safety is evaluated, with the research community prioritizing benchmarks that emphasize corrective actions in embodied contexts. Future models must not only identify safety hazards but also act to mitigate them in real-time environments.
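One way to quantify such a gap is to compare the recognition accuracy with the mitigation success rate over the same set of episodes. The sketch below assumes a hypothetical per-episode record with boolean `recognized` and `mitigated` fields; the four records are invented for illustration and are not results from the paper.

```python
def rate(flags) -> float:
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def recognition_mitigation_gap(results):
    """Return (recognition accuracy, mitigation success rate, difference)."""
    recognition = rate(r["recognized"] for r in results)
    mitigation = rate(r["mitigated"] for r in results)
    return recognition, mitigation, recognition - mitigation


# Made-up records: one per episode, same episodes in both settings.
toy_results = [
    {"recognized": True, "mitigated": True},
    {"recognized": True, "mitigated": False},
    {"recognized": True, "mitigated": False},
    {"recognized": False, "mitigated": False},
]

recognition, mitigation, gap = recognition_mitigation_gap(toy_results)
print(f"recognition={recognition:.2f} mitigation={mitigation:.2f} gap={gap:.2f}")
```

Pairing the two measurements per episode, rather than averaging over separate sets, keeps the comparison about the same hazards in both settings.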
Open-Source Contribution
To support further research in this area, we are open-sourcing our code and dataset at https://github.com/sled-group/SafetyALFRED.git. We encourage the community to use this resource to improve the safety evaluation of multimodal large language models and to help build safer autonomous agents.
As MLLMs are integrated into more domains, ensuring their safety and reliability is paramount. SafetyALFRED is a step toward that goal, supporting safer and more responsible deployment of AI in interactive environments.
