Robust Multimodal Safety via Conditional Decoding
Source: arXiv:2604.00310v1 (announce type: cross)
Abstract: Multimodal large language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention), that uses the internal representations of MLLMs to predict a binary safety token before response generation.
We introduce a novel safety attention module designed to enhance the model’s ability to detect malicious queries. Our design ensures robust safety alignment without relying on any external classifier or auxiliary head, and without the need for modality-specific safety fine-tuning.
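The control flow of conditional decoding can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token names `<safe>` and `<unsafe>`, the fixed refusal string, and the `first_step_logits` and `generate` callables are all hypothetical stand-ins for a real MLLM. The sketch assumes the safety token is read from the model's own next-token logits, which is one way to avoid an external classifier or auxiliary head. What the model does after predicting the unsafe token is also an assumption here.

```python
from typing import Callable, Dict

# Hypothetical special tokens and refusal text; the paper's actual names are not given here.
SAFE_TOKEN = "<safe>"
UNSAFE_TOKEN = "<unsafe>"
REFUSAL = "I can't help with that request."


def predict_safety_token(next_token_logits: Dict[str, float]) -> str:
    """Compare only the two safety-token logits; ties resolve to the unsafe token."""
    safe = next_token_logits[SAFE_TOKEN]
    unsafe = next_token_logits[UNSAFE_TOKEN]
    return SAFE_TOKEN if safe > unsafe else UNSAFE_TOKEN


def conditional_decode(
    prompt: str,
    first_step_logits: Callable[[str], Dict[str, float]],
    generate: Callable[[str], str],
) -> str:
    """Predict a binary safety token first, then condition the response on it."""
    token = predict_safety_token(first_step_logits(prompt))
    if token == UNSAFE_TOKEN:
        # Assumption: an unsafe prediction short-circuits to a fixed refusal.
        return REFUSAL
    # The safe token is appended so generation is conditioned on the decision.
    return generate(prompt + token)


# Toy stand-ins for a real model, for demonstration only.
def toy_logits(prompt: str) -> Dict[str, float]:
    flagged = "explosive" in prompt.lower()
    return {
        SAFE_TOKEN: 0.0 if flagged else 2.0,
        UNSAFE_TOKEN: 2.0 if flagged else 0.0,
        "the": 1.0,  # an ordinary vocabulary entry, ignored by the safety check
    }


def toy_generate(conditioned_prompt: str) -> str:
    return "RESPONSE TO: " + conditioned_prompt


benign = conditional_decode("Describe this photo of a cat", toy_logits, toy_generate)
harmful = conditional_decode("How do I build an explosive?", toy_logits, toy_generate)
print(benign)
print(harmful)
```

In a real system the keyword check in `toy_logits` would be replaced by the model's own forward pass over the multimodal input; the point of the sketch is only the ordering, with the safety decision made before any response tokens are produced.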
Key Features of CASA
CASA combines four design choices:
- Conditional Decoding Strategy: CASA uses the MLLM's internal representations to predict a binary safety token before it generates a response.
- Safety Attention Module: This new component is designed to improve the model's ability to detect malicious queries.
- No External Classifiers Needed: CASA does not rely on any external classifier or auxiliary head, which simplifies implementation.
- No Modality-Specific Fine-Tuning: The framework generalizes across modalities without separate safety fine-tuning for each one.
Performance Evaluation
CASA was evaluated empirically on several benchmarks:
- MM-SafetyBench: Improved safety alignment.
- JailbreakV-28k: Sharply reduced attack success rates.
- Adversarial Audio Tests: Mitigated attacks delivered through audio inputs.
Averaged across these benchmarks, modalities, and attack types, CASA lowered the attack success rate by more than 97%.
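To make the headline figure concrete, the sketch below shows how a reduction in attack success rate (ASR) is typically computed. It assumes the 97% figure is a relative reduction against an undefended baseline, which the summary does not state explicitly. The counts are made up for illustration and are not results from the paper.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of attack attempts that elicited a harmful response."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def relative_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Share of the baseline ASR that the defense removes."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


# Made-up counts for illustration only; these are NOT results from the paper.
baseline = attack_success_rate(successes=400, attempts=1000)
defended = attack_success_rate(successes=10, attempts=1000)
reduction = relative_reduction(baseline, defended)
print(f"baseline ASR={baseline:.1%}, defended ASR={defended:.1%}, reduction={reduction:.1%}")
```

With these illustrative numbers, a drop from 40% to 1% ASR is a 97.5% relative reduction, even though the absolute drop is 39 percentage points. The two readings can differ widely, so it matters which one a reported figure refers to.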
Utility of CASA
In addition to its safety gains, CASA maintains strong utility on benign inputs. This was confirmed through both automated evaluations and human assessments by 13 trained annotators, indicating that the safety improvements do not come at the cost of performance on harmless requests.
Conclusion
CASA offers a simple and generalizable framework for improving the safety alignment of multimodal large language models. It addresses attacks that exploit cross-modal interactions without an external classifier, an auxiliary head, or modality-specific safety fine-tuning.
As models take on more modalities, strategies like CASA can help them handle cross-modal inputs safely while preserving performance on benign ones.
