Relationship-Aware Safety Unlearning for Multimodal LLMs
arXiv:2603.14185v3
Abstract
Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations.
Introduction
Generative multimodal models are increasingly used in sensitive applications, which makes their safety behavior a practical concern. One class of failure is easy to overlook: content in which every individual concept is benign, but the combination, through a specific action or situation, is harmful.
Understanding Safety Failures
These failures are relational. The concepts “child” and “wine” are each harmless on their own; the content becomes unsafe only when a relation such as “drinking” links them, as in child-drinking-wine. An unlearning method therefore needs to target the association itself rather than either concept.
Challenges of Existing Approaches
Existing unlearning and concept-erasure methods mostly target isolated concepts or specific image-text pairs. They can reduce some risks, but they often cause collateral damage: suppressing a concept also removes benign uses of the same objects and relations. Erasing “wine” to block child-drinking-wine, for example, would also affect benign content such as an adult drinking wine. Concept-level methods are therefore a poor fit for relational safety failures.
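The difference between the two granularities can be illustrated with a toy, purely symbolic sketch. It does not touch model weights and is not the method described here; the `Triple` type, the prompt list, and both blocking functions are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A hypothetical object-relation-object description of a prompt or image."""
    subject: str
    relation: str
    obj: str


# The single unsafe association from the running example.
UNSAFE = {Triple("child", "drinking", "wine")}

# A small made-up probe set: one unsafe triple and three benign neighbours.
PROMPTS = [
    Triple("child", "drinking", "wine"),
    Triple("child", "drinking", "milk"),
    Triple("adult", "drinking", "wine"),
    Triple("child", "holding", "balloon"),
]


def concept_level_blocked(t, erased_concepts):
    """Concept erasure: block anything that mentions an erased concept."""
    return t.subject in erased_concepts or t.obj in erased_concepts


def tuple_level_blocked(t, unsafe):
    """Relational blocking: block only the exact unsafe association."""
    return t in unsafe


def collateral(blocked_fn):
    """Benign triples that a blocking rule suppresses by mistake."""
    return [t for t in PROMPTS if t not in UNSAFE and blocked_fn(t)]


concept_collateral = collateral(lambda t: concept_level_blocked(t, {"wine"}))
tuple_collateral = collateral(lambda t: tuple_level_blocked(t, UNSAFE))

print("concept-level collateral:", concept_collateral)
print("tuple-level collateral:", tuple_collateral)
```

In this toy setting, erasing the concept “wine” also blocks the benign adult-drinking-wine triple, while blocking at the tuple level leaves every benign probe untouched.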
Introducing Relationship-Aware Safety Unlearning
To address these challenges, we propose relationship-aware safety unlearning. The framework explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted edits that suppress those tuples while leaving the constituent objects and relations usable in benign contexts.
- O-R-O Tuple Representation: Unsafe content is described as explicit object-relation-object tuples, so the framework can isolate a problematic association without affecting benign uses of the same objects or relations.
- Parameter-Efficient Edits: Techniques such as Low-Rank Adaptation (LoRA) apply small, targeted edits that suppress the unsafe tuples instead of retraining the full model.
- Preservation of Object Marginals: The edits are designed to leave the marginal distributions of the individual objects intact, so each object on its own is treated as before and safety is not bought at the cost of utility.
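The second and third components can be sketched in miniature. The block below shows the standard LoRA update, W + (alpha / r) · B · A with B initialised to zero, on a tiny pure-Python matrix, together with a toy objective that combines a forget term with a retain term. The loss form, the function names (`apply_lora`, `unlearning_loss`), the `retain_weight` parameter, and all numeric values are assumptions made for illustration, not the objective or hyperparameters of the framework.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(W, B, A, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank (rows of A)."""
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def unlearning_loss(unsafe_scores, benign_scores, benign_reference, retain_weight=1.0):
    """Toy objective (an assumption): lower unsafe-tuple scores, keep benign scores fixed."""
    # Forget term: mean score the edited model still assigns to unsafe tuples.
    forget = sum(unsafe_scores) / len(unsafe_scores)
    # Retain term: mean squared drift of benign scores from the unedited model.
    retain = sum(
        (s - r) ** 2 for s, r in zip(benign_scores, benign_reference)
    ) / len(benign_scores)
    return forget + retain_weight * retain


# Frozen 2x2 base weight and a rank-1 adapter (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.3, -0.2]]
B_init = [[0.0], [0.0]]     # zero-initialised B: the adapter starts as a no-op
B_trained = [[1.0], [2.0]]  # made-up values standing in for a trained adapter

W_start = apply_lora(W, B_init, A, alpha=2.0)
W_edited = apply_lora(W, B_trained, A, alpha=2.0)

# Made-up similarity scores for two unsafe and two benign tuples.
loss_no_drift = unlearning_loss([0.8, 0.6], [0.5, 0.7], [0.5, 0.7])
loss_with_drift = unlearning_loss([0.8, 0.6], [0.4, 0.7], [0.5, 0.7])
```

Only B and A would be trained while W stays frozen, which is what makes the edit parameter-efficient: a rank-r adapter on a d × k weight has r · (d + k) trainable values instead of d · k. The retain term is one simple way to penalise drift on benign tuples; how the framework actually preserves object marginals is not specified in this summary.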
Experimental Validation
We evaluated the approach in a series of CLIP-based experiments, which showed that it mitigates relational safety failures. We also ran robustness evaluations under three kinds of attack: paraphrased prompts, contextual variations, and out-of-distribution images.
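A robustness evaluation of this kind can be organised as a per-attack suppression rate. The sketch below is a hypothetical harness: `suppression_rate_by_attack`, the probe identifiers, the threshold, and the lookup table standing in for a CLIP image-text similarity score are all assumptions, and the numbers are invented for illustration rather than taken from the experiments.

```python
from collections import defaultdict


def suppression_rate_by_attack(cases, score_fn, threshold):
    """Fraction of probes per attack type whose unsafe-tuple score falls below threshold.

    cases: iterable of (attack_type, probe) pairs.
    score_fn: callable mapping a probe to a similarity-like score.
    """
    totals = defaultdict(int)
    suppressed = defaultdict(int)
    for attack, probe in cases:
        totals[attack] += 1
        if score_fn(probe) < threshold:
            suppressed[attack] += 1
    return {attack: suppressed[attack] / totals[attack] for attack in totals}


# Invented scores standing in for the edited model's similarity to the unsafe tuple.
TOY_SCORES = {
    "paraphrase_1": 0.10,
    "paraphrase_2": 0.40,
    "context_1": 0.20,
    "context_2": 0.25,
    "ood_image_1": 0.50,
}

CASES = [
    ("paraphrase", "paraphrase_1"),
    ("paraphrase", "paraphrase_2"),
    ("contextual", "context_1"),
    ("contextual", "context_2"),
    ("ood_image", "ood_image_1"),
]

rates = suppression_rate_by_attack(CASES, TOY_SCORES.__getitem__, threshold=0.3)
print(rates)
```

Reporting the rate separately for each attack type makes it visible when an edit holds up under rephrasing but fails on unfamiliar images, which a single aggregate number would hide.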
Conclusion
Relationship-aware safety unlearning treats safety failures as relational rather than concept-level. By suppressing unsafe O-R-O tuples instead of erasing whole concepts, the framework improves safety while preserving benign uses of the same objects and relations. As generative multimodal models see wider use, targeted approaches of this kind will be important for building trust in them.
