DACO: Enhancing Safety in Multimodal Large Language Models

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Source: arXiv:2604.08846v1 (cross-listed)

Abstract

Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution.
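To make the idea of inference-time steering concrete, here is a minimal NumPy sketch. It is not taken from the paper: it assumes a single known concept direction and simply projects that direction out of each hidden state while the model weights stay untouched. The function name `steer`, the toy shapes, and the random "unsafe" direction are all illustrative assumptions.

```python
import numpy as np

def steer(hidden, direction, strength=1.0):
    """Dampen the component of each hidden state along a concept direction.

    strength=1.0 removes the component entirely; values between 0 and 1
    only reduce it. The model weights are never modified.
    """
    unit = direction / np.linalg.norm(direction)
    coeff = hidden @ unit                      # per-token projection onto the direction
    return hidden - strength * np.outer(coeff, unit)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))               # toy sizes: 4 tokens, 8-dim hidden states
unsafe_direction = rng.normal(size=8)          # hypothetical concept direction
steered = steer(hidden, unsafe_direction)
```

A single direction like this can only control one concept at a time, which is the limitation the dictionary-based approach described next is meant to address.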

However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations.

Innovative Framework: DACO

The DACO framework operates in three key phases:

Curating a Concept Dictionary: We curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. The resulting dataset is named DACO-400K.
Intervening on Activations: The curated dictionary is used to intervene on activations via sparse coding, so that individual concepts can be adjusted in a targeted way to improve safety.
Training a Dictionary-Initialized SAE: We propose a new steering approach that uses the dictionary to initialize the training of an SAE and to automatically annotate the semantics of the SAE atoms, which are then used to safeguard MLLMs.
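The three phases above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the paper's implementation: concept directions are taken as the mean activation of each concept's stimuli (the source only says "summarizing"), sparse coding is done with a plain ISTA lasso solver, and the SAE encoder is initialized as the transposed dictionary. All function names, the concept names, and the sizes are hypothetical.

```python
import numpy as np

def build_dictionary(stimuli_acts):
    """Phase 1: one unit-norm direction per concept.
    Averaging the stimulus activations is an assumed summarisation."""
    names = list(stimuli_acts)
    rows = [stimuli_acts[n].mean(axis=0) for n in names]
    D = np.stack([r / np.linalg.norm(r) for r in rows])
    return names, D                               # D has shape (n_concepts, d)

def sparse_code(h, D, lam=0.1, steps=300):
    """ISTA for min_c 0.5 * ||h - c @ D||^2 + lam * ||c||_1 (assumed objective)."""
    lipschitz = np.linalg.norm(D @ D.T, 2)        # largest eigenvalue of D D^T
    c = np.zeros(D.shape[0])
    for _ in range(steps):
        z = c - ((c @ D - h) @ D.T) / lipschitz   # gradient step on the squared error
        c = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)  # soft threshold
    return c

def intervene(h, D, names, suppress, lam=0.1):
    """Phase 2: zero the coefficients of the named concepts and rebuild the
    activation, keeping the part of h the dictionary does not explain."""
    c = sparse_code(h, D, lam)
    residual = h - c @ D
    edited = c.copy()
    for name in suppress:
        edited[names.index(name)] = 0.0
    return edited @ D + residual, c

def init_sae(D):
    """Phase 3: start SAE training from the dictionary, so atom i carries the
    label names[i]. Tying the encoder to D.T is an assumption."""
    return {"W_enc": D.T.copy(), "b_enc": np.zeros(D.shape[0]), "W_dec": D.copy()}

def sae_encode(params, h):
    """ReLU encoder of the (untrained) SAE."""
    return np.maximum(h @ params["W_enc"] + params["b_enc"], 0.0)

# Toy stimuli: 3 hypothetical concepts, 50 noisy activations each, 16 dimensions.
rng = np.random.default_rng(0)
d = 16
true_dirs = {"weapon": np.eye(d)[0], "kitchen": np.eye(d)[1], "beach": np.eye(d)[2]}
stimuli_acts = {k: v + 0.1 * rng.normal(size=(50, d)) for k, v in true_dirs.items()}
names, D = build_dictionary(stimuli_acts)

# An activation mixing an unsafe and a benign concept; suppress only the unsafe one.
h = 2.0 * true_dirs["weapon"] + 1.0 * true_dirs["kitchen"]
steered, codes = intervene(h, D, names, suppress=["weapon"])
sae = init_sae(D)
```

Because each coefficient is tied to a named concept, the edit touches only the "weapon" component and leaves the "kitchen" component essentially unchanged, which is the kind of granular control the framework aims for. In a real setting the activations would come from a chosen layer of the MLLM, and the SAE would then be trained from this initialization rather than used as-is.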

Experimental Validation

We evaluate DACO on multiple MLLMs, including QwenVL, LLaVA, and InternVL, across safety benchmarks such as MM-SafetyBench and JailBreakV. The results show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities. This matters because the malicious patterns targeting AI models keep growing in complexity and sophistication.

Impact and Future Directions

The implications of DACO extend beyond immediate safety improvements. By providing granular control over model activations, it opens new avenues for research in AI safety and ethics. Future work may explore:

Expanding the concept dictionary to include more nuanced and diverse multimodal concepts.
Investigating the long-term effects of DACO on model performance and safety.
Developing user-friendly interfaces for practitioners to easily implement DACO in their own MLLM applications.

Conclusion

The Dictionary-Aligned Concept Control (DACO) framework is a significant step forward in safeguarding Multimodal Large Language Models. By combining a curated concept dictionary with sparse coding and a dictionary-initialized SAE, DACO improves safety while preserving the general-purpose capabilities of MLLMs, paving the way for more resilient and reliable AI systems.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
