DACO: Enhancing Safety in Multimodal Large Language Models


Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Source: arXiv:2604.08846v1 (announce type: cross)

Abstract

Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective solution.

However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that utilizes a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations.
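To make the idea of inference-time steering concrete, here is a minimal sketch that removes the component of a hidden state along one concept direction while leaving orthogonal components untouched. The function name `steer`, the `strength` parameter, and the toy 2-D vectors are assumptions made for illustration; this is a generic projection-based example, not DACO's actual procedure.

```python
import numpy as np

def steer(hidden, direction, strength=1.0):
    """Subtract `strength` times each row's component along `direction`."""
    unit = direction / np.linalg.norm(direction)
    coeff = hidden @ unit                      # projection of each token onto the direction
    return hidden - strength * np.outer(coeff, unit)

# Toy activations: two tokens in a 2-D hidden space (hypothetical values).
hidden = np.array([[3.0, 4.0],
                   [1.0, 0.0]])
unsafe_direction = np.array([0.0, 2.0])        # hypothetical "unsafe" concept axis

steered = steer(hidden, unsafe_direction)
print(steered)
```

With `strength=1.0` the second coordinate is zeroed and the first is unchanged. Real concept directions are rarely orthogonal to each other, which is one reason adjusting a single concept without side effects is hard.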

Innovative Framework: DACO

The DACO framework operates in three key phases:

  • Curating a Concept Dictionary: We curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions. This dataset is named DACO-400K.
  • Intervening on Activations: The curated dictionary can be used to intervene on activations via sparse coding, which decomposes an activation over the concept directions so that individual safety-related concepts can be adjusted.
  • SAE-Based Steering: We propose a new steering approach that uses our dictionary to initialize the training of an SAE and to automatically annotate the semantics of the SAE's atoms, which are then used to safeguard MLLMs.
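The three phases above can be sketched on toy data as follows. Everything here is an assumption for illustration: the mean-based summary in `concept_direction`, the greedy matching-pursuit solver in `sparse_code`, the tied encoder initialization, and all names and values are hypothetical, and SAE training itself is omitted.

```python
import numpy as np

def concept_direction(stimulus_activations):
    """Summarize one concept's stimulus activations as a unit-norm mean (assumed rule)."""
    mean = np.asarray(stimulus_activations, dtype=float).mean(axis=0)
    return mean / np.linalg.norm(mean)

def sparse_code(h, dictionary, k=2):
    """Greedy orthogonal matching pursuit over dictionary rows, using at most k atoms."""
    coeffs = np.zeros(len(dictionary))
    used = np.zeros(len(dictionary), dtype=bool)
    residual = h.astype(float)
    for _ in range(k):
        scores = np.abs(dictionary @ residual)
        scores[used] = 0.0                       # never pick the same atom twice
        best = int(np.argmax(scores))
        if scores[best] < 1e-8:                  # nothing left to explain
            break
        used[best] = True
        atoms = dictionary[used].T               # shape (dim, n_selected)
        solution, *_ = np.linalg.lstsq(atoms, h, rcond=None)
        coeffs[:] = 0.0
        coeffs[used] = solution
        residual = h - atoms @ solution
    return coeffs

def suppress_concepts(h, dictionary, unsafe_ids, k=2):
    """Subtract the part of h that the sparse code attributes to unsafe concepts."""
    coeffs = sparse_code(h, dictionary, k)
    return h - coeffs[unsafe_ids] @ dictionary[unsafe_ids]

def init_sae_from_dictionary(dictionary):
    """Start SAE weights at the concept directions, so atom i inherits label i."""
    decoder = dictionary.copy()
    encoder = dictionary.T.copy()                # tied initialization (an assumption)
    bias = np.zeros(len(dictionary))
    return encoder, bias, decoder

def sae_encode(h, encoder, bias):
    """ReLU encoder: non-negative activation of each atom."""
    return np.maximum(h @ encoder + bias, 0.0)

# Phase 1: build a tiny dictionary from hypothetical stimulus activations.
labels = ["weapon", "kitchen", "dog"]
stimuli = {
    "weapon": [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
    "kitchen": [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]],
    "dog": [[0.0, 0.0, 5.0]],
}
dictionary = np.stack([concept_direction(stimuli[name]) for name in labels])

# Phase 2: decompose an activation and remove the unsafe concept.
h = np.array([2.0, 1.0, 0.0])
coeffs = sparse_code(h, dictionary)
safe_h = suppress_concepts(h, dictionary, unsafe_ids=[0])

# Phase 3: dictionary-initialized SAE whose atoms carry the concept labels.
encoder, bias, decoder = init_sae_from_dictionary(dictionary)
codes = sae_encode(h, encoder, bias)
active = [labels[i] for i in np.flatnonzero(codes)]
print(coeffs, safe_h, active)
```

In this toy setup the atoms are orthonormal, so the decomposition is exact; with thousands of correlated concept directions the choice of solver and sparsity level would matter far more.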

Experimental Validation

We conducted experiments on multiple MLLMs (QwenVL, LLaVA, and InternVL) across safety benchmarks including MM-SafetyBench and JailBreakV. The results show that DACO significantly improves MLLM safety while maintaining general-purpose capabilities. This matters because defenses such as prompt engineering, response classification, or finetuning are often ineffective against evolving malicious patterns.

Impact and Future Directions

The implications of DACO extend beyond immediate safety improvements. By providing granular control over model activations, it opens new avenues for research in AI safety and ethics. Future work may explore:

  • Expanding the concept dictionary to include more nuanced and diverse multimodal concepts.
  • Investigating the long-term effects of DACO on model performance and safety.
  • Developing user-friendly interfaces for practitioners to easily implement DACO in their own MLLM applications.

Conclusion

In a rapidly evolving landscape of AI safety, the Dictionary-Aligned Concept Control (DACO) framework represents a significant step forward in safeguarding Multimodal Large Language Models. By leveraging a curated dictionary and innovative sparse coding techniques, DACO not only enhances safety but also preserves the versatile capabilities of MLLMs, paving the way for more resilient and reliable AI systems.



Lazarus Omolua (https://richlyai.com/blog)