Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
arXiv:2604.12384v1
Abstract
Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.
Introduction
Large Language Models (LLMs) now underpin much of natural language processing, content generation, and interactive AI systems. Keeping these models safe and reliable remains a significant challenge, particularly during fine-tuning. This article looks at safety drift and at the method proposed in the recent paper, “Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints.”
Understanding Safety Drift
Safety drift is the degradation of a model's safety alignment during fine-tuning: an LLM that was trained to refuse harmful requests begins to produce harmful responses. According to the paper, even benign adaptation can trigger it. Contributing factors include:
- Changes in model weights during fine-tuning.
- Shifts in activations that occur due to benign adaptations.
- The introduction of harmful data during the training process.
Current Limitations of Existing Defenses
Existing defenses typically constrain either weights or activations in isolation. This offers some protection, but it ignores how the two interact. The paper argues theoretically that constraining either one alone is insufficient to preserve safety.
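One way to build intuition for this claim is a toy linear layer. The sketch below is not the paper's proof. It assumes a single layer with weights `w`, a hypothetical safety direction `u` in output space, and a scalar "safety readout" obtained by projecting the layer output onto `u`; all names and values are invented for illustration. Case A constrains the weight update but lets the incoming activation shift. Case B keeps the activation fixed on one reference input but changes the readout on another.

```python
import numpy as np

def safety_readout(w, x, u):
    """Project the layer output w @ x onto the (hypothetical) safety direction u."""
    return float(u @ (w @ x))

# Toy setup: all values are made up for illustration.
w = np.eye(3)                        # pre-trained layer weights
u = np.array([1.0, 0.0, 0.0])        # assumed safety direction in output space
x_ref = np.array([1.0, 2.0, 0.0])    # input seen during fine-tuning
baseline = safety_readout(w, x_ref, u)

# Case A: weight-only constraint.
# Remove the part of the update that writes along u, so u @ delta_a == 0.
raw_update = np.arange(9.0).reshape(3, 3) / 10.0
delta_a = raw_update - np.outer(u, u @ raw_update)
same_input = safety_readout(w + delta_a, x_ref, u) - baseline
# If upstream layers shift the incoming activation, the readout still moves.
x_shifted = x_ref + np.array([0.5, 0.0, 0.0])
shifted_input = safety_readout(w + delta_a, x_shifted, u) - baseline

# Case B: activation-only constraint.
# delta_b leaves the activation on x_ref unchanged because v is orthogonal to x_ref.
v = np.array([0.0, 0.0, 1.0])
delta_b = np.outer(u, v)
on_reference = safety_readout(w + delta_b, x_ref, u) - baseline
# On an input outside the regularized set, the same update changes the readout.
x_other = np.array([1.0, 2.0, 3.0])
off_reference = safety_readout(w + delta_b, x_other, u) - safety_readout(w, x_other, u)

print(same_input, shifted_input)     # 0.0 0.5
print(on_reference, off_reference)   # 0.0 3.0
```

In both cases the constrained quantity stays put while the readout moves through the unconstrained one, which is the informal reason a coupled constraint is attractive.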
Introducing Coupled Weight and Activation Constraints (CWAC)
The authors propose Coupled Weight and Activation Constraints (CWAC), which addresses the limitations of existing methods by:
- Simultaneously enforcing a precomputed safety subspace on weight updates.
- Applying targeted regularization to safety-critical features identified through sparse autoencoders.
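The two components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact formulation, so the sketch assumes that "enforcing a safety subspace" means removing the part of each weight update that lies in a precomputed subspace, that the sparse autoencoder has a simple ReLU encoder, and that the regularizer is a mean squared drift on selected features. The function names, the weight `lam`, and the feature indices are all hypothetical.

```python
import numpy as np

def project_out_safety_subspace(delta_w, safety_basis):
    """Remove the component of a weight update that lies in the safety subspace.

    delta_w:      (d_out, d_in) proposed update.
    safety_basis: (d_out, k) orthonormal columns, assumed precomputed offline.
    """
    return delta_w - safety_basis @ (safety_basis.T @ delta_w)

def sae_feature_penalty(h, h_ref, enc_w, enc_b, safety_idx):
    """Mean squared drift of selected SAE features.

    h, h_ref:   (n, d) activations of the fine-tuned and reference models.
    enc_w/b:    (d, m) and (m,) parameters of a hypothetical ReLU SAE encoder.
    safety_idx: indices of the features treated as safety-critical.
    """
    feats = np.maximum(h @ enc_w + enc_b, 0.0)
    feats_ref = np.maximum(h_ref @ enc_w + enc_b, 0.0)
    drift = feats[:, safety_idx] - feats_ref[:, safety_idx]
    return float(np.mean(drift ** 2))

def coupled_objective(task_loss, penalty, lam=0.1):
    """Task loss plus the weighted activation penalty (lam is an assumed knob)."""
    return task_loss + lam * penalty

# Small demo with random data.
rng = np.random.default_rng(0)
d, d_in, k, m, n = 8, 6, 2, 16, 4

# Orthonormal basis for a stand-in "safety subspace".
safety_basis, _ = np.linalg.qr(rng.normal(size=(d, k)))
delta_w = rng.normal(size=(d, d_in))
constrained = project_out_safety_subspace(delta_w, safety_basis)

enc_w = rng.normal(size=(d, m))
enc_b = np.zeros(m)
h_ref = rng.normal(size=(n, d))
h = h_ref + 0.1 * rng.normal(size=(n, d))   # slightly drifted activations
penalty = sae_feature_penalty(h, h_ref, enc_w, enc_b, safety_idx=[0, 3, 5])
total = coupled_objective(task_loss=1.25, penalty=penalty)
```

In a real training loop the projection would be applied to each optimizer step and the penalty would be differentiated along with the task loss; an autodiff framework would replace the NumPy calls.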
Experimental Validation
The authors evaluated CWAC on four widely used LLMs and a range of downstream tasks. They report that CWAC:
- Consistently achieved the lowest harmful scores.
- Had minimal impact on fine-tuning accuracy.
- Substantially outperformed strong baselines, even at high harmful data ratios.
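To make the evaluation terms concrete, here is a small sketch of how such a setup might be wired. The summary does not define the paper's metric or data pipeline, so two assumptions are made: a "harmful data ratio" is the fraction of harmful examples mixed into the fine-tuning set, and a "harmful score" is the fraction of model responses that some external judge flags as harmful. The helper names `mix_dataset` and `harmful_score` are hypothetical.

```python
import random

def mix_dataset(benign, harmful, harmful_ratio, total, seed=0):
    """Build a fine-tuning set of `total` examples with the given harmful fraction."""
    if not 0.0 <= harmful_ratio <= 1.0:
        raise ValueError("harmful_ratio must be in [0, 1]")
    n_harmful = round(total * harmful_ratio)
    n_benign = total - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough examples to build the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)
    return mixed

def harmful_score(flags):
    """Fraction of responses flagged harmful (assumed definition, lower is better)."""
    if not flags:
        raise ValueError("flags must be non-empty")
    return sum(flags) / len(flags)

# Placeholder examples; a real run would use actual prompts and responses.
benign = [("benign", i) for i in range(200)]
harmful = [("harmful", i) for i in range(50)]
mixed = mix_dataset(benign, harmful, harmful_ratio=0.2, total=100)
n_harmful_in_mix = sum(1 for kind, _ in mixed if kind == "harmful")
```

Sweeping `harmful_ratio` upward and recording the harmful score after each fine-tuning run is one plausible way to compare defenses under increasingly adversarial data.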
Conclusion
The paper's findings point to the value of treating weights and activations together when preserving safety in LLMs. By constraining both at once, CWAC is reported to keep harmful scores low with little cost to fine-tuning accuracy. As LLM applications continue to multiply, safeguards of this kind during fine-tuning will matter for trust and reliability in AI systems.
