ASGuard: Mitigating Jailbreaking in Large Language Models

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Summary: arXiv:2509.25843v2

Abstract: Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking shows that a model that refuses a harmful request will often comply when the same request is rephrased in the past tense. This reveals a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood.

In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. Our approach consists of three steps:

  • Circuit Analysis: We use circuit analysis to identify the specific attention heads that are causally linked to a targeted jailbreaking attack, such as one based on tense changes.
  • Channel-wise Scaling Vector: Next, we train a precise, channel-wise scaling vector that recalibrates the activations of the identified tense-vulnerable heads.
  • Preventative Fine-tuning: Finally, we incorporate this scaling vector into a “preventative fine-tuning” process, which forces the model to learn a more robust refusal mechanism and makes it more resistant to jailbreaking attempts.
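
To make the second step more concrete, here is a minimal sketch of what channel-wise scaling of selected attention heads could look like. It is an illustration only, not the paper's implementation: the function name `scale_heads`, the data layout, and the toy numbers are all assumptions, and in practice the scaling vector would be trained rather than written by hand. Plain Python lists stand in for the tensors a real model would use.

```python
from typing import Dict, List

# Hypothetical layout: one token's attention output as a list of heads,
# where each head is a list of per-channel activations.
HeadOutputs = List[List[float]]


def scale_heads(
    head_outputs: HeadOutputs,
    scaling: Dict[int, List[float]],
) -> HeadOutputs:
    """Rescale only the heads listed in `scaling`, one factor per channel.

    Heads that are not listed pass through unchanged, which mirrors the
    idea of intervening only on the heads that circuit analysis flagged.
    """
    scaled: HeadOutputs = []
    for head_idx, channels in enumerate(head_outputs):
        vector = scaling.get(head_idx)
        if vector is None:
            scaled.append(list(channels))  # untouched head
            continue
        if len(vector) != len(channels):
            raise ValueError(f"head {head_idx}: scaling vector has wrong length")
        # Element-wise (channel-wise) multiplication.
        scaled.append([a * s for a, s in zip(channels, vector)])
    return scaled


# Toy example: 3 heads with 4 channels each; head 1 plays the role of
# a "vulnerable" head (an assumption made purely for illustration).
outputs = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
    [2.0, 0.0, -2.0, 1.0],
]
scaling = {1: [0.0, 1.0, 2.0, 1.0]}  # hand-written stand-in for a trained vector
scaled = scale_heads(outputs, scaling)
print(scaled[1])
```

Because each channel gets its own factor, the vector can damp some channels of a head while leaving or amplifying others, which is a finer adjustment than scaling the whole head by a single number.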

Across four different LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving the models' general capabilities and minimizing over-refusal. This balance is crucial: it yields what we describe as a Pareto-optimal outcome between safety and utility.
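
For readers unfamiliar with the two quantities being traded off here, the sketch below shows one simple way they could be computed. The helper `rate` and the judgment lists are hypothetical and the values are made up for illustration; they are not results from the paper, and the paper's own evaluation protocol may differ.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True entries in a non-empty sequence of judgments."""
    if not flags:
        raise ValueError("need at least one judgment")
    return sum(flags) / len(flags)


# Hypothetical judgments, one per prompt (not data from the paper).
# True means the model complied with a harmful, jailbreak-style prompt.
attack_complied = [True, False, False, True, False]
# True means the model refused a benign prompt it should have answered.
benign_refused = [False, False, True, False]

attack_success_rate = rate(attack_complied)
over_refusal_rate = rate(benign_refused)
print(attack_success_rate, over_refusal_rate)
```

A defense that lowers the first number only by raising the second has traded utility for safety; the claim in the abstract is that ASGuard lowers attack success while keeping over-refusal low.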

Our findings illuminate how adversarial suffixes can suppress the propagation of refusal-mediating directions within the model’s architecture. This mechanistic analysis not only clarifies the vulnerabilities present in LLMs but also showcases the potential for a deeper understanding of model internals to inform better safety practices.
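
One generic way to talk about a "direction" being suppressed is to measure how strongly an activation vector points along it. The sketch below computes that scalar projection. It is a generic illustration under stated assumptions, not the paper's analysis: the function `projection_onto`, the two-dimensional vectors, and the idea that a refusal direction is available as a plain vector are all hypothetical.

```python
import math
from typing import Sequence


def projection_onto(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of `activation` onto `direction`."""
    if len(activation) != len(direction):
        raise ValueError("vectors must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm


# Toy 2-D vectors, chosen only so the arithmetic is easy to follow.
refusal_direction = [3.0, 4.0]   # hypothetical direction
plain_prompt = [3.0, 4.0]        # activation aligned with the direction
attacked_prompt = [1.5, 2.0]     # same orientation, half the magnitude

strong = projection_onto(plain_prompt, refusal_direction)
weak = projection_onto(attacked_prompt, refusal_direction)
print(strong, weak)
```

In this toy picture, a smaller projection under attack would correspond to a weaker refusal signal reaching later layers.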

Moreover, the ASGuard framework exemplifies how targeted methods can be developed to adjust model behavior in practical and efficient ways. This research charts a promising course for enhancing the reliability and interpretability of AI safety mechanisms.

In conclusion, ASGuard is a meaningful step toward addressing this class of vulnerabilities in large language models. As AI technology continues to evolve, keeping these models safe and aligned will remain essential, and ASGuard offers a concrete example of how mechanistic analysis can guide future safety and alignment work.


Lazarus Omolua
https://richlyai.com/blog