Enhancing LLM Security with Sparse Autoencoders

Towards Understanding the Robustness of Sparse Autoencoders

Summary: arXiv:2604.18756v1 (announce type: cross)

Abstract: Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients.

Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal:

(i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and
(ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance.

These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.

Introduction

As Large Language Models (LLMs) are deployed in more applications, their vulnerabilities have become a significant concern. In particular, optimization-based jailbreak attacks let adversaries manipulate model behavior by exploiting the model's internal gradient structure.

Sparse Autoencoders and Their Role

Sparse Autoencoders (SAEs) are a type of neural network that learns efficient representations of data by enforcing sparsity in the encoded features. They have gained traction for their potential in improving the interpretability of deep learning models. However, their capacity to enhance model robustness against adversarial attacks has not been thoroughly investigated.
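To make the idea of a sparse code concrete, here is a minimal sketch of an SAE forward pass in plain Python. It is illustrative only: the class name TinySAE, the random untrained weights, the tied decoder, and the top-k rule for enforcing sparsity are all assumptions made for this example, and the source does not say which SAE variant or sparsity mechanism the study uses.

```python
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def top_k(vec, k):
    """Keep the k largest entries of vec and zero out the rest."""
    keep = set(sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(vec)]

def l0(features):
    """L0 'norm': the number of non-zero (active) features."""
    return sum(1 for x in features if x != 0.0)

class TinySAE:
    """Hypothetical toy SAE with random, untrained weights (illustration only)."""

    def __init__(self, d_model, d_features, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder: d_features rows, each of length d_model.
        self.w_enc = [[rng.gauss(0.0, 0.5) for _ in range(d_model)]
                      for _ in range(d_features)]
        # Assumption: tied decoder, i.e. the transpose of the encoder.
        self.w_dec = [list(col) for col in zip(*self.w_enc)]

    def encode(self, x):
        pre = [max(0.0, a) for a in matvec(self.w_enc, x)]  # ReLU
        return top_k(pre, self.k)                           # sparsify

    def decode(self, features):
        return matvec(self.w_dec, features)

sae = TinySAE(d_model=4, d_features=16, k=3)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)
reconstruction = sae.decode(features)
print("active features:", l0(features), "of", len(features))
```

The encoder expands the 4-dimensional input into 16 candidate features, but at most k = 3 of them are non-zero. The count of active features is the L0 quantity that the sparsity ablation discussed later varies.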

Methodology

In this study, we integrate pretrained SAEs into the residual streams of transformer architectures at inference time. The integration leaves the model's weights unchanged and does not block gradient flow, so white-box attackers can still differentiate through the SAE. This setup lets us measure how SAEs affect the resilience of LLMs against different attack vectors.
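The integration described above can be sketched on a toy residual stream. Everything in this example is hypothetical: run_with_sae, toy_sae, and the three hand-written blocks stand in for a real transformer and a pretrained SAE, and the top-k magnitude filter is only a placeholder for an SAE's encode-then-decode step. In a PyTorch model, the same splice is commonly done with a forward hook (torch.nn.Module.register_forward_hook), though the source does not describe the mechanism it uses.

```python
def run_with_sae(blocks, h, sae=None, sae_layer=None):
    """Run residual blocks over hidden state h.

    After block number `sae_layer`, the hidden state is replaced by sae(h).
    The block functions themselves (the 'model weights') are never modified.
    """
    for i, block in enumerate(blocks):
        h = [a + b for a, b in zip(h, block(h))]  # residual update: h + f(h)
        if sae is not None and i == sae_layer:
            h = sae(h)                            # splice in the reconstruction
    return h

def toy_sae(h, k=2):
    """Placeholder for SAE encode+decode: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(h)), key=lambda i: abs(h[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(h)]

# Three made-up residual blocks standing in for transformer layers.
def block_a(h):
    return [0.1 * x for x in h]

def block_b(h):
    return [0.01 * x * x for x in h]

def block_c(h):
    return [-0.05 * x for x in h]

blocks = [block_a, block_b, block_c]
h0 = [1.0, -2.0, 0.5, 3.0]

baseline = run_with_sae(blocks, h0)
defended = run_with_sae(blocks, h0, sae=toy_sae, sae_layer=1)
print("baseline:", baseline)
print("defended:", defended)
```

The only difference between the two runs is the extra projection after the middle block. Layers after that point see a sparsified hidden state, which is the bottleneck the study investigates.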

Results

The experiments cover four model families (Gemma, LLaMA, Mistral, and Qwen), two white-box attacks (GCG and BEAST), and three black-box benchmarks. SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline. They also reduce the transferability of attacks across models.
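For readers unfamiliar with the metric, the arithmetic behind an "Nx reduction" can be sketched as follows. The outcome lists are made-up numbers chosen to produce a factor of 5 and are not results from the paper; the function names are likewise hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that succeeded (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(1 for ok in outcomes if ok) / len(outcomes)

def reduction_factor(baseline_asr, defended_asr):
    """How many times lower the defended success rate is than the baseline."""
    if defended_asr == 0:
        return float("inf")  # the defense blocked every attempt
    return baseline_asr / defended_asr

# Hypothetical data: 100 attack prompts per condition.
baseline_outcomes = [True] * 50 + [False] * 50   # 50 of 100 jailbreaks succeed
defended_outcomes = [True] * 10 + [False] * 90   # 10 of 100 jailbreaks succeed

baseline_asr = attack_success_rate(baseline_outcomes)
defended_asr = attack_success_rate(defended_outcomes)
factor = reduction_factor(baseline_asr, defended_asr)
print(f"baseline ASR={baseline_asr:.2f}, defended ASR={defended_asr:.2f}, "
      f"reduction={factor:.1f}x")
```

Note that "up to 5x" describes the best case across the tested configurations, so individual model and attack pairs may show smaller reductions.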

Key Findings

Our investigation yielded two critical findings:

(i) Attack success rate varies monotonically with L0 sparsity (the number of active SAE features), a dose-response relationship in which stronger sparsification lowers vulnerability.
(ii) The tradeoff between defense and utility depends on the layer where the SAE is inserted. Intermediate layers strike the best balance between robustness and clean performance, so the insertion layer needs careful tuning.
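A dose-response claim of this kind can be checked mechanically on a sparsity sweep. The sketch below assumes a hypothetical list of (L0, attack success rate) pairs; the numbers are invented for illustration and are not taken from the paper.

```python
def is_monotonic(points):
    """True if the second value never reverses direction as the first grows.

    `points` is a list of (l0, attack_success_rate) pairs in any order.
    """
    rates = [asr for _, asr in sorted(points)]
    diffs = [b - a for a, b in zip(rates, rates[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

# Hypothetical sweep: fewer active features (lower L0) pairs with lower ASR.
sweep = [(128, 0.35), (16, 0.08), (64, 0.20), (32, 0.12)]
print(is_monotonic(sweep))

# A sweep with a reversal would not support a dose-response reading.
noisy_sweep = [(16, 0.10), (32, 0.30), (64, 0.20)]
print(is_monotonic(noisy_sweep))
```

The same check applied per layer would not be expected to pass for the second finding, since a tradeoff that favors intermediate layers is by nature non-monotonic in depth.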

Conclusion

The results of our study are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry that jailbreak attacks exploit. SAEs, already widely used for interpretability, can therefore also strengthen defenses against jailbreak attempts. Future research could extend these findings to other model architectures and attack types to further improve the robustness of LLMs.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
