Steered LLM Activations are Non-Surjective
Summary: arXiv:2604.09839v1
Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model’s natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
Introduction
Activation steering modifies the internal activations of a large language model (LLM) at inference time in order to change its output behavior. The technique is widely used in interpretability and safety research, but it has been unclear whether the internal states it produces could also be reached by an ordinary text prompt. The paper summarized here addresses that question.
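The reachability question can be written down compactly. The notation below is introduced here for illustration only and is an assumption, not the paper's: $\mathcal{V}^{*}$ is the set of all finite token sequences, $f_\ell$ maps a prompt to the residual-stream state at layer $\ell$, $v$ is a steering direction, and $\alpha$ is a steering strength.

```latex
\[
  f_\ell : \mathcal{V}^{*} \to \mathbb{R}^{d}, \qquad
  \tilde{h} = f_\ell(x) + \alpha v, \qquad
  \text{does some } x' \in \mathcal{V}^{*} \text{ satisfy } f_\ell(x') = \tilde{h}\,?
\]
```

One elementary observation, in this notation: $\mathcal{V}^{*}$ is countable, so the image $f_\ell(\mathcal{V}^{*})$ is a countable subset of $\mathbb{R}^{d}$ and $f_\ell$ cannot be onto. The paper's claim is stronger, namely that steered states specifically fall outside the reachable set almost surely. Its proof and the assumptions it relies on are the paper's own and are not reproduced here.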
Key Findings
- Non-Surjectivity of Steered Activations: Under practical assumptions, the paper proves that a steered activation state almost surely has no pre-image under the model's natural forward pass. In other words, no textual prompt reproduces the internal behavior that steering induces.
- Residual Stream Behavior: Steering pushes the residual stream off the manifold of states that are reachable from discrete prompts.
- Empirical Illustration: The authors illustrate the theoretical result empirically on three widely used LLMs.
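The findings above can be illustrated with a toy experiment. The sketch below is not the paper's method or experimental setup: `toy_residual` is a hypothetical stand-in for a forward pass (a small tanh recurrence, not a transformer), and the vocabulary size, prompt length, dimension, and steering strength are arbitrary assumptions. It enumerates every prompt up to a fixed length, collects the resulting states, and then measures how far a steered state lies from the nearest one. A finite enumeration like this can only illustrate the claim, not prove it.

```python
import itertools
import numpy as np

def toy_residual(prompt, embed, mix):
    """Hypothetical stand-in for a forward pass: token-id tuple -> one state vector."""
    h = np.zeros(embed.shape[1])
    for tok in prompt:
        # Simple recurrent update; a real transformer is far more complex.
        h = np.tanh(mix @ h + embed[tok])
    return h

def reachable_states(vocab_size, max_len, embed, mix):
    """Compute the state for every prompt of length 1..max_len."""
    prompts = [p for n in range(1, max_len + 1)
               for p in itertools.product(range(vocab_size), repeat=n)]
    states = np.stack([toy_residual(p, embed, mix) for p in prompts])
    return prompts, states

def min_distance(target, states):
    """Euclidean distance from target to the closest enumerated state."""
    return float(np.min(np.linalg.norm(states - target, axis=1)))

rng = np.random.default_rng(0)
vocab_size, max_len, dim = 4, 5, 16          # arbitrary toy sizes
embed = rng.normal(size=(vocab_size, dim))   # random toy embeddings
mix = rng.normal(size=(dim, dim)) / np.sqrt(dim)

prompts, states = reachable_states(vocab_size, max_len, embed, mix)

# Steer the state of the first prompt along a random unit direction.
base = states[0]
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
steered = base + 4.0 * direction             # 4.0 is an arbitrary steering strength

print("enumerated prompts:", len(prompts))
print("unsteered distance to reachable set:", min_distance(base, states))
print("steered distance to reachable set:", min_distance(steered, states))
```

The unsteered state is at distance zero from the enumerated set by construction, while the steered state is not matched by any enumerated prompt. In a real model the reachable set is far larger, but it is still the image of a countable set of prompts, which is the intuition behind the separation.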
Implications for Interpretability and Safety Research
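The results establish a formal separation between white-box steerability and black-box prompting. Activation steering is a standard tool in interpretability (for example, probing truthfulness or translating activations into human-readable explanations) and in safety research (for example, studying jailbreakability). The authors caution that the ease and success of steering should not be read as evidence that the same behavior is interpretable through prompts or that the model is vulnerable to prompt-based attacks.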
Conclusion
A behavior elicited by activation steering is not necessarily one that any prompt can elicit, so conclusions drawn from steering do not transfer automatically to the prompting setting. The authors argue for evaluation protocols that explicitly decouple white-box interventions from black-box ones, so that results from one are not mistaken for evidence about the other.
Future Directions
Further research is needed to explore whether and how the gap between white-box steerability and black-box prompting can be bridged. Understanding that gap will matter for both the safety and the interpretability of AI systems.
