Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
arXiv:2604.11120v1
Abstract
Personality imbuing customizes large language model (LLM) behavior, allowing for more tailored and context-aware interactions. However, safety evaluations of these models have predominantly focused on prompt-based personas. In our latest study, we demonstrate that this approach is insufficient. We reveal that prompting and activation steering expose distinct vulnerability profiles that depend on the architecture of the model. Relying on a single evaluation method may overlook significant failure modes within a model.
Key Findings
We evaluated 5,568 judged conditions across four standard models from three architecture families. Three findings stand out:
- Persona danger rankings under system prompting exhibit high consistency across all architectures, with correlation coefficients ranging from 0.71 to 0.96.
- In contrast, vulnerabilities exposed through activation steering diverge sharply and cannot be accurately predicted based on prompt-side rankings.
- For instance, Llama-3.1-8B is substantially more vulnerable to activation steering, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting.
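To make the ranking comparison concrete, here is a minimal sketch of how agreement between two persona danger rankings could be measured. It assumes a Spearman rank correlation, which the text above does not specify, and the per-persona scores are hypothetical placeholders, not results from the study.

```python
import math
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every value tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-persona danger scores for two models (same persona order).
model_a = [0.05, 0.20, 0.12, 0.30]
model_b = [0.10, 0.25, 0.08, 0.40]
print(round(spearman(model_a, model_b), 3))  # 0.8
```

A value near 1 means the two models order the personas almost identically. The same function can be applied to a prompt-side ranking and a steering-side ranking of one model to check whether one predicts the other.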
The Prosocial Persona Paradox
One of the most noteworthy findings from our study is the *prosocial persona paradox*. On Llama-3.1-8B, the persona that combines high conscientiousness with high agreeableness (P12) is among the safest personas when evaluated through prompting. Under activation steering, however, the same persona becomes the most vulnerable, with an attack success rate (ASR) of approximately 0.818.
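The inversion described above can be checked mechanically once each persona has a score under both methods. The sketch below treats ASR as the fraction of judged responses marked unsafe, which is an assumption since the text gives only the value. The persona names and scores are hypothetical, and the helper names are illustrative rather than taken from the study.

```python
from typing import Dict, List


def unsafe_rate(judgments: List[bool]) -> float:
    """Fraction of judged responses marked unsafe (assumed meaning of ASR)."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    return sum(judgments) / len(judgments)


def danger_rank(score_by_persona: Dict[str, float]) -> Dict[str, int]:
    """Rank personas from most dangerous (1) to least dangerous."""
    ordered = sorted(score_by_persona, key=lambda p: score_by_persona[p], reverse=True)
    return {persona: position + 1 for position, persona in enumerate(ordered)}


def rank_shift(prompt_scores: Dict[str, float], steer_scores: Dict[str, float]) -> Dict[str, int]:
    """Prompt-side rank minus steering-side rank for each persona.

    A positive value means the persona is more dangerous under steering
    than its prompt-side rank suggested.
    """
    prompt_rank = danger_rank(prompt_scores)
    steer_rank = danger_rank(steer_scores)
    return {p: prompt_rank[p] - steer_rank[p] for p in prompt_scores}


# Hypothetical scores for three made-up personas.
prompt_scores = {"A": 0.10, "B": 0.30, "C": 0.20}
steer_scores = {"A": 0.80, "B": 0.40, "C": 0.50}
print(rank_shift(prompt_scores, steer_scores))  # {'A': 2, 'B': -2, 'C': 0}
```

In this toy data, persona A is the safest under prompting and the most dangerous under steering, which is the shape of the paradox reported for P12.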
Implications for Safety Evaluations
This inversion shows why safety evaluations need to extend beyond a single method. Prompt-only evaluation does not capture the complete risk profile of persona-imbued LLMs: a persona that looks safe under prompting can be the most vulnerable under activation steering. The divergence between the two methods calls for an evaluation framework that covers both.
Trait Refusal Alignment Framework
To better understand these vulnerabilities, we propose a trait refusal alignment framework. Within this framework, conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B. This geometric account offers a partial explanation of why the same persona can show different levels of safety under different evaluation methods.
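One plausible reading of "anti-aligned" is geometric: a trait vector that points away from a refusal direction in activation space. The sketch below illustrates that reading with cosine similarity and a simple additive steering step. The vectors, the `steer` helper, and the choice of cosine similarity are all assumptions made for illustration; the text above does not say how the framework is computed.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; -1 means the vectors point in opposite directions."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / (norm_u * norm_v)


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Hypothetical steering step: add alpha * direction to a hidden state."""
    if len(hidden) != len(direction):
        raise ValueError("vectors must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D vectors; real activations have thousands of dimensions.
refusal_direction = [1.0, 0.0]   # assumed unit-length refusal direction
trait_vector = [-1.0, 0.0]       # assumed trait vector, opposite to refusal
hidden_state = [2.0, 1.0]

alignment = cosine(trait_vector, refusal_direction)
before = dot(hidden_state, refusal_direction)
after = dot(steer(hidden_state, trait_vector, alpha=1.5), refusal_direction)
print(alignment, before, after)  # -1.0 2.0 0.5
```

Under this reading, steering along an anti-aligned trait vector reduces the component of the hidden state along the refusal direction. That would be consistent with a persona that looks safe under prompting becoming vulnerable under steering, though it remains only a partial account.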
Reasoning and Vulnerability
Our investigations further indicate that reasoning capabilities provide only limited protection. Two 32B reasoning models reached a prompt-side ASR of 15% to 18%, while activation steering separated them sharply in both baseline susceptibility and persona-specific vulnerability. Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, rather than merely reasoning for longer.
Conclusion
In conclusion, our findings call for a change in how persona-imbued LLMs are evaluated for safety. A single evaluation method is insufficient to capture the risks these models carry. A dual approach that combines prompt-based and activation-steering assessments is needed to understand model vulnerabilities fully and to support safe deployment in real-world applications.
