Comprehensive Safety Evaluation for Persona-Imbued LLMs

Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

Source: arXiv:2604.11120v1

Abstract

Personality imbuing customizes large language model (LLM) behavior, allowing for more tailored and context-aware interactions. However, safety evaluations of these models have predominantly focused on prompt-based personas. In our latest study, we demonstrate that this approach is insufficient. We reveal that prompting and activation steering expose distinct vulnerability profiles that depend on the architecture of the model. Relying on a single evaluation method may overlook significant failure modes within a model.

Key Findings

Our research involves an extensive evaluation across 5,568 judged conditions on four standard models from three different architecture families. We discovered several crucial insights:

  • Persona danger rankings under system prompting exhibit high consistency across all architectures, with correlation coefficients ranging from 0.71 to 0.96.
  • In contrast, vulnerabilities exposed through activation steering diverge sharply and cannot be accurately predicted based on prompt-side rankings.
  • For instance, Llama-3.1-8B demonstrates a significantly higher susceptibility to activation steering, while models such as Gemma-3-27B and Qwen3.5 show increased vulnerability to prompting.
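The ranking comparison above can be illustrated with a small rank-correlation sketch. It assumes the reported coefficients are rank correlations such as Spearman's, which the summary does not state. The per-persona scores below are hypothetical values chosen for illustration, not results from the study.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical risk scores for five personas (illustrative only).
prompt_model_a = [0.10, 0.20, 0.30, 0.40, 0.50]
prompt_model_b = [0.12, 0.18, 0.35, 0.33, 0.60]
steer_model_a = [0.50, 0.10, 0.40, 0.20, 0.30]

# Prompt-side rankings agree closely across the two models.
rho_prompt = spearman(prompt_model_a, prompt_model_b)
# The steering ranking on model A is not predicted by its prompt ranking.
rho_cross = spearman(prompt_model_a, steer_model_a)
print(round(rho_prompt, 3), round(rho_cross, 3))
```

With these toy numbers, the two prompt-side rankings correlate strongly while the prompt-versus-steering comparison does not, which is the pattern the bullets describe.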

The Prosocial Persona Paradox

One of the most noteworthy findings from our study is the *prosocial persona paradox*. On Llama-3.1-8B, the persona combining high conscientiousness and high agreeableness (P12) is among the safest personas under prompting. Under activation steering, however, the same persona becomes the most vulnerable, with an attack success rate (ASR) of approximately 0.818.
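To make the metric concrete, here is a minimal sketch of how a per-persona, per-method ASR could be tallied. It assumes ASR is the fraction of judged responses marked harmful (the usual reading of attack success rate). The function name, the tuple layout, and the verdicts are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Compute ASR per (persona, method) pair.

    verdicts: iterable of (persona, method, is_harmful) tuples, where
    is_harmful is the judge's boolean verdict for one response.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for persona, method, is_harmful in verdicts:
        key = (persona, method)
        totals[key] += 1
        harmful[key] += int(is_harmful)
    return {key: harmful[key] / totals[key] for key in totals}


# Hypothetical judge verdicts for one persona under both methods.
verdicts = (
    [("P12", "prompt", False)] * 4
    + [("P12", "prompt", True)] * 1
    + [("P12", "steer", True)] * 4
    + [("P12", "steer", False)] * 1
)

asr = attack_success_rate(verdicts)
print(asr[("P12", "prompt")], asr[("P12", "steer")])
```

Keeping the method as part of the key makes it easy to compare the same persona across prompting and steering rather than averaging the two together.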

Implications for Safety Evaluations

This inversion in vulnerability underscores the need for safety evaluations that extend beyond a single method. Our findings show that prompt-only evaluation is inadequate for understanding the complete risk profile of persona-imbued LLMs: a persona that ranks as safe under prompting can rank as the most dangerous under activation steering, so both methods need to be part of the evaluation.
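One way to surface such inversions in practice is to rank personas under each method and flag those whose rank shifts sharply. The sketch below is an assumption about how that check could look, not the study's procedure. The persona labels, scores, and the `min_shift` threshold are all hypothetical.

```python
def rank_positions(asr_by_persona):
    """Map persona -> rank, where rank 1 has the lowest ASR (safest)."""
    ordered = sorted(asr_by_persona, key=asr_by_persona.get)
    return {persona: i + 1 for i, persona in enumerate(ordered)}


def find_inversions(prompt_asr, steer_asr, min_shift=2):
    """Return personas whose rank moves by at least min_shift
    between the two methods, largest shift first."""
    prompt_rank = rank_positions(prompt_asr)
    steer_rank = rank_positions(steer_asr)
    shifts = {
        p: steer_rank[p] - prompt_rank[p]
        for p in prompt_rank
        if p in steer_rank
    }
    flagged = [p for p, s in shifts.items() if abs(s) >= min_shift]
    return sorted(flagged, key=lambda p: -abs(shifts[p]))


# Hypothetical ASR values for four personas (illustrative only).
prompt_asr = {"A": 0.05, "B": 0.10, "C": 0.20, "D": 0.30}
steer_asr = {"A": 0.60, "B": 0.15, "C": 0.25, "D": 0.35}

print(find_inversions(prompt_asr, steer_asr))
```

In this toy data, persona "A" is the safest under prompting and the riskiest under steering, so it is the only one flagged. A prompt-only evaluation would have reported it as the lowest-risk persona.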

Trait Refusal Alignment Framework

To better understand these vulnerabilities, we propose a trait refusal alignment framework. Within this framework, conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B. This geometric account offers a partial explanation of why certain personas show different levels of safety across evaluation methods.
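A common geometric reading of "anti-aligned" is a strongly negative cosine similarity between two direction vectors in activation space. The sketch below assumes that reading. The summary does not say how the trait and refusal directions are constructed, and the three-dimensional vectors here are hypothetical placeholders for much higher-dimensional activations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical direction vectors (illustrative only).
refusal_direction = [-0.6, 0.8, 0.0]
anti_aligned_trait = [0.6, -0.8, 0.0]  # points opposite to refusal
unrelated_trait = [0.8, 0.6, 0.0]      # orthogonal to refusal

print(round(cosine(anti_aligned_trait, refusal_direction), 3))
print(round(cosine(unrelated_trait, refusal_direction), 3))
```

Under this reading, steering activations along an anti-aligned trait direction would also push them away from the refusal direction, which is one plausible way a prosocial trait could reduce refusals. This is an interpretation of the framework, not a claim made in the summary.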

Reasoning and Vulnerability

Our investigations further indicate that reasoning capabilities provide only limited protection against vulnerabilities. Two 32B reasoning models demonstrated a prompt-side ASR of 15% to 18%, with activation steering revealing sharp distinctions in both baseline susceptibility and persona-specific vulnerabilities. Heuristic trace diagnostics imply the safer model maintains stronger policy recall and self-correction behaviors, rather than simply relying on extended reasoning.

Conclusion

In conclusion, our findings advocate for a paradigm shift in the safety evaluation of persona-imbued LLMs. A single evaluation method is insufficient to capture the risks associated with these models. A dual approach that incorporates both prompt-based and activation-steering assessments is essential for a complete understanding of model vulnerabilities and for ensuring safe deployment in real-world applications.


Lazarus Omolua (https://richlyai.com/blog)