PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment
Summary: arXiv:2604.08986v1
Persona prompting, in which a large language model (LLM) is assigned a specific character or role, has gained traction as a way to steer model behavior and improve task performance. The aim is better instruction adherence and higher output quality. However, identifying the best persona for a given task is time-consuming, and the effect of different personas on output quality remains poorly understood.
Previous studies addressed persona sensitivity mainly at the prompt level, using inference-time strategies that often add computational cost. This work instead targets the training phase, aiming for models that adapt their behavior to a variety of personas while maintaining robust task performance.
Key Findings
The research shows that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts. It also exposes a trade-off inherent in outcome-based optimization: RLVR makes models more robust on tasks with clear, verifiable goals, but it can reduce the expressivity of the assigned persona. The effect is most evident in in-character role-playing, where an RLVR-trained model may struggle to maintain its persona.
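To make the trade-off concrete, here is a minimal sketch of the kind of outcome-based reward RLVR relies on. It assumes a math task whose final answer appears in a `\boxed{...}` span. The function names and the exact-match rule are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} span, or None if absent.

    Assumption: answers are marked with \\boxed{}, a common convention
    for math benchmarks. Nested braces are not handled in this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary outcome reward: 1.0 for an exact match with the reference."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


# The reward looks only at the final answer, so an in-character response
# and a neutral one score the same when both are correct.
in_character = "Arr, by my reckoning the answer be \\boxed{42}."
neutral = "The answer is \\boxed{42}."
print(verifiable_reward(in_character, "42"), verifiable_reward(neutral, "42"))
```

Because a reward of this shape never inspects style, optimizing it alone gives the model no incentive to stay in character, which is consistent with the loss of persona expressivity described above.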
Introducing PerMix-RLVR
To address this limitation, the authors propose PerMix-RLVR, a persona-mixed reinforcement learning strategy designed to balance robustness against fidelity. It preserves strong robustness to harmful persona variation while allowing more faithful persona adoption when the task calls for it.
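The summary does not say how personas are mixed during training. One plausible reading is that each training prompt is paired with a persona sampled from a pool, so the policy sees the same task under many personas. The sketch below illustrates that reading only. The persona pool, the `build_mixed_batch` helper, and the prompt layout are all hypothetical.

```python
import random

# Hypothetical persona pool; the empty string stands for "no persona".
PERSONAS = [
    "",
    "You are a meticulous mathematician.",
    "You are a pirate who speaks in nautical slang.",
]


def build_mixed_batch(questions, personas, rng):
    """Pair each question with a randomly sampled persona.

    Returns a list of dicts holding the sampled persona and the full prompt.
    This is an assumed mixing scheme, not the authors' documented procedure.
    """
    batch = []
    for question in questions:
        persona = rng.choice(personas)
        # Put the persona first, then the task; strip() drops the leading
        # blank lines when the empty persona is sampled.
        prompt = f"{persona}\n\n{question}".strip()
        batch.append({"persona": persona, "prompt": prompt})
    return batch


rng = random.Random(0)  # seeded for reproducibility
for item in build_mixed_batch(["What is 2 + 2?"] * 3, PERSONAS, rng):
    print(repr(item["prompt"]))
```

Rollouts generated from such a batch could then be scored by whatever reward the training setup uses, with the persona varying from example to example.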
Performance Metrics
Empirical results show that PerMix-RLVR improves both persona stability and persona fidelity. It raises the persona stability score (PSS) by +21.2% on the MATH500 dataset and persona fidelity by +11.4% on the PersonaGym evaluation.
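The summary does not define how PSS is computed. As an illustration of what a stability measure could look like, the sketch below uses a hypothetical stand-in: the ratio of worst-persona accuracy to best-persona accuracy on the same question set. This is not the paper's PSS formula, and the numbers in the example are made up for demonstration.

```python
def accuracy(outcomes):
    """Fraction of correct answers; outcomes is a non-empty list of 0/1."""
    return sum(outcomes) / len(outcomes)


def stability_ratio(per_persona_outcomes):
    """Worst-to-best accuracy ratio across personas, in [0, 1].

    A value of 1.0 means every persona reaches the same accuracy.
    Hypothetical stand-in metric, not the PSS definition from the paper.
    """
    accuracies = [accuracy(o) for o in per_persona_outcomes.values()]
    best = max(accuracies)
    return min(accuracies) / best if best > 0 else 0.0


# Made-up outcomes for two personas on the same four questions.
results = {
    "no_persona": [1, 1, 1, 0],  # accuracy 0.75
    "pirate": [1, 0, 1, 0],      # accuracy 0.50
}
print(round(stability_ratio(results), 3))
```

A measure of this kind rewards uniform accuracy across personas, so it captures robustness but says nothing about fidelity, which needs a separate evaluation such as PersonaGym.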
Conclusion
This research advances LLM persona adaptation. By moving the treatment of persona sensitivity into the training phase and introducing PerMix-RLVR, the authors point toward models that are both reliable and expressive across a diverse range of personas. The work contributes to the understanding of persona sensitivity and offers a practical training strategy for future language models.
Future Work
Further study of how personas interact, and of how different training methods affect persona behavior, remains important. Continued work on the balance between robustness and expressivity could lead to models that better embody human-like characteristics.
