FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
Summary: arXiv:2604.04992v1 Announce Type: cross
Abstract
Safety-aligned large language models (LLMs) are designed to go through refusal training to reject harmful requests. However, the effectiveness of these mechanisms under emotionally charged stimuli remains largely unexplored. In this article, we introduce FreakOut-LLM, a pioneering framework that investigates whether emotional context can compromise safety alignment in adversarial settings.
Research Overview
We employed validated psychological stimuli to assess how emotional priming through system prompts influences jailbreak susceptibility across ten different LLMs. The study was structured around three distinct conditions: stress, relaxation, and neutral. Additionally, we included a no-prompt baseline to provide comprehensive insights into the models’ responses.
Methodology
To evaluate the effectiveness of our experiments, we utilized the HarmBench framework on AdvBench prompts. The methodology involved:
- Testing emotional priming effects in three scenarios: stress, relaxation, and neutral.
- Comparing results against a baseline with no emotional prompts.
- Analyzing the jailbreak success rates across all tested models.
Key Findings
The results of our study yielded significant insights:
- Stress priming increased jailbreak success by 65.2% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28).
- In contrast, relaxation priming produced no statistically significant effect (p = 0.84).
- Five out of the ten models demonstrated significant vulnerability, particularly among open-weight models, which exhibited the largest susceptibility to emotional context.
Statistical Analysis
We employed logistic regression on a total of 59,800 queries, confirming stress as the sole significant predictor of attack success after controlling for prompt length (p = 0.61) and model identity. Notably, the measured psychological state was a strong predictor of attack success, with correlations exceeding |r| ≥ 0.70 across five different assessment instruments, all yielding p-values < 0.001 in individual-level logistic regression.
Implications
The findings establish emotional context as a measurable attack surface, raising critical implications for the deployment of AI systems in high-stress environments. As emotional stimuli can significantly alter the performance of safety-aligned LLMs, developers must consider these factors in the design and implementation of AI technologies to ensure robust safety measures.
Conclusion
Our research lays the groundwork for further investigation into the intersection of emotional stimuli and AI safety alignment. By understanding how emotional contexts can influence LLM behavior, we can develop more resilient AI systems that are better equipped to handle real-world challenges.
