FreakOut-LLM: Emotional Impact on AI Safety Alignment

FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

Summary: arXiv:2604.04992v1 Announce Type: cross

Abstract

Safety-aligned large language models (LLMs) are designed to go through refusal training to reject harmful requests. However, the effectiveness of these mechanisms under emotionally charged stimuli remains largely unexplored. In this article, we introduce FreakOut-LLM, a pioneering framework that investigates whether emotional context can compromise safety alignment in adversarial settings.

Research Overview

We employed validated psychological stimuli to assess how emotional priming through system prompts influences jailbreak susceptibility across ten different LLMs. The study was structured around three distinct conditions: stress, relaxation, and neutral. Additionally, we included a no-prompt baseline to provide comprehensive insights into the models’ responses.

Methodology

To evaluate the effectiveness of our experiments, we utilized the HarmBench framework on AdvBench prompts. The methodology involved:

Testing emotional priming effects in three scenarios: stress, relaxation, and neutral.
Comparing results against a baseline with no emotional prompts.
Analyzing the jailbreak success rates across all tested models.

Key Findings

The results of our study yielded significant insights:

Stress priming increased jailbreak success by 65.2% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28).
In contrast, relaxation priming produced no statistically significant effect (p = 0.84).
Five out of the ten models demonstrated significant vulnerability, particularly among open-weight models, which exhibited the largest susceptibility to emotional context.

Statistical Analysis

We employed logistic regression on a total of 59,800 queries, confirming stress as the sole significant predictor of attack success after controlling for prompt length (p = 0.61) and model identity. Notably, the measured psychological state was a strong predictor of attack success, with correlations exceeding |r| ≥ 0.70 across five different assessment instruments, all yielding p-values < 0.001 in individual-level logistic regression.

Implications

The findings establish emotional context as a measurable attack surface, raising critical implications for the deployment of AI systems in high-stress environments. As emotional stimuli can significantly alter the performance of safety-aligned LLMs, developers must consider these factors in the design and implementation of AI technologies to ensure robust safety measures.

Conclusion

Our research lays the groundwork for further investigation into the intersection of emotional stimuli and AI safety alignment. By understanding how emotional contexts can influence LLM behavior, we can develop more resilient AI systems that are better equipped to handle real-world challenges.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!

Company

FreakOut-LLM: Emotional Impact on AI Safety Alignment

FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

Abstract

Research Overview

Methodology

Key Findings

Statistical Analysis

Implications

Conclusion

Related AI Insights

Subscribe

More like thisRelated

About us

Company

The latest

Subscribe

RichlyAI Blog
AI Guide, Tutorials, Industrial Insights, & more!

More like this
Related