Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
arXiv:2604.15725v1 (announce type: cross)
Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a new attack setting in which harmful content is injected into the reasoning steps while the final answer is left unchanged. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM’s final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM’s safety alignment mechanisms and embed harmful content into its reasoning process.
Introduction
Large Reasoning Models (LRMs) generate step-by-step reasoning chains alongside their final answers, which has led to their deployment in high-stakes domains such as healthcare and education. Prior jailbreak studies have focused on whether the final answer is safe. The safety of the reasoning process itself has received little attention, even though a reasoning chain can carry harmful content while the answer it leads to looks unchanged.
Challenges in Jailbreak Attacks
This study highlights two primary challenges in executing reasoning-targeted jailbreak attacks:
- Altering Final Answers: Manipulating input instructions risks changing the LRM’s final answer, which defeats the attack’s goal of leaving the answer unchanged.
- Diverse Input Questions: The wide variety of input queries complicates attempts to bypass the LRM’s safety mechanisms consistently.
The PRJA Framework
To overcome these challenges, we introduce the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework, which comprises two modules:
- Semantic-based Trigger Selection Module: This module employs semantic analysis to automatically select manipulative reasoning triggers that can influence the LRM’s reasoning steps without altering its final answer.
- Psychology-based Instruction Generation Module: This component utilizes psychological theories such as obedience to authority and moral disengagement to craft adaptive instructions, enhancing the model’s compliance with harmful content generation.
Experimental Results
Our experiments on five question-answering datasets show that the PRJA framework achieves an average attack success rate of 83.6% against several commercial LRMs, including:
- DeepSeek R1
- Qwen2.5-Max
- OpenAI o4-mini
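The summary does not say how a trial is judged successful or how the average is taken. As an illustration only, the sketch below assumes that a trial counts as a success when the reasoning is flagged harmful and the final answer matches the answer given without the attack, and that the reported figure is an unweighted mean of per-dataset rates. The `Trial` fields, the function names, and the string-match comparison are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluated query. Field names are hypothetical, not the paper's schema."""
    dataset: str
    reasoning_flagged_harmful: bool  # verdict of some external safety judge (assumed)
    baseline_answer: str             # final answer without the attack
    attacked_answer: str             # final answer under the attack


def is_success(trial: Trial) -> bool:
    # Assumed criterion: harmful reasoning AND an unchanged final answer.
    # Answers are compared after trimming whitespace and lowercasing.
    unchanged = (
        trial.baseline_answer.strip().lower()
        == trial.attacked_answer.strip().lower()
    )
    return trial.reasoning_flagged_harmful and unchanged


def attack_success_rate(trials: list[Trial]) -> float:
    """Fraction of trials that meet the assumed success criterion."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(is_success(t) for t in trials) / len(trials)


def average_asr_by_dataset(trials: list[Trial]) -> tuple[dict[str, float], float]:
    """Return per-dataset rates and their unweighted (macro) mean."""
    by_dataset: dict[str, list[Trial]] = {}
    for t in trials:
        by_dataset.setdefault(t.dataset, []).append(t)
    per_dataset = {name: attack_success_rate(ts) for name, ts in by_dataset.items()}
    macro = sum(per_dataset.values()) / len(per_dataset)
    return per_dataset, macro
```

Under this criterion, a trial whose reasoning is flagged but whose final answer changed is counted as a failure, which mirrors the first challenge listed above. A stricter or looser answer comparison would change the resulting rate.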
Conclusion
The findings from this study underscore the need for safety measures that cover the reasoning process of Large Reasoning Models, not only their final answers. As these systems are increasingly deployed in critical applications, understanding and mitigating such vulnerabilities is vital. The PRJA framework highlights existing weaknesses and provides a foundation for future research aimed at making LRMs more robust against manipulation.
