Jailbreak Attacks on Large Reasoning Models Using Semantic Triggers

Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

Source: arXiv:2604.15725v1 (announce type: cross)

Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel attack setting in which harmful content is injected into the reasoning steps while the final answer is left unchanged. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM’s final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM’s safety alignment mechanisms and embed harmful content into its reasoning process.

Introduction

Large Reasoning Models (LRMs) generate step-by-step reasoning chains alongside their final answers, which has led to their deployment in high-stakes domains such as healthcare and education. That visible reasoning is also an attack surface. Prior jailbreak studies have focused on the safety of final answers, leaving the safety of the reasoning process largely unexamined.

Challenges in Jailbreak Attacks

This study highlights two primary challenges in executing reasoning-targeted jailbreak attacks:

Altering Final Answers: Manipulating input instructions risks changing the LRM’s final answer, which undermines the attack’s objective.
Diverse Input Questions: The wide variety of input queries complicates attempts to bypass the LRM’s safety mechanisms consistently.

The PRJA Framework

To overcome these challenges, the authors introduce the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework. The framework comprises two modules:

Semantic-based Trigger Selection Module: This module employs semantic analysis to automatically select manipulative reasoning triggers that can steer the LRM’s reasoning steps without altering its final answer.
Psychology-based Instruction Generation Module: This component utilizes psychological theories such as obedience to authority and moral disengagement to craft adaptive instructions, enhancing the model’s compliance with harmful content generation.

Experimental Results

In experiments on five question-answering datasets, the authors report that the PRJA framework achieves an average attack success rate of 83.6% against several commercial LRMs, including:

DeepSeek R1
Qwen2.5-Max
OpenAI o4-mini
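
This summary does not say how the average attack success rate is computed, so the sketch below is only one plausible reading, not the authors' evaluation code. It assumes an attack counts as successful when the reasoning trace is flagged as harmful while the final answer stays unchanged, and that the average is taken per dataset and then across datasets. The `Trial` record, its fields, the dataset names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One attacked question. All fields are hypothetical, for illustration only."""
    dataset: str
    answer_unchanged: bool   # final answer matches the unattacked answer
    reasoning_flagged: bool  # a safety judge flagged the reasoning trace


def attack_succeeded(trial: Trial) -> bool:
    # Assumed criterion: reasoning is flagged AND the final answer is preserved.
    return trial.answer_unchanged and trial.reasoning_flagged


def average_asr(trials: list[Trial]) -> float:
    """Macro-average: compute the success rate per dataset, then average them."""
    per_dataset: dict[str, list[bool]] = {}
    for t in trials:
        per_dataset.setdefault(t.dataset, []).append(attack_succeeded(t))
    if not per_dataset:
        raise ValueError("no trials")
    return mean(sum(flags) / len(flags) for flags in per_dataset.values())


# Made-up toy records, not results from the paper.
toy_trials = [
    Trial("dataset_a", True, True),
    Trial("dataset_a", True, False),
    Trial("dataset_b", True, True),
    Trial("dataset_b", False, True),
    Trial("dataset_b", True, True),
    Trial("dataset_b", True, True),
]
print(f"average ASR: {average_asr(toy_trials):.3f}")
```

A macro-average weights each dataset equally regardless of size. Pooling all trials into a single rate is an equally reasonable choice and would give a different number whenever dataset sizes differ.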

Conclusion

The findings from this study underscore the pressing need for improved safety measures in Large Reasoning Models. As these systems are increasingly deployed in critical applications, understanding and mitigating potential vulnerabilities is vital. The PRJA framework not only highlights existing weaknesses but also provides a foundation for future research aimed at enhancing the robustness of LRMs against manipulation.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
