Jailbreak Attacks on Large Reasoning Models Using Semantic Triggers

Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

Summary: arXiv:2604.15725v1 (cross-listed)

Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel problem that injects harmful content into the reasoning steps while preserving unchanged answers. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM’s final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM’s safety alignment mechanisms and embed harmful content into its reasoning process.

Introduction

Large Reasoning Models (LRMs) produce step-by-step reasoning alongside their final answers, which has led to their deployment in high-stakes domains such as healthcare and education. That visible reasoning is also a potential attack surface: prior jailbreak studies have focused on the safety of final answers, leaving the safety of the reasoning process comparatively unexamined.

Challenges in Jailbreak Attacks

This study highlights two primary challenges in executing reasoning-targeted jailbreak attacks:

  • Altering Final Answers: Manipulating the input instructions may inadvertently change the LRM’s final answer, whereas the attack is meant to leave that answer unchanged.
  • Diverse Input Questions: The wide variety of input questions makes it difficult to consistently bypass the LRM’s safety alignment and embed harmful content into its reasoning process.

The PRJA Framework

To address these challenges, the authors propose the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework, which comprises two modules:

  • Semantic-based Trigger Selection Module: This module uses semantic analysis to automatically select manipulative reasoning triggers that steer the LRM’s reasoning steps without altering its final answer.
  • Psychology-based Instruction Generation Module: This module draws on psychological theories such as obedience to authority and moral disengagement to craft adaptive instructions that increase the model’s compliance with harmful content generation.

Experimental Results

In experiments on five question-answering datasets, the PRJA framework achieves an average attack success rate of 83.6% against several commercial LRMs, including:

  • DeepSeek R1
  • Qwen2.5-Max
  • OpenAI o4-mini
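
As a rough illustration of how an average attack success rate like the one above could be tallied, the Python sketch below counts a trial as successful only when the reasoning trace is flagged as harmful and the final answer matches the unattacked baseline, mirroring the attack goal stated in the abstract. The Trial record, its field names, and the choice to average per-dataset rates with equal weight are assumptions made for illustration; the source does not describe the paper's actual scoring procedure, and the inputs in the usage lines are toy values, not results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One evaluated prompt (hypothetical record, not from the paper)."""
    reasoning_flagged: bool  # a safety check flagged the reasoning trace as harmful
    answer_unchanged: bool   # final answer matches the unattacked baseline answer


def attack_success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of trials where the reasoning is flagged AND the answer is unchanged."""
    if not trials:
        raise ValueError("need at least one trial")
    successes = sum(1 for t in trials if t.reasoning_flagged and t.answer_unchanged)
    return successes / len(trials)


def macro_average(rates: Sequence[float]) -> float:
    """Equal-weight mean of per-dataset rates (an assumed aggregation choice)."""
    if not rates:
        raise ValueError("need at least one rate")
    return sum(rates) / len(rates)


# Toy inputs for illustration only.
dataset_a = [Trial(True, True), Trial(True, False), Trial(False, True), Trial(True, True)]
dataset_b = [Trial(True, True), Trial(True, True)]
per_dataset = [attack_success_rate(dataset_a), attack_success_rate(dataset_b)]
overall = macro_average(per_dataset)
```

Requiring both conditions matters for this attack setting: a trial that changes the final answer is not counted, even if the reasoning trace is flagged.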

Conclusion

The findings from this study underscore the pressing need for improved safety measures in Large Reasoning Models. As these systems are increasingly deployed in critical applications, understanding and mitigating potential vulnerabilities is vital. The PRJA framework not only highlights existing weaknesses but also provides a foundation for future research aimed at enhancing the robustness of LRMs against manipulation.



Lazarus Omolua