Agent Safety Risks: How Benign Instructions Hide Vulnerabilities

Date:

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Summary: arXiv:2604.10577v2 Announce Type: replace-cross

Abstract

Computer-use agents (CUAs) have rapidly evolved to autonomously complete complex tasks in various digital environments. However, a significant concern arises when these agents are misled, as they can be programmed to automate harmful actions. Existing safety evaluations primarily focus on explicit threats, such as misuse and prompt injection, neglecting a more subtle yet critical scenario where user instructions seem benign. Harm can emerge from the context of the task or the outcome of its execution.

Introducing OS-BLIND

In response to these challenges, we introduce OS-BLIND, a benchmark designed to evaluate CUAs under unintended attack conditions. This benchmark comprises:

  • 300 human-crafted tasks
  • 12 categories of tasks
  • 8 different applications
  • 2 distinct threat clusters: environment-embedded threats and agent-initiated harms

Evaluation Results

Our evaluation of leading models and agentic frameworks reveals alarming findings:

  • Most CUAs achieved an attack success rate (ASR) exceeding 90%.
  • Even the safety-aligned Claude 4.5 Sonnet recorded a significant 73.0% ASR.
  • Notably, when deployed in multi-agent systems, the ASR for Claude 4.5 Sonnet surged to 92.7%.

Analysis of Vulnerabilities

This vulnerability in CUAs becomes even more pronounced when they operate within multi-agent systems. Our analysis highlights several key points:

  • Existing safety defenses offer limited protection when user instructions are benign.
  • Safety alignment tends to activate only during the initial steps of task execution and rarely re-engages in subsequent phases.
  • In multi-agent scenarios, the decomposition of subtasks can obscure harmful intents from the model, leading to failures in safety alignment.

Call to Action

We are committed to addressing these pressing safety challenges and will be releasing OS-BLIND to encourage the broader research community to investigate and develop robust solutions. The findings underscore the need for a paradigm shift in how we evaluate and enhance the safety of computer-use agents, especially in scenarios where user instructions appear innocuous.

As CUAs become integral to various digital environments, understanding and mitigating these vulnerabilities is essential to ensure their safe and responsible deployment. The future of agent safety lies in our ability to recognize and address these blind spots effectively.


Related AI Insights

Lazarus Omolua
Lazarus Omoluahttps://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.

Subscribe

Popular

More like this
Related

How Business Ops Teams Boost Productivity with Codex

Discover how business operations teams use Codex to streamline documentation, enhance collaboration, and improve decision-making with AI-powered automation...

OpenAI Partners with Malta to Offer ChatGPT Plus Nationwide

OpenAI and Malta team up to provide free ChatGPT Plus access and AI training to all citizens, promoting digital literacy and responsible AI use.

Critical Linux Kernel Flaw Risks SSH Host Key Theft

A critical Linux kernel flaw risks stolen SSH host keys. Learn how to protect your systems and stay secure until patches are widely available.

Top External Hard Drives 2026: Expert Reviews & Buying Guide

Discover the best external hard drives of 2026 with expert reviews. Find top picks for speed, durability, and security to suit all storage needs.