Human-Guided Harm Recovery for Computer Use Agents
Summary: arXiv:2604.18847v1
Abstract: As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences.
This article discusses an approach to harm recovery for computer use agents, which are automated systems that interact with various software and hardware environments. As these systems are deployed more widely, they can cause unintended harm, so mechanisms are needed both to prevent it and to address it when it occurs.
Key Contributions
- Formative User Study: We conducted a user study to identify the dimensions of recovery that people value. The study produced a natural language rubric that captures human preferences in recovery scenarios.
- Dataset Creation: Our dataset comprises 1,150 pairwise judgments, revealing context-dependent shifts in attribute importance. Notably, users prefer pragmatic and targeted strategies over comprehensive long-term approaches.
- Reward Model Operationalization: We operationalized the insights from the user study in a reward model that re-ranks multiple candidate recovery plans generated by an agent scaffold at test time.
- Introduction of BackBench: We introduce BackBench, a benchmark of 50 computer-use tasks designed to systematically evaluate an agent’s ability to recover from harmful states.
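The pairwise judgments and the re-ranking step described in the list above can be illustrated with a small sketch. It is not the paper's reward model: it assumes a linear Bradley-Terry-style scorer over three hypothetical rubric attributes (`targeted`, `speed`, `comprehensive`), hand-assigned attribute values, and made-up plan descriptions, all chosen only for illustration. The idea it shows is that each judgment "plan A preferred over plan B" nudges the attribute weights, and the learned weights are then used to sort candidate recovery plans.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical rubric attributes (an assumption; the source does not list them).
ATTRIBUTES = ["targeted", "speed", "comprehensive"]

Features = Dict[str, float]


def score(weights: Features, plan: Features) -> float:
    """Linear reward: weighted sum of a plan's attribute values."""
    return sum(weights[a] * plan.get(a, 0.0) for a in ATTRIBUTES)


def fit_pairwise(pairs: List[Tuple[Features, Features]],
                 epochs: int = 200, lr: float = 0.1) -> Features:
    """Fit weights from (preferred, rejected) pairs.

    Uses stochastic gradient ascent on the Bradley-Terry log-likelihood,
    where P(preferred beats rejected) = sigmoid(score difference).
    """
    weights = {a: 0.0 for a in ATTRIBUTES}
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = score(weights, preferred) - score(weights, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            for a in ATTRIBUTES:
                diff = preferred.get(a, 0.0) - rejected.get(a, 0.0)
                # Gradient of log(sigmoid(margin)) w.r.t. weights[a].
                weights[a] += lr * (1.0 - p) * diff
    return weights


def rerank(weights: Features,
           candidates: List[Tuple[str, Features]]) -> List[Tuple[str, Features]]:
    """Sort candidate recovery plans from highest to lowest reward."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)


# Toy judgments (invented): annotators favored targeted, fast plans.
pairs = [
    ({"targeted": 1.0, "speed": 0.8, "comprehensive": 0.2},
     {"targeted": 0.2, "speed": 0.3, "comprehensive": 1.0}),
    ({"targeted": 0.9, "speed": 0.5, "comprehensive": 0.1},
     {"targeted": 0.4, "speed": 0.6, "comprehensive": 0.9}),
]
weights = fit_pairwise(pairs)

# Toy candidate plans (invented) that an agent scaffold might propose.
candidates = [
    ("restore everything from backup and audit all accounts",
     {"targeted": 0.1, "speed": 0.2, "comprehensive": 1.0}),
    ("revert the one modified file",
     {"targeted": 1.0, "speed": 0.9, "comprehensive": 0.1}),
]
ranked = rerank(weights, candidates)
print(ranked[0][0])
```

A linear scorer keeps the example short and makes the learned weights easy to inspect. It cannot capture the context-dependent shifts in attribute importance that the dataset reveals, which is one reason a real reward model would need to condition on the scenario as well.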
Evaluation and Results
We assessed recovery quality through human evaluation. The reward model scaffold consistently yielded higher-quality recovery trajectories than both base agents and rubric-based scaffolds. This supports an approach to agent safety that covers not only preventing harm but also handling its aftermath.
Implications for Future Research
Our findings lay a foundation for a new class of safety methods for AI agents. Aligning recovery strategies with human preferences is central to these methods, since it makes interactions between humans and automated systems more intuitive and effective.
As agents become more capable, human-guided harm recovery mechanisms will become increasingly important. Agents should be able not only to perform tasks but also to recover from mistakes in ways that human users find acceptable and beneficial.
Conclusion
Automated agents are changing rapidly, and the need for harm recovery strategies is growing with them. Our work is a step toward agents that act not just autonomously but also responsibly, with an emphasis on human alignment and safety.
