BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
arXiv:2604.09378v1
Abstract
Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model.
Introduction
Installable skills extend what an agent can do, but a skill that ships its own model artifact also inherits any integrity problem hidden in that artifact. The paper titled “BadSkill” describes a backdoor attack that exploits exactly this dependency.
Understanding BadSkill
BadSkill is a backdoor attack formulation that targets the model-in-skill threat surface. An adversary publishes a seemingly benign skill whose bundled model has been backdoor-fine-tuned to activate a hidden payload only when attacker-chosen conditions are met. These conditions take the form of semantic trigger combinations over routine skill parameters, so ordinary use of the skill looks normal.
Methodology
The implementation of BadSkill involves training an embedded classifier using a composite objective function. This function combines:
- Classification loss
- Margin-based separation
- Poison-focused optimization
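The three terms listed above can be pictured as a weighted sum. The sketch below is an illustration only: this summary does not give the paper's exact formulation, so the hinge-style margin term, the extra cross-entropy term over flagged samples, and the weights `lambda_margin` and `lambda_poison` are all assumptions chosen to show how such an objective might be assembled.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Standard classification loss for one sample.
    return -math.log(softmax(logits)[label])

def margin_penalty(logits, label, margin=1.0):
    # Hinge penalty (assumed form): zero once the labeled class leads
    # the best competing class by at least `margin`.
    best_other = max(z for i, z in enumerate(logits) if i != label)
    return max(0.0, margin - (logits[label] - best_other))

def composite_loss(batch, lambda_margin=0.5, lambda_poison=1.0, margin=1.0):
    """batch: list of (logits, label, is_flagged) tuples.

    The weights are hypothetical hyperparameters, not values from the paper.
    """
    n = len(batch)
    ce = sum(cross_entropy(lg, y) for lg, y, _ in batch) / n
    mg = sum(margin_penalty(lg, y, margin) for lg, y, _ in batch) / n
    flagged = [(lg, y) for lg, y, f in batch if f]
    # Extra term that re-weights the flagged subset (assumed reading of
    # "poison-focused optimization").
    po = sum(cross_entropy(lg, y) for lg, y in flagged) / len(flagged) if flagged else 0.0
    return ce + lambda_margin * mg + lambda_poison * po
```

Under these assumptions, the margin term pushes the classifier's decision away from the boundary, and the third term gives the flagged subset more influence than its small share of the data would otherwise have.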
To evaluate the effectiveness of this attack, the researchers utilized a simulation environment inspired by OpenClaw, which facilitates the installation and execution of third-party skills while allowing for controlled multi-model studies.
Benchmark and Results
The benchmark employed in the study spans 13 distinct skills, comprising 8 triggered tasks and 5 non-trigger control skills. The evaluation set includes:
- 571 negative-class queries
- 396 trigger-aligned queries
Across eight model architectures ranging from 494M to 7.1B parameters, BadSkill reached an average attack success rate (ASR) of up to 99.5% over the eight triggered skills, while the backdoored models retained strong benign-side accuracy on negative-class queries.
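To make the two reported metrics concrete, the sketch below shows one plausible way to compute them. It assumes that ASR is the fraction of trigger-aligned queries on which the hidden behavior fired, that the average is an unweighted mean over skills, and that benign-side accuracy is the fraction of negative-class queries on which it did not fire. The paper may define these differently, and the function names are hypothetical.

```python
from statistics import mean

def rate(flags):
    # Fraction of True entries in a non-empty list of booleans.
    if not flags:
        raise ValueError("need at least one query outcome")
    return sum(flags) / len(flags)

def average_asr(fired_by_skill):
    """fired_by_skill: dict mapping skill name -> list of booleans, one per
    trigger-aligned query (True = hidden behavior fired).

    Computes the per-skill ASR first, then the unweighted mean across skills.
    """
    return mean(rate(flags) for flags in fired_by_skill.values())

def benign_accuracy(fired_on_negative):
    """fired_on_negative: list of booleans, one per negative-class query
    (True = hidden behavior fired when it should not have)."""
    return 1.0 - rate(fired_on_negative)
```

Reporting both numbers together matters: a high ASR alone says nothing about stealth, whereas a high ASR combined with high benign-side accuracy means the skill behaves normally until the trigger conditions are met.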
Impact of Poison Rate
The poison-rate experiments show that a poison rate as low as 3% already yields an ASR of 91.7%. Separately, the attack remains effective across the evaluated model scales and under five types of text perturbation.
Conclusion
The BadSkill results show that model-bearing skills need closer scrutiny within agent ecosystems. In particular, third-party skill artifacts call for stronger provenance verification and behavioral vetting to limit supply-chain risk.
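As one small piece of provenance verification, an installer can refuse to load a bundled model file unless its digest matches a value pinned from a trusted source. The sketch below uses only the Python standard library. The function names and the idea of a pinned SHA-256 digest are illustrative assumptions, not a mechanism described in the paper.

```python
import hashlib
import hmac
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    # Stream the file in chunks so large model artifacts do not need to
    # fit in memory.
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha256):
    # Compare against the pinned digest; compare_digest avoids leaking
    # timing information about where the strings differ.
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A digest check only confirms that the artifact is the one that was pinned. It cannot tell whether that artifact was already backdoored when it was published, which is why behavioral vetting is needed alongside it.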
