SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
arXiv:2604.06550v1
Security vulnerabilities in AI agent skills are a growing threat. OpenClaw’s ClawHub marketplace hosts over 13,000 community-contributed agent skills, and recent audits have found that between 13% and 26% of them contain security vulnerabilities. Existing detection methods fall short: regex scanners miss obfuscated payloads, while formal static analyzers cannot interpret the natural language instructions where prompt injection and social engineering attacks may be concealed.
Introducing SkillSieve
To address these limitations, SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where necessary, so that cheap checks handle most skills and the more expensive LLM analysis is reserved for suspicious ones. The framework operates as follows:
- Layer 1: The initial layer runs regex, Abstract Syntax Tree (AST), and metadata checks through an XGBoost-based feature scorer. This filtering process efficiently eliminates roughly 86% of benign skills in under 40 milliseconds on average, incurring zero API cost.
- Layer 2: Suspicious skills identified in the first layer are sent for deeper analysis by a Large Language Model (LLM). However, rather than posing a single broad question, Layer 2 divides the analysis into four parallel sub-tasks:
  - Intent Alignment
  - Permission Justification
  - Covert Behavior Detection
  - Cross-file Consistency
  Each sub-task has its own prompt and structured output, allowing for a nuanced examination of potential risks.
- Layer 3: Skills deemed high-risk go before a jury of three different LLMs, each of which votes independently on the skill's risk level. If the votes disagree, the models debate to reach a consensus verdict.
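The three layers above can be read as an early-exit cascade. The following is a minimal sketch of that control flow, not the authors' implementation: the thresholds, the `triage` function, the callable signatures, and the use of the maximum sub-task risk as the Layer 2 escalation rule are all assumptions made for illustration, and the scorers, LLM calls, and debate step are passed in as plain functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

# Hypothetical cut-offs; the source does not state the actual thresholds.
LAYER1_THRESHOLD = 0.5
LAYER2_THRESHOLD = 0.5

# The four Layer 2 sub-tasks named in the text.
SUBTASKS = (
    "intent_alignment",
    "permission_justification",
    "covert_behavior_detection",
    "cross_file_consistency",
)


def triage(
    skill: str,
    layer1_score: Callable[[str], float],
    layer2_subtask: Callable[[str, str], float],
    jurors: Sequence[Callable[[str], str]],
    debate: Callable[[str, Sequence[str]], str],
) -> Tuple[str, int]:
    """Return (verdict, layer that produced it) for one skill."""
    # Layer 1: cheap static score (regex, AST, metadata features); no API calls.
    if layer1_score(skill) < LAYER1_THRESHOLD:
        return "benign", 1

    # Layer 2: run the four sub-tasks in parallel, one prompt per sub-task.
    with ThreadPoolExecutor(max_workers=len(SUBTASKS)) as pool:
        risks = list(pool.map(lambda name: layer2_subtask(name, skill), SUBTASKS))
    # Assumption: escalate if any single sub-task reports high risk.
    if max(risks) < LAYER2_THRESHOLD:
        return "benign", 2

    # Layer 3: each juror votes independently; debate only on disagreement.
    votes = [juror(skill) for juror in jurors]
    if len(set(votes)) == 1:
        return votes[0], 3
    return debate(skill, votes), 3
```

In this sketch a skill that scores low in Layer 1 never reaches an LLM, which is where the zero-API-cost filtering described above would come from. In a real system the stub callables would wrap the feature scorer and the model clients.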
Evaluation and Performance
SkillSieve was evaluated on a dataset of 49,592 real ClawHub skills, along with adversarial samples covering five distinct evasion techniques. The full pipeline ran on a $440 ARM single-board computer.
On a benchmark of 400 labeled skills, SkillSieve achieved an F1 score of 0.800, compared with 0.421 for ClawVet. The average cost per skill analyzed was $0.006.
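For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the calculation. The counts passed to it are purely hypothetical and are not the paper's confusion matrix, which this summary does not report.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to illustrate the formula:
# precision = 80/100 and recall = 80/100.
print(round(f1_score(tp=80, fp=20, fn=20), 3))
```

Because F1 penalizes both missed malicious skills and false alarms, a detector cannot score well by flagging everything or by flagging nothing.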
Open Source Commitment
The authors have open-sourced the code, data, and benchmark for SkillSieve, which supports further research on the safety and reliability of AI agent skills.
SkillSieve combines fast static filtering, structured LLM analysis, and multi-model voting to detect malicious AI agent skills efficiently and at low cost.
