Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
arXiv:2604.19533v1
AI is increasingly being integrated into security operations (SecOps). A recent paper introduces the Cyber Defense Benchmark, a framework for evaluating how well large language model (LLM) agents perform threat hunting.
Overview of the Cyber Defense Benchmark
The Cyber Defense Benchmark measures how well LLM agents perform the core threat-hunting task of a security operations center (SOC): given a database of raw Windows event logs, with no guided questions or hints, the agent must correctly identify the timestamps of malicious events.
The benchmark covers 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics. The evaluation is packaged as a Gymnasium reinforcement-learning environment in which each episode gives the agent an in-memory SQLite database of 75,000 to 135,000 log records. The records are produced by a deterministic campaign simulator that time-shifts the raw data and obfuscates its entities.
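To make the episode setup concrete, the following is a minimal sketch of how a seeded simulator could time-shift records, obfuscate entities, and load the result into an in-memory SQLite database. It is not the benchmark's implementation: the `logs` schema, the function names `shift_time`, `obfuscate`, and `build_episode_db`, the hashing scheme, and the sample records are all assumptions made for illustration.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta


def shift_time(ts: str, offset: timedelta) -> str:
    """Shift an ISO-8601 timestamp by a fixed per-episode offset."""
    return (datetime.fromisoformat(ts) + offset).isoformat()


def obfuscate(entity: str, seed: int, prefix: str) -> str:
    """Map an entity name to a stable pseudonym derived from the seed."""
    digest = hashlib.sha256(f"{seed}:{entity}".encode()).hexdigest()[:8]
    return f"{prefix}-{digest}"


def build_episode_db(records, seed: int) -> sqlite3.Connection:
    """Load transformed records into a fresh in-memory SQLite database."""
    # Deterministic: the same seed always yields the same shift and pseudonyms.
    offset = timedelta(days=seed % 365)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logs (ts TEXT, event_id INTEGER, host TEXT, "
        "account TEXT, command_line TEXT)"
    )
    rows = [
        (
            shift_time(r["ts"], offset),
            r["event_id"],
            obfuscate(r["host"], seed, "host"),
            obfuscate(r["account"], seed, "acct"),
            r["command_line"],
        )
        for r in records
    ]
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical raw records, not taken from the OTRF corpus.
RAW = [
    {"ts": "2021-03-01T10:00:00", "event_id": 4688, "host": "HOST-A",
     "account": "alice", "command_line": "whoami /all"},
    {"ts": "2021-03-01T10:00:05", "event_id": 4624, "host": "HOST-A",
     "account": "alice", "command_line": ""},
]

conn = build_episode_db(RAW, seed=7)
for row in conn.execute("SELECT ts, host, account FROM logs ORDER BY ts"):
    print(row)
```

Deriving both the offset and the pseudonyms from a single seed keeps episodes reproducible while preventing an agent from relying on memorized hostnames or dates.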
Methodology of the Evaluation
The agent iteratively submits SQL queries to locate the timestamps of malicious events and then explicitly flags them. Flags are scored Capture the Flag (CTF) style against ground truth derived from Sigma rules. Five leading models were tested (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash) across 26 campaigns that together cover 105 of the 106 procedures.
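The query-then-flag loop can be illustrated with a small, self-contained sketch. Everything here is assumed for illustration: the `logs` schema, the sample rows, the scripted queries standing in for an LLM agent, and the rule that a flag counts as correct only when its timestamp exactly matches a ground-truth timestamp. The benchmark's actual scoring formula is not given in this summary.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory log database with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (ts TEXT, event_id INTEGER, command_line TEXT)")
    conn.executemany(
        "INSERT INTO logs VALUES (?, ?, ?)",
        [
            ("2021-01-01T09:00:00", 4688, "notepad.exe report.txt"),
            ("2021-01-01T09:05:00", 4688, "powershell -enc SQBFAFgA"),
            ("2021-01-01T09:06:00", 4688, "rundll32.exe comsvcs.dll, MiniDump 624 out.dmp full"),
            ("2021-01-01T09:10:00", 4624, ""),
        ],
    )
    return conn


# Hypothetical ground truth: timestamps a detection rule would mark malicious.
GROUND_TRUTH = {"2021-01-01T09:05:00", "2021-01-01T09:06:00"}


def score(flags: set, truth: set) -> dict:
    """Count correct and false flags and compute recall over the truth set."""
    correct = flags & truth
    return {
        "correct": len(correct),
        "false": len(flags - truth),
        "recall": len(correct) / len(truth) if truth else 0.0,
    }


conn = make_demo_db()

# Scripted stand-in for an agent: each query's result timestamps get flagged.
queries = [
    "SELECT ts FROM logs WHERE event_id = 4688 AND command_line LIKE '%-enc %'",
    "SELECT ts FROM logs WHERE command_line LIKE '%notepad%'",  # a wrong guess
]

flags = set()
for sql in queries:
    for (ts,) in conn.execute(sql):
        flags.add(ts)

result = score(flags, GROUND_TRUTH)
print(result)
```

In this toy run the agent finds one of the two malicious events and raises one false flag, so recall is 0.5. A real scoring scheme might also penalize false flags, which this sketch only counts.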
Findings and Insights
The results expose significant limits in current LLMs' threat-hunting ability:
- The best-performing model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average.
- No model was able to identify all flags in any given run.
- A passing score was defined as recall of at least 50% on every ATT&CK tactic, which is considered the minimum requirement for unsupervised SOC deployment.
No model passed this threshold. The leading model cleared the bar on only 5 out of 13 tactics, and the remaining models also failed to meet the criterion.
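The pass criterion can be expressed as a short check: compute recall per tactic and require every tactic to reach the threshold. The sketch below assumes a simple input format, a mapping from tactic name to a `(found, total)` pair, and uses made-up counts purely for illustration. These are not results from the paper, and the function names are hypothetical.

```python
def cleared_tactics(per_tactic: dict, threshold: float = 0.5) -> list:
    """Return the tactics whose recall (found / total) meets the threshold."""
    return sorted(
        name
        for name, (found, total) in per_tactic.items()
        if total > 0 and found / total >= threshold
    )


def passes(per_tactic: dict, threshold: float = 0.5) -> bool:
    """A model passes only if every tactic meets the recall threshold."""
    return bool(per_tactic) and len(cleared_tactics(per_tactic, threshold)) == len(per_tactic)


# Made-up counts of (malicious events found, malicious events present).
demo = {
    "execution": (6, 10),          # recall 0.60, clears the bar
    "persistence": (1, 8),         # recall 0.125, fails
    "credential-access": (5, 10),  # recall 0.50, clears the bar exactly
}

print(cleared_tactics(demo))
print(passes(demo))
```

Because the criterion is a conjunction over tactics, a single weak tactic is enough to fail, which is why a model can clear several tactics and still not pass.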
Conclusion
These findings suggest that although current LLMs excel at curated Q&A security benchmarks, they are poorly equipped for the open-ended, evidence-driven task of threat hunting. Further research and development are needed to close that gap.
