Cyber Defense Benchmark: Evaluating LLMs for Threat Hunting

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Summary: arXiv:2604.19533v1 (cross-listed)

Artificial intelligence (AI) is increasingly integrated into security operations (SecOps). A recent paper introduces the Cyber Defense Benchmark, a framework for evaluating how well large language model (LLM) agents perform threat hunting.

Overview of the Cyber Defense Benchmark

The Cyber Defense Benchmark measures how effectively LLM agents handle the core task of threat hunting in a security operations center (SOC). The setup is deliberately open-ended: given a database of raw Windows event logs and no guided questions or hints, the agent must correctly identify the timestamps of malicious events.

The benchmark covers 106 real attack procedures drawn from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics. The evaluation is structured as a Gymnasium reinforcement-learning environment in which each episode gives the agent an in-memory SQLite database containing between 75,000 and 135,000 log records. These records are generated by a deterministic campaign simulator that applies time-shifting and entity obfuscation to the raw data.
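To make the episode setup concrete, here is a minimal sketch of how a deterministic simulator might time-shift and obfuscate raw records before loading them into an in-memory SQLite database. The table schema, the `obfuscate` and `build_episode_db` helpers, the seed string, and the sample events are all hypothetical; the benchmark's actual simulator is not described in this summary.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta


def obfuscate(name: str, seed: str) -> str:
    """Map an entity name to a stable pseudonym (hypothetical scheme).

    Hashing the seed together with the name keeps the mapping
    deterministic: the same seed always yields the same pseudonym.
    """
    digest = hashlib.sha256(f"{seed}:{name}".encode()).hexdigest()
    return f"host-{digest[:8]}"


def build_episode_db(raw_events, seed: str, shift: timedelta) -> sqlite3.Connection:
    """Load time-shifted, obfuscated events into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (ts TEXT, host TEXT, event_id INTEGER, message TEXT)"
    )
    rows = [
        (
            (datetime.fromisoformat(ts) + shift).isoformat(),  # time-shifting
            obfuscate(host, seed),                              # entity obfuscation
            event_id,
            message,
        )
        for ts, host, event_id, message in raw_events
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Illustrative records only; not taken from the OTRF corpus.
RAW_EVENTS = [
    ("2020-09-01T10:00:00", "WORKSTATION5", 4624, "An account was successfully logged on"),
    ("2020-09-01T10:00:05", "WORKSTATION5", 4688, "A new process has been created"),
]

conn = build_episode_db(RAW_EVENTS, seed="campaign-01", shift=timedelta(days=400))
shifted = [row[0] for row in conn.execute("SELECT ts FROM events ORDER BY ts")]
print(shifted)
```

Because both transformations depend only on the seed and the shift, rebuilding the database with the same inputs reproduces the same episode, which is what makes ground-truth timestamps comparable across runs.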

Methodology of the Evaluation

The agent's task is to iteratively submit SQL queries, uncover the timestamps of malicious events, and explicitly flag them. Submissions are scored Capture The Flag (CTF) style against ground truth derived from Sigma rules. Five leading models were tested (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash) across 26 campaigns that cover 105 of the 106 procedures.
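The query-then-flag loop can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: `HuntEpisode` is a plain Python stand-in for the Gymnasium environment, and the schema, sample events, and scoring (counting correct and wrong flags) are hypothetical.

```python
import sqlite3


class HuntEpisode:
    """Toy episode: the agent can run SQL or submit a timestamp as a flag."""

    def __init__(self, conn, malicious_timestamps):
        self.conn = conn
        self.truth = set(malicious_timestamps)  # ground truth (from Sigma rules in the paper)
        self.found = set()
        self.wrong = 0

    def query(self, sql):
        """Run an agent-supplied SQL query and return all rows."""
        return self.conn.execute(sql).fetchall()

    def submit_flag(self, ts):
        """Score one flag: True if the timestamp is a malicious event."""
        if ts in self.truth:
            self.found.add(ts)
            return True
        self.wrong += 1
        return False

    def recall(self):
        """Fraction of malicious events flagged so far."""
        return len(self.found) / len(self.truth) if self.truth else 0.0


# Illustrative process-creation events; two of the four are malicious.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, event_id INTEGER, command_line TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2021-10-06T10:00:00", 4688, "notepad.exe report.txt"),
        ("2021-10-06T10:02:13", 4688, "powershell.exe -EncodedCommand SQBFAFgA"),
        ("2021-10-06T10:05:40", 4688, "rundll32.exe comsvcs.dll, MiniDump 624 out.dmp full"),
        ("2021-10-06T10:07:02", 4688, "explorer.exe"),
    ],
)
episode = HuntEpisode(conn, ["2021-10-06T10:02:13", "2021-10-06T10:05:40"])

# A scripted "agent": one hunting query, then a flag for every hit.
rows = episode.query(
    "SELECT ts FROM events WHERE command_line LIKE '%-EncodedCommand%'"
)
for (ts,) in rows:
    episode.submit_flag(ts)

episode.submit_flag("2021-10-06T10:07:02")  # a wrong guess, counted separately
print(episode.recall(), episode.wrong)
```

The scripted agent finds the encoded PowerShell command but misses the credential dump, which mirrors the difficulty the benchmark targets: with no hints, the agent must decide for itself which patterns are worth querying.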

Findings and Insights

The results show that current LLMs fall well short at threat hunting:

The best-performing model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average.
No model identified all flags in any run.
A passing score was defined as recall of at least 50% on every ATT&CK tactic, which is considered the minimum requirement for unsupervised SOC deployment.

No model passed this threshold. The leading model cleared the bar on only 5 out of 13 tactics, while the remaining models failed to meet the criteria on multiple fronts.
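The passing criterion can be expressed as a per-tactic recall check. The sketch below assumes each ground-truth event carries a single tactic label; the function names and sample data are hypothetical and chosen only to illustrate the arithmetic.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.5  # recall of at least 50% required on every tactic


def recall_by_tactic(ground_truth, found):
    """Compute recall per ATT&CK tactic.

    ground_truth maps each malicious timestamp to its tactic;
    found is the set of timestamps the agent flagged correctly.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ts, tactic in ground_truth.items():
        totals[tactic] += 1
        if ts in found:
            hits[tactic] += 1
    return {tactic: hits[tactic] / totals[tactic] for tactic in totals}


def passes(per_tactic):
    """A run passes only if every tactic meets the threshold."""
    return all(recall >= PASS_THRESHOLD for recall in per_tactic.values())


# Illustrative ground truth: four malicious events across two tactics.
ground_truth = {
    "2021-10-06T10:02:13": "execution",
    "2021-10-06T10:03:50": "execution",
    "2021-10-06T10:05:40": "credential-access",
    "2021-10-06T10:06:10": "credential-access",
}
found = {"2021-10-06T10:02:13"}

per_tactic = recall_by_tactic(ground_truth, found)
print(per_tactic, passes(per_tactic))
```

Requiring the threshold on every tactic, rather than on overall recall, means a model cannot pass by doing well on a few easy tactics while missing others entirely.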

Conclusion

These findings suggest that while current LLMs excel at curated Q&A security benchmarks, they are poorly equipped for the open-ended, evidence-driven task of threat hunting. Further research and development is needed before such agents can be relied on for this work.
