Cyber Defense Benchmark: Evaluating LLMs for Threat Hunting


Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Source: arXiv:2604.19533v1

Artificial intelligence (AI) is increasingly integrated into security operations (SecOps). In a recent paper, researchers introduce the Cyber Defense Benchmark, a framework for evaluating how well large language model (LLM) agents perform threat hunting.

Overview of the Cyber Defense Benchmark

The Cyber Defense Benchmark is designed to measure how effectively LLM agents can tackle the core task of threat hunting within a security operations center (SOC). The benchmark presents a unique challenge: given a database of raw Windows event logs devoid of guided questions or hints, the agents must accurately identify the timestamps of malicious events.

The benchmark comprises 106 real attack procedures sourced from the OTRF Security-Datasets corpus, covering 86 MITRE ATT&CK sub-techniques across 12 tactics. The evaluation is structured as a Gymnasium reinforcement-learning environment, where each episode provides the agent with an in-memory SQLite database containing between 75,000 and 135,000 log records. These log records are generated by a deterministic campaign simulator that applies time-shifting and entity-obfuscation techniques to the raw data.
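To make the setup concrete, here is a minimal sketch of how a deterministic simulator might time-shift and obfuscate records before loading them into an in-memory SQLite database. The table schema, the column names, the `obfuscate` and `build_episode_db` helpers, the seeded-hash pseudonym scheme, and the sample records are all assumptions for illustration; the benchmark's actual implementation is not described here.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta


def obfuscate(name: str, seed: int) -> str:
    """Map an entity name to a stable pseudonym derived from the seed (hypothetical scheme)."""
    digest = hashlib.sha256(f"{seed}:{name}".encode()).hexdigest()
    return f"host-{digest[:8]}"


def build_episode_db(records, seed: int, shift: timedelta) -> sqlite3.Connection:
    """Load time-shifted, obfuscated records into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (ts TEXT, host TEXT, event_id INTEGER, message TEXT)"
    )
    rows = [
        (
            (datetime.fromisoformat(ts) + shift).isoformat(),  # time-shifting
            obfuscate(host, seed),                             # entity obfuscation
            event_id,
            message,
        )
        for ts, host, event_id, message in records
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Illustrative records only; not taken from the OTRF corpus.
raw = [
    ("2020-09-01T10:00:00", "WORKSTATION5", 4688, "process created"),
    ("2020-09-01T10:00:05", "WORKSTATION5", 4624, "logon"),
]
db = build_episode_db(raw, seed=7, shift=timedelta(days=365))
shifted = [row[0] for row in db.execute("SELECT ts FROM events ORDER BY ts")]
hosts = {row[0] for row in db.execute("SELECT host FROM events")}
```

Because both the shift and the pseudonyms depend only on the inputs and the seed, the same seed always yields the same database, which is one plausible reading of "deterministic" in this context.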

Methodology of the Evaluation

The agent’s task is to iteratively submit SQL queries to uncover timestamps of malicious events and explicitly flag them. The agents are evaluated using a Capture The Flag (CTF) style scoring system against ground truth derived from Sigma rules. In the evaluation, five leading models were tested: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash, across 26 campaigns that encompass 105 of the 106 procedures.
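The query-and-flag loop can be pictured with a small sketch. The `HuntEpisode` class, its method names, the table schema, and the sample rows are hypothetical, and the score computed here is plain recall over ground-truth timestamps; the benchmark's real scoring rules (for example, how incorrect flags are handled) are not specified in this summary.

```python
import sqlite3


class HuntEpisode:
    """Toy episode: the agent runs SQL queries and flags timestamps (hypothetical interface)."""

    def __init__(self, conn: sqlite3.Connection, malicious_ts: set):
        self.conn = conn
        self.malicious_ts = malicious_ts  # ground truth, hidden from the agent
        self.flagged = set()

    def query(self, sql: str) -> list:
        """Run a query against the log database and return its rows."""
        return self.conn.execute(sql).fetchall()

    def flag(self, ts: str) -> bool:
        """Record a flag; return True if it matches a ground-truth timestamp."""
        self.flagged.add(ts)
        return ts in self.malicious_ts

    def recall(self) -> float:
        """Fraction of ground-truth malicious events that were flagged."""
        if not self.malicious_ts:
            return 0.0
        return len(self.flagged & self.malicious_ts) / len(self.malicious_ts)


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, event_id INTEGER, message TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2021-09-01T10:00:00", 4688, "powershell.exe -enc <payload>"),
        ("2021-09-01T10:00:05", 4624, "logon"),
        ("2021-09-01T10:01:00", 4688, "rundll32.exe comsvcs.dll"),
    ],
)
episode = HuntEpisode(
    conn, malicious_ts={"2021-09-01T10:00:00", "2021-09-01T10:01:00"}
)

# The "agent" here is a single hard-coded query followed by flags.
rows = episode.query("SELECT ts FROM events WHERE message LIKE 'powershell%'")
for (ts,) in rows:
    episode.flag(ts)
score = episode.recall()
```

In this toy run the single query surfaces one of the two malicious events, so recall is 0.5. A real agent would issue many queries and decide for itself what to flag.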

Findings and Insights

The results of the evaluation revealed concerning insights into the capabilities of current LLMs in the context of threat hunting:

  • The best-performing model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average.
  • No model was able to identify all flags in any given run.
  • A passing score was defined as achieving a recall of 50% or more on every ATT&CK tactic, which is considered the minimum requirement for unsupervised SOC deployment.

Despite the advances in AI, no model passed this threshold. The leading model cleared the bar on only 5 out of 13 tactics, while the remaining models failed to meet the criteria on multiple fronts.
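The passing criterion described above (recall of at least 50% on every tactic) can be expressed in a few lines. The function names and the recall values below are made up for illustration and are not results from the paper.

```python
def tactics_cleared(recall_by_tactic: dict, threshold: float = 0.5) -> list:
    """Return the tactics whose recall meets or exceeds the threshold."""
    return sorted(t for t, r in recall_by_tactic.items() if r >= threshold)


def passes(recall_by_tactic: dict, threshold: float = 0.5) -> bool:
    """A model passes only if every tactic meets the threshold."""
    return bool(recall_by_tactic) and all(
        r >= threshold for r in recall_by_tactic.values()
    )


# Hypothetical per-tactic recall for one model.
example = {
    "execution": 0.62,
    "persistence": 0.50,
    "credential-access": 0.10,
    "lateral-movement": 0.00,
}
cleared = tactics_cleared(example)
result = passes(example)
```

With these values two tactics clear the bar and two do not, so the model fails overall. The criterion is deliberately strict: a single weak tactic is enough to fail.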

Conclusion

These findings suggest that while current LLMs excel at curated Q&A security benchmarks, they are inadequately equipped for the open-ended, evidence-driven task of threat hunting. This underscores the need for further research and development to enhance the capabilities of AI in the realm of cybersecurity.



Lazarus Omolua (https://richlyai.com/blog)
