SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
Summary: arXiv:2604.12040v1
Autonomous security incident response agents need to investigate and respond to incidents effectively. SIR-Bench is a new benchmark that evaluates these agents on their investigation depth and forensic capabilities.
Introduction to SIR-Bench
SIR-Bench comprises 794 test cases designed to assess autonomous security incident response agents. Rather than rewarding alert generation alone, it emphasizes genuine forensic investigation, so that agents that merely parrot alerts can be distinguished from those that carry out meaningful investigations.
Framework Development: Once Upon A Threat (OUAT)
To construct SIR-Bench, the authors developed a framework called Once Upon A Threat (OUAT). OUAT replays real incident patterns in controlled cloud environments, producing authentic telemetry with measurable investigation outcomes.
Evaluation Methodology
SIR-Bench evaluates incident response agents with three complementary metrics:
- Triage Accuracy (M1): Measures the accuracy of the agent’s triage decisions.
- Novel Finding Discovery (M2): Assesses the agent’s ability to uncover new evidence through active investigation.
- Tool Usage Appropriateness (M3): Evaluates whether the tools used in the investigation are appropriate for the context.
These metrics are scored by an adversarial LLM-as-Judge that inverts the burden of proof: an investigation is credited only when it is supported by concrete forensic evidence.
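The inverted burden of proof can be illustrated with a small sketch. This is not the benchmark's judge, which is an LLM; it is a rule-based stand-in that only shows the "no evidence, no credit" default. The `Finding` structure, its field names, and the `credited_findings` function are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One claim made by the agent (hypothetical structure, not from the paper)."""
    claim: str
    evidence: list = field(default_factory=list)  # e.g. log lines or query results
    restates_alert: bool = False  # True if the claim only repeats the original alert


def credited_findings(findings):
    """Return findings that are novel and backed by at least one piece of evidence.

    The default is "no credit": a finding has to prove itself to be counted.
    """
    return [f for f in findings if f.evidence and not f.restates_alert]


report = [
    Finding("Alert fired on unusual API call", ["alert payload"], restates_alert=True),
    Finding("Attacker created a new access key", ["audit log entry"]),
    Finding("Data was probably exfiltrated"),  # no evidence, so no credit
]
credited = credited_findings(report)
print([f.claim for f in credited])
```

Under these assumptions, only the second finding is credited: the first merely repeats the alert, and the third is an unsupported assertion.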
Results and Findings
One security incident response agent was evaluated against SIR-Bench, with the following results:
- True Positive Detection Rate: 97.1%
- False Positive Rejection Rate: 73.4%
- Novel Key Findings: An average of 5.67 novel findings per case.
These results establish a baseline against which future investigation agents can be measured, and they underline the importance of deeper forensic analysis beyond alerting.
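As a rough illustration of how figures like these could be aggregated from per-case outcomes, here is a minimal sketch. It assumes that the true positive detection rate is the share of genuine incidents the agent escalates, and that the false positive rejection rate is the share of benign cases the agent dismisses. The source does not spell out these definitions, so both are assumptions, as are the function names and the toy data.

```python
def triage_rates(cases):
    """Compute (detection_rate, rejection_rate) from per-case outcomes.

    cases: list of (is_true_incident, agent_escalated) boolean pairs.
    These definitions are assumed for illustration, not taken from the paper.
    """
    tp = sum(1 for truth, escalated in cases if truth and escalated)
    fn = sum(1 for truth, escalated in cases if truth and not escalated)
    tn = sum(1 for truth, escalated in cases if not truth and not escalated)
    fp = sum(1 for truth, escalated in cases if not truth and escalated)
    detection = tp / (tp + fn) if (tp + fn) else None
    rejection = tn / (tn + fp) if (tn + fp) else None
    return detection, rejection


def mean_novel_findings(counts):
    """Average number of credited novel findings per case."""
    return sum(counts) / len(counts) if counts else None


# Toy data: three genuine incidents and two benign cases (made up for this example).
toy_cases = [(True, True), (True, True), (True, False), (False, False), (False, True)]
detection, rejection = triage_rates(toy_cases)
avg = mean_novel_findings([4, 6, 8])
print(detection, rejection, avg)
```

Both rates return `None` when a class is empty, so a benchmark split with no benign cases does not cause a division by zero.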
Conclusion
SIR-Bench advances the evaluation of autonomous security incident response agents by measuring depth of investigation rather than superficial alert generation. Benchmarks of this kind can guide the development and deployment of more capable incident response agents.
