Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
arXiv:2604.19354v1
Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments.
Introduction
Large Language Model (LLM) agents are being explored as a way to automate cybersecurity tasks, but how well they perform in realistic offensive environments remains an open research question. DeepRed aims to fill this gap by providing an open-source benchmark for evaluating LLM agents on Capture The Flag (CTF) challenges.
DeepRed Benchmark
DeepRed places an agent in a Kali Linux attacker environment equipped with terminal tools and optional web search. The agent is connected to a target challenge over a private network, so its capabilities can be evaluated under controlled conditions. DeepRed also records full execution traces, which enables detailed analysis of agent behavior.
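To make the idea of a full execution trace concrete, here is a minimal sketch of how such a trace could be represented and serialized. It is not DeepRed's actual format: the class names `TraceEvent` and `ExecutionTrace`, the field names, the `kind` labels, and the sample actions are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class TraceEvent:
    step: int      # 1-based position of the action within the run
    kind: str      # hypothetical label, e.g. "command" or "web_search"
    action: str    # what the agent asked for
    result: str    # what the environment returned


@dataclass
class ExecutionTrace:
    model: str
    challenge: str
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, kind: str, action: str, result: str) -> TraceEvent:
        """Append one agent action and its result, numbering steps in order."""
        event = TraceEvent(len(self.events) + 1, kind, action, result)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """Serialize the events as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


# Made-up run: the names and actions below are placeholders, not real data.
trace = ExecutionTrace(model="example-model", challenge="example-challenge")
trace.record("command", "scan target services", "port 80 open")
trace.record("web_search", "default credentials for example-app", "one result")
print(trace.to_jsonl())
```

Storing one event per line keeps the log append-only and easy to feed into a later labeling stage, which is one plausible reason to prefer it over a single nested document.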
Partial-Credit Scoring Method
To move beyond binary success-or-failure outcomes, DeepRed introduces a partial-credit scoring method. It is based on challenge-specific checkpoints derived from public writeups. An automated summarise-then-judge labeling pipeline then determines, from the logs generated during each run, which checkpoints the agent completed.
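The scoring idea can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's implementation: the checkpoint names, the `KEYWORDS` table, and the log lines are invented, the summariser and judge are simple string functions standing in for whatever DeepRed actually uses, and equal weighting of checkpoints is assumed because the text above does not say how they are weighted.

```python
from typing import Callable, Dict, List

# Hypothetical checkpoints for one challenge, each paired with a phrase
# that the toy judge looks for. Real checkpoints would come from writeups.
KEYWORDS: Dict[str, str] = {
    "service scan performed": "service scan",
    "hidden directory found": "/admin",
    "root shell obtained": "root shell",
}


def toy_summarise(log: str) -> str:
    """Stand-in summariser: drop blank lines and lowercase the rest."""
    return "\n".join(
        line.strip().lower() for line in log.splitlines() if line.strip()
    )


def toy_judge(summary: str, checkpoint: str) -> bool:
    """Stand-in judge: a checkpoint counts as done if its phrase appears."""
    return KEYWORDS[checkpoint] in summary


def label_checkpoints(
    log: str,
    checkpoints: List[str],
    summarise: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Summarise the log once, then judge every checkpoint against it."""
    summary = summarise(log)
    return {cp: judge(summary, cp) for cp in checkpoints}


def partial_credit(labels: Dict[str, bool]) -> float:
    """Fraction of checkpoints completed (equal weights are an assumption)."""
    if not labels:
        raise ValueError("a challenge needs at least one checkpoint")
    return sum(labels.values()) / len(labels)


# Made-up log for illustration only.
log = """
[step 1] Service scan: port 80 open
[step 2] Directory search: found /admin

[step 3] Login attempt failed
"""
labels = label_checkpoints(log, list(KEYWORDS), toy_summarise, toy_judge)
score = partial_credit(labels)
print(labels, round(score, 2))
```

In this toy run the agent reaches two of three checkpoints, so it earns partial credit rather than a flat failure, which is the distinction a binary flag-or-no-flag metric would miss.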
Benchmarking Results
Using the DeepRed framework, researchers benchmarked ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The evaluation yielded the following findings:
- The highest-performing model achieved an average checkpoint completion rate of only 35%.
- Performance varied significantly across challenge types, with agents performing best on common ones.
- Challenges requiring non-standard discovery and longer-horizon adaptation posed significant difficulties for the agents.
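The aggregate figures above (a per-model average completion rate and variation across challenge types) could be computed from per-run scores along the following lines. This is a sketch only: the function names, model names, category names, and all numbers are placeholders chosen for illustration and are not results from the benchmark.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Each result is (model, challenge category, checkpoint completion in [0, 1]).
Result = Tuple[str, str, float]


def average_completion(results: List[Result]) -> Dict[str, float]:
    """Mean checkpoint completion rate per model across all challenges."""
    by_model = defaultdict(list)
    for model, _category, rate in results:
        by_model[model].append(rate)
    return {model: mean(rates) for model, rates in by_model.items()}


def completion_by_category(results: List[Result], model: str) -> Dict[str, float]:
    """Mean completion rate per challenge category for a single model."""
    by_category = defaultdict(list)
    for m, category, rate in results:
        if m == model:
            by_category[category].append(rate)
    return {cat: mean(rates) for cat, rates in by_category.items()}


# Placeholder data: hypothetical models, categories, and scores.
results = [
    ("model-a", "web", 0.5),
    ("model-a", "web", 0.75),
    ("model-a", "binary", 0.0),
    ("model-a", "crypto", 0.25),
    ("model-b", "web", 0.25),
    ("model-b", "web", 0.25),
    ("model-b", "binary", 0.0),
    ("model-b", "crypto", 0.0),
]

averages = average_completion(results)
best_model = max(averages, key=averages.get)
print(best_model, averages[best_model])
print(completion_by_category(results, best_model))
```

Breaking the average down by category is what exposes the pattern described in the findings: a single headline number can hide the fact that a model does reasonably on familiar challenge types and scores near zero on others.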
Conclusion
While LLM agents show promise in cybersecurity, the DeepRed results indicate that current models still face considerable limitations. Even the best-performing model completed only 35% of checkpoints on average, and agents struggled most on challenges requiring non-standard discovery and longer-horizon adaptation. Further research and improvements to LLM capabilities will be needed before these agents perform well in realistic offensive cybersecurity tasks.
Future Directions
DeepRed provides a foundation for evaluating LLM agents on CTF challenges. Future work may focus on improving the models themselves, broadening the range of challenges they can handle, and refining the evaluation techniques to give deeper insight into agent capabilities. As the use of AI in cybersecurity evolves, evaluation methods will need to adapt alongside it.
