CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments
Large Language Models (LLMs) are drawing growing attention for their potential applications in cybersecurity. Most existing evaluation frameworks, however, focus on Information Technology (IT) environments and overlook the constraints and specialized protocols of Operational Technology (OT) environments. This gap makes it difficult to assess how well LLMs perform in domains such as digital substations.
CritBench is a framework designed to close this gap. It evaluates the cybersecurity capabilities of LLM agents operating within IEC 61850 Digital Substation environments, taking into account the specialized requirements of OT systems.
Evaluation Framework Overview
CritBench evaluates five state-of-the-art LLMs, including models from OpenAI’s GPT-5 suite and select open-weight models. The evaluation covers a corpus of 81 domain-specific tasks spanning operations that include:
- Static configuration analysis
- Network traffic reconnaissance
- Live virtual machine interaction
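As an illustration of what a static configuration analysis task can involve, the sketch below inventories the logical nodes declared in an SCL (Substation Configuration Language) file, the XML format IEC 61850 uses to describe substation devices. It is not taken from CritBench: the sample fragment, the vendor name, and the function name `list_logical_nodes` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by SCL documents (IEC 61850-6).
SCL_NS = {"scl": "http://www.iec.ch/61850/2003/SCL"}

# Hypothetical, heavily trimmed SCL fragment used only for illustration.
SAMPLE_SCL = """<SCL xmlns="http://www.iec.ch/61850/2003/SCL">
  <IED name="IED1" manufacturer="ExampleVendor">
    <AccessPoint name="AP1">
      <Server>
        <LDevice inst="LD0">
          <LN0 lnClass="LLN0" inst="" lnType="LLN0_T"/>
          <LN lnClass="XCBR" inst="1" lnType="XCBR_T"/>
          <LN prefix="Q0" lnClass="CSWI" inst="1" lnType="CSWI_T"/>
        </LDevice>
      </Server>
    </AccessPoint>
  </IED>
</SCL>"""


def list_logical_nodes(scl_text):
    """Map each IED name to its logical nodes, written as 'LDevice/LNname'."""
    root = ET.fromstring(scl_text)
    inventory = {}
    for ied in root.findall("scl:IED", SCL_NS):
        nodes = []
        for ldevice in ied.iterfind(".//scl:LDevice", SCL_NS):
            ld_inst = ldevice.get("inst", "")
            for child in ldevice:
                # Strip the '{namespace}' part that ElementTree adds to tags.
                tag = child.tag.rsplit("}", 1)[-1]
                if tag not in ("LN0", "LN"):
                    continue
                # A logical node name is prefix + class + instance number.
                ln_name = (
                    child.get("prefix", "")
                    + child.get("lnClass", "")
                    + child.get("inst", "")
                )
                nodes.append(f"{ld_inst}/{ln_name}")
        inventory[ied.get("name")] = nodes
    return inventory


if __name__ == "__main__":
    for ied_name, nodes in list_logical_nodes(SAMPLE_SCL).items():
        print(ied_name, nodes)
```

A task in this category would typically ask an agent to answer a question about such a file (for example, which IEDs expose a circuit breaker node, `XCBR`), which needs only a single parsing pass and no interaction with a running system.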
To let agents interact with industrial protocols, CritBench incorporates a domain-specific tool scaffold. The scaffold extends what LLM agents can do operationally, particularly in tasks where specialized tools are essential for execution.
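The source does not describe how the scaffold is built, so the following is only a minimal sketch of the general pattern, assuming the agent emits tool calls as JSON. Every name here (`REGISTRY`, `register`, `dispatch`, `read_data_attribute`) and the in-memory `FAKE_IED` stand-in are hypothetical; a real scaffold would back the handler with an actual protocol client rather than a dictionary.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical in-memory stand-in for a live IED data model.
FAKE_IED = {"LD0/XCBR1.Pos.stVal": "on"}


@dataclass
class Tool:
    name: str
    description: str          # text that would be shown to the model
    required: Tuple[str, ...]  # argument names the call must supply
    handler: Callable[..., Any]


REGISTRY: Dict[str, Tool] = {}


def register(name, description, required):
    """Decorator that exposes a Python function as a named agent tool."""
    def decorator(fn):
        REGISTRY[name] = Tool(name, description, tuple(required), fn)
        return fn
    return decorator


@register("read_data_attribute",
          "Read one data attribute by object reference.",
          ["reference"])
def read_data_attribute(reference):
    if reference not in FAKE_IED:
        raise LookupError(f"unknown reference: {reference}")
    return FAKE_IED[reference]


def dispatch(tool_call_json):
    """Run one model-issued tool call; always return a structured result."""
    try:
        call = json.loads(tool_call_json)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc}"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "tool call must be a JSON object"}
    tool = REGISTRY.get(call.get("tool"))
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {call.get('tool')}"}
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    missing = [name for name in tool.required if name not in args]
    if missing:
        return {"ok": False, "error": f"missing arguments: {missing}"}
    try:
        return {"ok": True, "result": tool.handler(**args)}
    except Exception as exc:  # report failures to the agent instead of crashing
        return {"ok": False, "error": str(exc)}


if __name__ == "__main__":
    print(dispatch('{"tool": "read_data_attribute", '
                   '"arguments": {"reference": "LD0/XCBR1.Pos.stVal"}}'))
```

Returning errors as data rather than raising them is a deliberate choice in this sketch: it gives the agent a chance to correct a malformed call on its next turn.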
Empirical Findings
The CritBench evaluations show a clear split in agent performance:
- Agents executed static structured-file analysis reliably.
- Agents handled single-tool network enumeration tasks effectively.
- Performance degraded significantly on dynamic tasks that required ongoing interaction and real-time adjustments.
Notably, although the LLMs displayed explicit, internalized knowledge of IEC 61850 standards terminology, they struggled with persistent sequential reasoning. This limitation hindered their ability to manipulate live systems without specialized tools. The domain-specific tool scaffold significantly alleviates this operational bottleneck, enabling more effective interaction within the digital substation environment.
Conclusion and Future Work
CritBench advances the evaluation of LLM cybersecurity capabilities in OT environments. By addressing the challenges specific to IEC 61850 Digital Substations, it provides an evaluation mechanism and a foundation for future research and development in this area. The code and evaluation scripts are publicly available on GitHub.
