GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers
Summary: arXiv:2604.02648v1
Abstract: The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs).
Introduction
In software engineering, ensuring the reliability and functionality of applications is paramount. As software systems grow in complexity, the task of identifying bugs becomes increasingly difficult. Traditional methods of bug discovery, while effective, often require substantial human oversight and intervention. With advancements in artificial intelligence, particularly large language models, there is significant interest in leveraging these technologies for autonomous bug detection.
The Game Benchmark for Quality Assurance (GBQA)
Recognizing the challenges faced by LLMs in this domain, researchers have developed the Game Benchmark for Quality Assurance (GBQA). This benchmark serves as a testing ground for evaluating the capabilities of LLMs in detecting software bugs, specifically within the context of game development.
Key Features of GBQA
- Comprehensive Dataset: GBQA encompasses 30 games and 124 human-verified bugs categorized across three difficulty levels.
- Multi-Agent System: The benchmark employs a multi-agent system that autonomously develops games while injecting bugs at scale, ensuring a robust testing environment.
- Human Oversight: Human experts are involved in the loop to validate the correctness of the injected bugs, thereby enhancing the benchmark’s reliability.
- Baseline Interactive Agent: A baseline interactive agent is provided, equipped with a multi-round ReAct loop and a memory mechanism, allowing for extensive exploration of game environments.
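The multi-round ReAct loop with memory mentioned in the last item can be illustrated with a small sketch. This is not the GBQA baseline agent: the names `ToyGame`, `Memory`, `scripted_policy`, and `react_loop` are hypothetical, the environment is a toy with one deliberately planted bug, and a scripted policy stands in for the LLM call. It only shows the general reason, act, observe, remember cycle.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGame:
    """Hypothetical stand-in for a game environment; not part of GBQA."""
    position: int = 0

    def step(self, action: str) -> str:
        # Planted bug for illustration: no lower-bound check when moving left.
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position -= 1
        return f"position={self.position}"


@dataclass
class Memory:
    """Bounded window of past (thought, action, observation) triples."""
    limit: int = 5
    entries: list = field(default_factory=list)

    def add(self, entry: tuple) -> None:
        self.entries.append(entry)
        self.entries = self.entries[-self.limit:]


def scripted_policy(memory: Memory, round_idx: int) -> tuple:
    """Placeholder for an LLM call; returns (thought, action).

    A real agent would build a prompt from memory.entries here.
    """
    if round_idx % 2 == 0:
        return ("Probe the left boundary.", "left")
    return ("Return toward the start.", "right")


def react_loop(env: ToyGame, rounds: int = 4) -> list:
    """Run a fixed number of reason/act/observe rounds and collect bug reports."""
    memory = Memory()
    bug_reports = []
    for i in range(rounds):
        thought, action = scripted_policy(memory, i)   # reason
        observation = env.step(action)                 # act
        memory.add((thought, action, observation))     # remember
        # Judge the observation: a negative position violates the assumed rule
        # that the player can never move left of the start.
        if int(observation.split("=")[1]) < 0:
            bug_reports.append(f"round {i}: {observation} after '{action}'")
    return bug_reports


if __name__ == "__main__":
    for report in react_loop(ToyGame()):
        print(report)
```

In this sketch the bounded memory is only written, never read, because the policy is scripted. In an LLM-driven agent, the memory contents would typically be fed back into each prompt so that later rounds can build on earlier observations.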
Experimental Findings
Extensive experiments on prominent LLMs indicate that autonomous bug discovery remains difficult. The highest-performing model, Claude-4.6-Opus in thinking mode, identified only 48.39% of the verified bugs, which corresponds to 60 of the 124 bugs if the rate is computed over the full benchmark. More than half of the verified bugs therefore went undetected even by the strongest model tested.
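The 48.39% figure can be related back to the 124 verified bugs with a quick calculation. The sketch below assumes the rate is a simple ratio of bugs found to total verified bugs across the whole benchmark. The summary does not state how the paper aggregates results, so the count of 60 is inferred rather than reported. The function names are illustrative.

```python
def detection_rate(found: int, total: int) -> float:
    """Percentage of verified bugs that were found."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * found / total


def implied_count(rate_percent: float, total: int) -> int:
    """Nearest whole number of bugs consistent with a reported percentage."""
    return round(rate_percent / 100.0 * total)


TOTAL_BUGS = 124        # from the benchmark description
REPORTED_RATE = 48.39   # best model's reported detection rate, in percent

found = implied_count(REPORTED_RATE, TOTAL_BUGS)  # inferred, not reported
print(found, f"{detection_rate(found, TOTAL_BUGS):.2f}%")
```

Under this assumption, 60 is the only whole number of bugs that rounds to 48.39% out of 124.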
Conclusions
The GBQA benchmark is a step forward in evaluating LLMs as potential quality assurance engineers. By providing a structured environment for testing, it sets a performance baseline and highlights the gaps that remain in autonomous software engineering. As LLM capabilities improve, more effective autonomous bug discovery could streamline the software development process and improve overall software quality.
Benchmarks like GBQA are essential to integrating AI into software engineering practice and to improving how bugs are detected and addressed in modern applications.
