Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Researchers have introduced Orak, a benchmark designed to train and evaluate Large Language Model (LLM) agents across a wide range of video games. The work is described in arXiv paper 2506.03610v3.
As LLM agents continue to redefine player interactions and character behaviors in video games, existing game benchmarks have proven inadequate. Many of them do not effectively assess the diverse capabilities of LLMs across different game genres, nor do they study the agentic modules that play a crucial role in complex gameplay. Furthermore, the absence of fine-tuning datasets has hindered the adaptation of pre-trained LLMs into effective gaming agents.
Introducing Orak
To address these shortcomings, Orak has been developed as a versatile framework that encompasses 12 popular video games representing all major genres. This benchmark not only evaluates the performance of LLM agents but also facilitates systematic studies on agentic modules in various gaming contexts.
Key Features of Orak
- Diverse Game Coverage: Orak includes games from a variety of genres, ensuring a broad assessment of LLM capabilities.
- Plug-and-Play Interface: Built on the Model Context Protocol (MCP), the interface allows researchers to easily integrate and evaluate different LLM agents.
- Fine-Tuning Datasets: Orak provides a fine-tuning dataset consisting of expert LLM gameplay trajectories, intended to help adapt general-purpose LLMs into gaming agents.
- Comprehensive Evaluation Framework: The benchmark features game leaderboards, LLM battle arenas, and ablation studies to analyze input modality, agentic strategies, and the effects of fine-tuning.
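To make the plug-and-play and leaderboard ideas above more concrete, here is a minimal sketch in Python. It is not Orak's actual API and does not use MCP: the names GameEnv, Agent, CountdownEnv, run_episode, and leaderboard are all hypothetical, and the "game" is a toy stand-in. The point is only to show how a shared environment interface lets different agents be swapped in and ranked by the same loop.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

# Hypothetical interface for illustration only; this is NOT Orak's real API.
class GameEnv(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...

# An agent maps a text observation to a text action.
# In a real setup this callable would wrap an LLM call.
Agent = Callable[[str], str]

@dataclass
class CountdownEnv:
    """Toy stand-in for a game: bring `value` to zero using 'dec' actions."""
    start: int = 3
    max_steps: int = 10
    value: int = 0
    steps: int = 0

    def reset(self) -> str:
        self.value = self.start
        self.steps = 0
        return f"value={self.value}"

    def step(self, action: str) -> tuple[str, float, bool]:
        self.steps += 1
        if action == "dec":
            self.value -= 1
        # The episode ends on success or when the step budget runs out.
        done = self.value == 0 or self.steps >= self.max_steps
        score = 1.0 if self.value == 0 else 0.0
        return f"value={self.value}", score, done

def run_episode(env: GameEnv, agent: Agent) -> float:
    """Play one episode and return the score reported at the final step."""
    obs = env.reset()
    score, done = 0.0, False
    while not done:
        obs, score, done = env.step(agent(obs))
    return score

def leaderboard(
    env_factory: Callable[[], GameEnv], agents: dict[str, Agent]
) -> list[tuple[str, float]]:
    """Run every agent on a fresh environment and rank by score, best first."""
    results = [(name, run_episode(env_factory(), agent)) for name, agent in agents.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)

# Two trivial rule-based agents standing in for LLM agents.
def always_dec(obs: str) -> str:
    return "dec"

def always_wait(obs: str) -> str:
    return "wait"

if __name__ == "__main__":
    print(leaderboard(CountdownEnv, {"wait": always_wait, "dec": always_dec}))
```

Because the evaluation loop depends only on the reset/step interface, a new agent or a new game can be added without changing run_episode or leaderboard, which is the property a plug-and-play benchmark interface is meant to provide.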
The Importance of Orak
Orak is positioned as a foundational tool for researchers and developers building LLM-based gaming agents. By providing a unified evaluation framework, it aims to establish a common standard for measuring how effective LLM agents are in games. This could improve the quality of character interactions and support the development of versatile gaming agents that adapt to varied gameplay scenarios.
Availability
For those interested in exploring Orak further, the code and datasets are publicly available on GitHub and Hugging Face.
As the gaming landscape evolves, Orak offers a promising pathway towards leveraging LLMs for creating more engaging and dynamic gaming experiences. Researchers and developers alike are encouraged to utilize this benchmark to enhance their understanding and application of AI in gaming, thereby contributing to the future of interactive entertainment.
