KellyBench: AI Benchmark for Long-Horizon Decision Making

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Recent advances in language models have saturated benchmarks built around procedural tasks with narrow objectives. Deployment, however, is shifting toward long-horizon, non-stationary environments with open-ended goals. To address this gap, researchers have introduced KellyBench, an environment designed to evaluate sequential decision-making in sports betting markets.

KellyBench places agents in a simulation of the 2023-24 English Premier League season. Their objective is to maximize long-term bankroll growth through betting. The environment provides detailed historical data, including advanced statistics, player lineups, and public betting odds, so that agents can build machine learning models, identify edges in public markets, and adapt their strategies as the season unfolds.
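The benchmark's name and its bankroll-growth objective point to the Kelly criterion, the classical staking rule that maximizes the expected logarithm of wealth. The article does not say which staking rule agents use or how the environment expects stakes to be submitted, so the following is only a minimal sketch of the textbook formula for a single binary bet. The function name `kelly_fraction` and its parameters are illustrative assumptions, not part of KellyBench.

```python
def kelly_fraction(win_prob: float, decimal_odds: float) -> float:
    """Textbook Kelly stake, as a fraction of bankroll, for one binary bet.

    win_prob:      the bettor's own estimate of the win probability
                   (hypothetical input, e.g. from a fitted model).
    decimal_odds:  bookmaker decimal odds, i.e. total payout per unit staked.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1.0")

    net_odds = decimal_odds - 1.0  # profit per unit staked if the bet wins
    # f* = (p * d - 1) / (d - 1), the stake that maximizes expected log wealth.
    fraction = (win_prob * decimal_odds - 1.0) / net_odds
    # A negative value means the bet has negative expected value: stake nothing.
    return max(fraction, 0.0)


if __name__ == "__main__":
    # Illustrative numbers only: a 55% estimate at even-money odds.
    print(kelly_fraction(0.55, 2.0))
```

In practice, bettors often stake only a fraction of this amount, because the formula assumes the probability estimate is exactly right and overestimating an edge leads to overbetting.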

Key Features of KellyBench

Sequential Simulation: Agents progress through a realistic simulation of the Premier League season, so every decision is made in a changing context.
Rich Historical Data: Access to advanced statistics, player lineups, and market odds lets agents make informed predictions and decisions.
Focus on Adaptability: The environment challenges agents to adjust their strategies as conditions and market dynamics change.
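To make "identifying edges in public markets" concrete, here is a minimal sketch of one common approach: convert decimal odds to implied probabilities, strip the bookmaker margin by proportional normalization, and compare the result with a model's own estimate. The article does not describe how KellyBench agents do this, and the function names, outcome labels, and odds below are illustrative assumptions.

```python
def implied_probabilities(decimal_odds: dict[str, float]) -> dict[str, float]:
    """Convert decimal odds for mutually exclusive outcomes to probabilities.

    Raw inverse odds sum to more than 1 because of the bookmaker margin;
    dividing by that sum (proportional normalization) is one simple way
    to remove it. Other de-margining methods exist.
    """
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {outcome: value / total for outcome, value in raw.items()}


def expected_value(model_prob: float, decimal_odds: float) -> float:
    """Expected profit per unit staked, given the bettor's own probability."""
    return model_prob * decimal_odds - 1.0


if __name__ == "__main__":
    # Made-up odds for a home/draw/away market, for illustration only.
    odds = {"home": 2.0, "draw": 3.5, "away": 3.8}
    market = implied_probabilities(odds)
    print(market)

    # Hypothetical model estimate for the home win.
    model_home = 0.55
    print(expected_value(model_home, odds["home"]))
```

A positive expected value only signals an edge if the model's probability is better calibrated than the market's, which is exactly what the benchmark is designed to test over a full season.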

Initial evaluations of several frontier models in KellyBench, each run across five seeds, found that the models lose money on average over the course of the season. Even the best-performing model posted an average return of -8%, and several models experienced ruin. These results highlight how difficult long-horizon decision-making is and how complex sports betting markets are.

Assessment of Model Performance

To gauge the sophistication of the models' strategies, the research team evaluated them with a human expert rubric. The assessment found that the strategies were generally unsophisticated compared with human baselines. Notably, Claude Opus 4.6 received a rubric score of 26.5%, which leaves considerable room for improvement in strategy development and execution.

Implications and Future Directions

KellyBench is a step forward in evaluating AI decision-making in complex, real-world scenarios. The findings underscore the need for further research on improving model performance in long-horizon tasks. As AI is integrated into more sectors, including finance and sports betting, the insights gained from KellyBench may inform the development of more robust and adaptive decision-making frameworks.

KellyBench is currently available as an open-access API endpoint, allowing researchers and developers to explore its capabilities and contribute to the ongoing discourse in the field. Interested parties can access the platform at https://openreward.ai/GeneralReasoning/KellyBench.

As the AI community continues to explore the intricacies of long-horizon sequential decision-making, KellyBench stands as a crucial tool for understanding and enhancing the capabilities of AI in dynamic environments.
