KellyBench: AI Benchmark for Long-Horizon Decision Making


KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Recent language models have saturated benchmarks built around procedural tasks with narrow objectives, yet these models are increasingly deployed in long-horizon, non-stationary environments with open-ended goals. To address this gap, researchers have introduced KellyBench, an environment designed to evaluate sequential decision-making in sports betting markets.

KellyBench places agents within a simulated environment representing the 2023-24 English Premier League season. The primary objective for these agents is to maximize their long-term bankroll growth through strategic betting. The environment is enriched with detailed historical data, including advanced statistics, player lineups, and public betting odds. This comprehensive dataset enables agents to build machine learning models, identify edges in public markets, and adapt their strategies as the environment evolves throughout the season.
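As an illustration of what "identifying an edge" and sizing a bet for long-term bankroll growth can look like, the sketch below applies the classic Kelly criterion, which the benchmark's name appears to reference. The article does not say how agents are expected to size bets, so this is an assumption-laden example rather than KellyBench's method: the function names are hypothetical, odds are assumed to be in decimal format, and the bookmaker margin is ignored.

```python
def implied_probability(decimal_odds: float) -> float:
    """Win probability implied by decimal odds, ignoring the bookmaker margin."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    return 1.0 / decimal_odds


def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of the bankroll to stake under the Kelly criterion.

    f* = (b * p - q) / b, where b = decimal_odds - 1 is the net payout
    per unit staked, p is the model's win probability, and q = 1 - p.
    Returns 0.0 when the model sees no positive edge over the market.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    b = decimal_odds - 1.0
    f = (b * p_win - (1.0 - p_win)) / b
    return max(0.0, f)  # never stake on a negative-edge bet


if __name__ == "__main__":
    # Hypothetical inputs: the market offers 2.10, our model estimates 52%.
    odds, p_model = 2.10, 0.52
    edge = p_model - implied_probability(odds)
    stake = 1000.0 * kelly_fraction(p_model, odds)
    print(f"edge over market: {edge:.3f}, stake on a 1000 bankroll: {stake:.2f}")
```

A bet is only placed when the model's probability exceeds the market-implied one; in practice bettors often stake a fraction of the Kelly amount because the model's probability estimates are themselves uncertain.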

Key Features of KellyBench

  • Sequential Simulation: Agents navigate a realistic representation of the Premier League season, providing a dynamic context for decision-making.
  • Rich Historical Data: Access to advanced statistics, player lineups, and public betting odds lets agents make informed predictions and decisions.
  • Focus on Adaptability: The environment challenges agents to adjust their strategies in response to changing conditions and market dynamics.

Initial evaluations of frontier models in KellyBench, each run across five seeds, found that the models lose money on average over the course of the season. Even the best-performing model averaged a return of -8%, and several models suffered ruin. These results highlight how difficult long-horizon decision-making in sports betting markets remains for current models.
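To make the reported quantities concrete, the sketch below shows one way an average return across seeds and a count of ruined runs could be computed. It is not KellyBench's scoring code: the function name, the ruin threshold (a final bankroll at or below zero), and the example bankrolls are all illustrative assumptions.

```python
from statistics import mean


def summarize_runs(initial_bankroll, final_bankrolls, ruin_threshold=0.0):
    """Summarize one model's results over several seeded runs.

    Return for a run is (final - initial) / initial. A run counts as
    ruined when its final bankroll is at or below `ruin_threshold`
    (an assumed definition, not taken from the benchmark).
    """
    if initial_bankroll <= 0:
        raise ValueError("initial_bankroll must be positive")
    if not final_bankrolls:
        raise ValueError("at least one run is required")
    returns = [(final - initial_bankroll) / initial_bankroll
               for final in final_bankrolls]
    ruined = sum(1 for final in final_bankrolls if final <= ruin_threshold)
    return {
        "mean_return": mean(returns),
        "ruined_runs": ruined,
        "num_runs": len(final_bankrolls),
    }


if __name__ == "__main__":
    # Made-up final bankrolls for five seeds, starting from 1000 each.
    summary = summarize_runs(1000.0, [1100.0, 900.0, 0.0, 1000.0, 500.0])
    print(summary)
```

Averaging over seeds matters because a single season is noisy: one profitable run does not show that a strategy has a real edge.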

Assessment of Model Performance

To gauge the sophistication of the models' strategies, the research team scored them against a rubric written by human experts. The assessment found that the strategies the models developed were generally unsophisticated compared to human baselines. Claude Opus 4.6, for example, received a rubric score of 26.5%, which leaves substantial room for improvement in strategy development and execution.

Implications and Future Directions

KellyBench offers a way to measure AI decision-making in a complex, real-world setting rather than in narrow procedural tasks. The findings point to a need for further research on model performance in long-horizon tasks. As AI is adopted in sectors such as finance and sports betting, insights from KellyBench may inform the development of more robust and adaptive decision-making frameworks.

KellyBench is currently available as an open-access API endpoint, allowing researchers and developers to explore its capabilities and contribute to the ongoing discourse in the field. Interested parties can access the platform at https://openreward.ai/GeneralReasoning/KellyBench.

As the AI community continues to study long-horizon sequential decision-making, KellyBench provides a concrete testbed for measuring and improving how models perform in dynamic environments.


Lazarus Omolua
