YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
As large language models (LLMs) take on increasingly complex tasks, a central question is whether they can maintain strategic coherence over long time horizons. The question matters most in settings that involve planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. To address it, researchers have introduced YC-Bench, a benchmark designed to evaluate these capabilities in AI agents.
Introduction to YC-Bench
YC-Bench tasks an AI agent with running a simulated startup over a one-year horizon spanning hundreds of turns. Throughout the simulation, the agent must manage employees, select task contracts, and stay profitable in a partially observable environment. Adversarial clients and a growing payroll complicate the environment further, so the consequences of poor decisions compound over time.
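To make the compounding effect of a growing payroll concrete, here is a toy sketch in Python. It is not YC-Bench's actual simulation: the function `simulate_funds`, its parameters, and the rule that payroll grows by a fixed fraction each turn are all assumptions made purely for illustration.

```python
def simulate_funds(starting_funds, payroll, payroll_growth, revenue, turns):
    """Toy model (hypothetical, not the benchmark's rules).

    Each turn the company earns `revenue` and pays `payroll`; payroll then
    grows by the fraction `payroll_growth`. The run stops early if funds
    drop below zero, which this sketch treats as bankruptcy.
    Returns the list of fund balances after each completed turn.
    """
    funds = starting_funds
    history = []
    for _ in range(turns):
        funds += revenue - payroll
        history.append(funds)
        if funds < 0:
            break  # bankrupt: no further turns are played
        payroll *= 1 + payroll_growth
    return history


# Revenue exactly covers a flat payroll, so funds never move.
flat = simulate_funds(200_000, 50_000, 0.0, 50_000, turns=12)

# The same revenue against a payroll growing 10% per turn loses money
# at an accelerating rate.
growing = simulate_funds(200_000, 50_000, 0.10, 50_000, turns=12)
```

In this toy model a payroll that starts out fully covered by revenue still drains the company once it grows, and the losses per turn get larger over time, which is the kind of compounding the benchmark description points to.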
Evaluation of AI Models
The study evaluated 12 AI models, both proprietary and open source, across three seeds each. Only three of the models consistently surpassed the starting capital of $200,000. Claude Opus 4.6 led with an average final fund of $1.27 million, closely followed by GLM-5 at $1.21 million, which it achieved at an 11 times lower inference cost.
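The per-model aggregation described above can be sketched as follows. This is a minimal illustration, assuming that "consistently surpassed" means ending above the starting capital on every seed; that reading, the function name `summarize`, and the placeholder model names and figures are assumptions, not data or definitions from the paper.

```python
from statistics import mean

STARTING_CAPITAL = 200_000  # starting funds stated in the text


def summarize(final_funds_by_model):
    """Map each model to (mean final funds, beat starting capital on every seed)."""
    summary = {}
    for model, funds in final_funds_by_model.items():
        beat_every_seed = all(f > STARTING_CAPITAL for f in funds)
        summary[model] = (mean(funds), beat_every_seed)
    return summary


# Placeholder numbers for illustration only; these are NOT results from the paper.
example_runs = {
    "model_a": [900_000, 1_100_000, 1_000_000],
    "model_b": [250_000, 150_000, 400_000],
    "model_c": [0, 0, 50_000],
}

summary = summarize(example_runs)
consistent = [model for model, (_, ok) in summary.items() if ok]
```

Note that `model_b` has a mean above the starting capital but falls below it on one seed, so under this reading it would not count as consistent. Reporting the mean alone would hide that difference.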
Key Findings and Insights
One of the most significant findings concerns scratchpad usage. The scratchpad is the only means of persisting information across context truncation, and its use proved to be the strongest predictor of success among the evaluated models. On the failure side, failing to detect adversarial clients was the primary failure mode, accounting for 47% of the bankruptcies the agents suffered during the simulation.
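The reason a scratchpad matters under context truncation can be shown with a small sketch. The class `AgentMemory`, its methods, and the prompt layout below are hypothetical and are not taken from the YC-Bench code; the sketch only assumes that old turns fall out of a bounded context window while scratchpad notes are always included.

```python
from collections import deque


class AgentMemory:
    """Hypothetical memory layout: a truncating context plus a persistent scratchpad."""

    def __init__(self, max_context_turns):
        # A bounded deque silently drops the oldest turn once it is full,
        # standing in for context truncation.
        self.context = deque(maxlen=max_context_turns)
        # Notes written here are never truncated.
        self.scratchpad = []

    def observe(self, message):
        self.context.append(message)

    def note(self, text):
        self.scratchpad.append(text)

    def build_prompt(self):
        lines = ["[scratchpad]", *self.scratchpad, "[recent turns]", *self.context]
        return "\n".join(lines)


memory = AgentMemory(max_context_turns=2)
memory.observe("turn 1: client X rejected finished work and refused to pay")
memory.note("avoid contracts from client X")
memory.observe("turn 2: hired one engineer")
memory.observe("turn 3: client X offers a new contract")
prompt = memory.build_prompt()
```

By turn 3 the original observation about client X has been truncated away, so an agent that had not written the note would have no record of the earlier bad experience when the new offer arrives.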
Challenges and Limitations
The analysis also shows that even frontier models still exhibit distinct failure modes, such as over-parallelization, which points to remaining capability gaps in long-horizon performance. The results therefore identify both where current models are strong and where they still need to improve.
Conclusion
YC-Bench offers a way to evaluate whether AI agents can plan over long horizons and maintain strategic coherence. The benchmark is open-source, reproducible, and configurable, which makes it a practical resource for researchers and practitioners working to improve these capabilities. As agents are applied to more complex, real-world scenarios, benchmarks of this kind can help track progress on long-horizon performance.
References
- arXiv:2604.01212v1
