AI Model Benchmarking: Challenges and Insights 2025

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

The way foundation and generative AI models are evaluated has changed significantly. Traditionally, the scientific community relied on peer-reviewed literature to assess and compare the capabilities of AI systems. That role has increasingly shifted to press releases and company blog posts, in which AI model builders emphasize results on selectively chosen benchmarks. These communications now dominate perceptions of the state of the art among researchers and the general public alike.

Despite the growing significance of these benchmarks, which benchmarks model builders choose to highlight, and what those choices imply, remains largely unexplored. To address this gap, a new study introduces the Benchmarking-Cultures-25 dataset, comprising 231 distinct benchmarks drawn from 139 model releases published in 2025 by 11 leading AI developers. The dataset is available as an open-source resource, along with an interactive tool for data exploration.

Key Findings from the Benchmarking-Cultures-25 Dataset

Fragmented Evaluation Landscape: The analysis reveals a fragmented evaluation landscape with limited comparability across models. Of the highlighted benchmarks, 63.2% are used by only a single model builder, and 38.5% appear in only one release.
Widespread Benchmark Use: Only a handful of benchmarks, such as GPQA Diamond, LiveCodeBench, and AIME 2025, are widely used across builders.
Diverging Competency Attribution: Different builders attribute different competencies to the same benchmark, reflecting their own narratives and marketing strategies. This inconsistency makes it harder to tell what a given benchmark actually measures.
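
As a rough illustration of how exclusivity shares like those above could be computed, here is a minimal Python sketch. It assumes a flat list of (builder, release, benchmark) records. That schema, the function name exclusivity_shares, and all sample rows are hypothetical placeholders, not the actual layout or contents of the Benchmarking-Cultures-25 dataset.

```python
from collections import defaultdict

# Hypothetical records: (builder, release, benchmark).
# These are placeholder names, not rows from Benchmarking-Cultures-25.
records = [
    ("BuilderA", "a-1", "bench_x"),
    ("BuilderA", "a-2", "bench_x"),
    ("BuilderA", "a-2", "bench_y"),
    ("BuilderB", "b-1", "bench_x"),
    ("BuilderB", "b-1", "bench_z"),
    ("BuilderC", "c-1", "bench_w"),
    ("BuilderC", "c-2", "bench_w"),
]


def exclusivity_shares(rows):
    """Return (share used by one builder, share used in one release)."""
    builders = defaultdict(set)  # benchmark -> builders that highlight it
    releases = defaultdict(set)  # benchmark -> (builder, release) pairs
    for builder, release, benchmark in rows:
        builders[benchmark].add(builder)
        releases[benchmark].add((builder, release))

    total = len(builders)
    single_builder = sum(1 for b in builders.values() if len(b) == 1)
    single_release = sum(1 for r in releases.values() if len(r) == 1)
    return single_builder / total, single_release / total


builder_share, release_share = exclusivity_shares(records)
print(f"{builder_share:.1%} of benchmarks are exclusive to one builder")
print(f"{release_share:.1%} of benchmarks appear in only one release")
```

A benchmark that appears in only one release is necessarily exclusive to one builder, so the second share can never exceed the first, which is consistent with the ordering of the two reported figures.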

Understanding the Taxonomy of Benchmarks

To navigate these conflicting representations, the study introduces a unified taxonomy that maps the divergent terminology onto a single framework based on what benchmark authors claim to measure. Notably, the category of “general knowledge application” emerges as the second most popular, despite being vaguely defined. The study's qualitative analysis indicates that many benchmarks prioritize demonstrating progress towards artificial general intelligence (AGI) over adhering to rigorous standards of construct validity.
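
The idea of mapping divergent labels onto one framework can be sketched in a few lines of Python. This is a simplified illustration, not the study's method. The lookup table, the normalize and divergent helpers, and the sample claims are all hypothetical, and the only category name taken from the text is "general knowledge application".

```python
# Hypothetical mapping from builder-specific labels to unified categories.
# Only "general knowledge application" is a category named in the article;
# the other category names are assumptions made for illustration.
UNIFIED = {
    "knowledge": "general knowledge application",
    "general intelligence": "general knowledge application",
    "reasoning": "reasoning",
    "math": "mathematics",
    "mathematical reasoning": "mathematics",
}


def normalize(label):
    """Map a builder's label to a unified category, or 'unmapped'."""
    return UNIFIED.get(label.strip().lower(), "unmapped")


# Hypothetical claims: (benchmark, builder, label the builder used).
claims = [
    ("bench_x", "BuilderA", "Reasoning"),
    ("bench_x", "BuilderB", "Math"),
    ("bench_y", "BuilderA", "Knowledge"),
    ("bench_y", "BuilderC", "general intelligence"),
]


def divergent(rows):
    """Return benchmarks whose labels fall into more than one category."""
    categories = {}
    for benchmark, _builder, label in rows:
        categories.setdefault(benchmark, set()).add(normalize(label))
    return {b: sorted(c) for b, c in categories.items() if len(c) > 1}


print(divergent(claims))
```

In this toy input, bench_y carries two different labels that collapse into the same unified category, while bench_x is flagged because its labels remain in different categories even after normalization.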

While the authors of these benchmarks often claim to measure knowledge or reasoning in a broad sense, the evaluations focus predominantly on STEM subjects, particularly mathematics. This mismatch raises concerns about the validity and reliability of these benchmarks as standardized measurement tools.

Conclusions and Implications for the Future

The findings from the Benchmarking-Cultures-25 dataset highlight a critical issue in the current landscape of AI model evaluation. Highlighted benchmarks seem to function less as rigorous measurement tools and more as flexible narrative devices that prioritize market positioning over scientific integrity. As the AI community moves forward, it will be essential to establish more standardized evaluation practices that genuinely reflect the capabilities of AI models and foster meaningful comparisons across different builders.

For further details, you can access the dataset at Benchmarking-Cultures-25 Dataset and explore the interactive tool at Benchmarking Cultures Tool.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
