AISafetyBenchExplorer: Unifying AI Safety Benchmarks & Metrics

AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

Source: arXiv:2604.12875v1

Abstract: The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity.

This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization.
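To make the multi-sheet idea concrete, here is a minimal sketch of how the four record types could be modelled and joined. The class names, field names, and sample rows are illustrative assumptions, not the catalogue's actual column schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


# All field names below are hypothetical; the real catalogue may differ.
@dataclass(frozen=True)
class BenchmarkRecord:
    benchmark_id: str
    name: str
    release_year: int
    languages: tuple[str, ...]
    complexity: str       # e.g. "medium"
    resource_type: str    # e.g. "evaluation-only"


@dataclass(frozen=True)
class MetricRecord:
    benchmark_id: str
    label: str            # e.g. "safety score"
    judge: str            # who or what decides each item
    aggregation: str      # how item-level results are combined


@dataclass(frozen=True)
class PaperRecord:
    benchmark_id: str
    venue: Optional[str]  # None when venue metadata is unknown


@dataclass(frozen=True)
class RepoActivityRecord:
    benchmark_id: str
    github_stale: bool
    hf_dataset_stale: bool


def labels_with_conflicting_definitions(
    metrics: list[MetricRecord],
) -> dict[str, set[tuple[str, str]]]:
    """Return metric labels that map to more than one (judge, aggregation) pair."""
    definitions: dict[str, set[tuple[str, str]]] = {}
    for m in metrics:
        definitions.setdefault(m.label, set()).add((m.judge, m.aggregation))
    return {label: defs for label, defs in definitions.items() if len(defs) > 1}


# Invented sample rows, used only to exercise the function.
sample_metrics = [
    MetricRecord("b1", "safety score", "LLM judge", "mean over prompts"),
    MetricRecord("b2", "safety score", "keyword match", "worst category"),
    MetricRecord("b3", "accuracy", "exact match", "mean over prompts"),
]
conflicts = labels_with_conflicting_definitions(sample_metrics)
print(sorted(conflicts))
```

Keeping metric definitions in their own record type, keyed by a benchmark identifier, is what allows a query like this one to ask how a label is used across benchmarks rather than only which benchmarks exist.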

Key Findings

The AISafetyBenchExplorer catalogue highlights several critical insights:

Complexity Distribution: Medium-complexity benchmarks dominate the current landscape, accounting for 94 of 195 entries.
Popularity Disparity: Only 7 benchmarks are classified as Popular, indicating that few benchmarks are widely adopted.
Language Concentration: Evaluation is heavily concentrated on English, with 165 of 195 benchmarks being English-only.
Resource Type Limitations: Most benchmarks (170 of 195) are evaluation-only resources, leaving only 25 that offer anything beyond evaluation.
Stale Resources: Many benchmarks are tied to stale GitHub repositories (137 of 195) or stale Hugging Face datasets (96 of 195).
Reliance on Preprints: Among benchmarks with known venue metadata, there is a heavy reliance on arXiv preprints, suggesting that many have not undergone peer review.
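The counts above are easier to compare as shares of the 195 catalogued benchmarks. The short sketch below does that arithmetic; the dictionary keys are shorthand labels chosen here, and only the counts come from the findings.

```python
TOTAL_BENCHMARKS = 195

# Counts as reported in the findings above; keys are shorthand labels.
counts = {
    "medium complexity": 94,
    "popular": 7,
    "English-only": 165,
    "evaluation-only": 170,
    "stale GitHub repository": 137,
    "stale Hugging Face dataset": 96,
}


def share(count: int, total: int = TOTAL_BENCHMARKS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share(count) for label, count in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]}/{TOTAL_BENCHMARKS} = {pct}%")
```

Read this way, roughly 85% of benchmarks are English-only and about 70% point to a stale GitHub repository, while fewer than 4% are classified as Popular.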

Metric Fragmentation

At the metric level, the AISafetyBenchExplorer catalogue demonstrates that commonly used labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often mask materially different judges, aggregation rules, and threat models. Scores that share a label are therefore not necessarily comparable, which undermines the reliability of cross-benchmark comparisons.
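A small illustration of why a shared label can hide different quantities: the sketch below computes three plausible "safety scores" from the same item-level judgments using different aggregation rules. The categories, judgments, and function names are invented for illustration and are not taken from any benchmark in the catalogue.

```python
# Hypothetical judgments: True means the response was judged safe.
judgments = {
    "category_a": [True] * 6 + [False] * 2,  # 6 of 8 judged safe
    "category_b": [True] * 1 + [False] * 1,  # 1 of 2 judged safe
}


def category_rates(data: dict[str, list[bool]]) -> dict[str, float]:
    """Fraction of safe responses within each category."""
    return {name: sum(items) / len(items) for name, items in data.items()}


def micro_average(data: dict[str, list[bool]]) -> float:
    """Pool all items, so large categories weigh more."""
    all_items = [item for items in data.values() for item in items]
    return sum(all_items) / len(all_items)


def macro_average(data: dict[str, list[bool]]) -> float:
    """Average the per-category rates, so every category weighs the same."""
    rates = category_rates(data)
    return sum(rates.values()) / len(rates)


def worst_category(data: dict[str, list[bool]]) -> float:
    """Report the lowest per-category rate."""
    return min(category_rates(data).values())


scores = {
    "micro": micro_average(judgments),
    "macro": macro_average(judgments),
    "worst": worst_category(judgments),
}
for rule, value in scores.items():
    print(f"safety score ({rule}): {value:.3f}")
```

With these invented judgments the three rules give 0.700, 0.625, and 0.500. Changing the judge that produces the True/False labels would shift the numbers further, which is why a score is hard to interpret without both its judge and its aggregation rule.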

Challenges in the Field

The paper argues that the main failure mode in AI safety evaluation is fragmentation rather than scarcity. Researchers now have access to numerous benchmark artifacts, but they often lack:

A shared measurement language.
A principled basis for benchmark selection.
Durable stewardship norms for post-publication maintenance.

Conclusion

AISafetyBenchExplorer addresses these gaps by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy. Together, these tools support more rigorous benchmark discovery, comparison, and meta-evaluation.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
