MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models
arXiv:2604.19809v1
Introduction
Recent advances in large language models (LLMs) have raised questions about their metacognitive abilities, that is, whether they can understand and evaluate their own performance. We introduce MIRROR, a benchmark that evaluates metacognitive calibration across multiple levels of self-awareness in LLMs.
Overview of MIRROR
MIRROR comprises eight experiments that span four distinct metacognitive levels. The benchmark tests whether LLMs can use self-knowledge to make better decisions. The evaluation covers:
- Assessment of 16 models from 8 different laboratories.
- Approximately 250,000 evaluation instances to ensure robust data.
- Five independent behavioral measurement channels to capture diverse performance metrics.
Key Findings
Our experiments revealed two phenomena with direct implications for deploying LLMs in agentic roles:
- Compositional Self-Prediction Failure: We observed that compositional self-prediction fails universally among the tested models. The Compositional Calibration Error ranged between 0.500 and 0.943 on the original 15-model Exp3-v1 set, and 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion. This indicates a consistent inability of models to accurately predict their performance on multi-domain tasks.
- Domain-Specific Self-Knowledge: The models showed above-chance but imperfect domain-specific self-knowledge, yet they systematically struggled to translate this partial awareness into effective action selection. External metacognitive control helped: it reduced the Confident Failure Rate from 0.600 to 0.143, a 76% reduction at temperature 0, with a mean reduction of 70% at temperature 0.7 across five models from four labs.
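The 76% figure follows directly from the two Confident Failure Rate values quoted above. As a quick check, the relative reduction can be recomputed as follows; the function name `relative_reduction` is ours, introduced only for illustration, and nothing here reflects how the benchmark itself computes the rate.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Values taken from the text: Confident Failure Rate at temperature 0.
cfr_without_control = 0.600
cfr_with_control = 0.143

reduction = relative_reduction(cfr_without_control, cfr_with_control)
print(f"{reduction:.1%}")  # prints 76.2%
```

Rounded to the nearest percent, this matches the reported 76% reduction.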
Implications for AI Development
Our results suggest that simply providing models with their own calibration scores does not significantly improve performance (p > 0.05). Architectural constraints are more effective than self-knowledge for improving decision-making in LLMs. This points to external metacognitive scaffolding, rather than improved self-awareness, as a path toward safer autonomous AI systems.
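To make the contrast between self-knowledge and an architectural constraint concrete, here is a minimal sketch of an external gate. It is not the mechanism used in MIRROR, which this summary does not describe. The names `Task`, `MEASURED_ACCURACY`, `self_gated`, and `externally_gated`, the domains, the accuracy values, and the 0.7 threshold are all hypothetical. The idea is only that the decision to act or defer is taken by a wrapper that consults externally measured accuracy, not by the model's own reported confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    domain: str
    model_confidence: float  # model's self-reported confidence in [0, 1]


# Hypothetical per-domain accuracy measured by an external evaluator,
# not reported by the model itself.
MEASURED_ACCURACY = {"arithmetic": 0.92, "law": 0.41}


def self_gated(task: Task, threshold: float = 0.7) -> str:
    """The model's own confidence decides whether to act."""
    return "act" if task.model_confidence >= threshold else "defer"


def externally_gated(task: Task, threshold: float = 0.7) -> str:
    """A wrapper decides from measured accuracy; unknown domains defer."""
    measured = MEASURED_ACCURACY.get(task.domain, 0.0)
    return "act" if measured >= threshold else "defer"


# A task where the model is confident but the measured accuracy is low.
overconfident = Task(domain="law", model_confidence=0.95)
print(self_gated(overconfident))        # act
print(externally_gated(overconfident))  # defer
```

In this toy setup the external gate blocks the confident but unreliable action regardless of what the model reports about itself, which is the kind of constraint the paragraph above contrasts with self-knowledge.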
Future Directions
We plan to release the code, data, and Croissant metadata publicly alongside the MIRROR benchmark. The release is intended to support further research on metacognitive calibration in AI and, ultimately, more reliable and capable autonomous systems.
