Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
arXiv:2604.12191v1
Current evaluations of large language models (LLMs) typically aggregate performance across diverse tasks into a single score. This aggregation obscures fine-grained variation in abilities, which limits both targeted model improvement and ability-guided model selection for specific tasks. To address this gap, the authors propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions.
The framework is applied to mathematics, where the authors construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. Fine-grained ability levels are estimated with multidimensional Item Response Theory (IRT), supported by an item-ability association matrix that links each item to the abilities it involves.
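To make the role of the item-ability association matrix concrete, here is a minimal sketch of a compensatory multidimensional IRT response model. The summary does not give the paper's exact parameterization, so the logistic form, the function name `mirt_probability`, the parameter names, and all numbers below are assumptions chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


def mirt_probability(theta, q_matrix, discrimination, difficulty):
    """Probability that each model answers each item correctly.

    Assumed (hypothetical) compensatory MIRT form:
        logit[i, j] = sum_k q[j, k] * a[j, k] * theta[i, k] - b[j]

    theta:          (n_models, n_abilities) ability estimates
    q_matrix:       (n_items, n_abilities) binary item-ability associations
    discrimination: (n_items, n_abilities) item loadings a[j, k]
    difficulty:     (n_items,) item difficulties b[j]
    """
    theta = np.asarray(theta, dtype=float)
    # Masking with the association matrix zeroes out abilities
    # that an item is not linked to.
    loadings = np.asarray(q_matrix, dtype=float) * np.asarray(discrimination, dtype=float)
    logits = theta @ loadings.T - np.asarray(difficulty, dtype=float)
    return sigmoid(logits)


# Toy example with made-up numbers: 2 models, 3 abilities, 2 items.
theta = np.array([[1.5, -0.5, 0.0],
                  [-0.5, 1.5, 0.0]])
q_matrix = np.array([[1, 0, 0],    # item 0 involves ability 0 only
                     [0, 1, 1]])   # item 1 involves abilities 1 and 2
discrimination = np.ones_like(q_matrix, dtype=float)
difficulty = np.array([0.0, 0.5])

probs = mirt_probability(theta, q_matrix, discrimination, difficulty)
print(np.round(probs, 3))
```

In this sketch, a model's predicted success on an item depends only on the abilities the matrix associates with that item, which is what allows per-ability estimates to be separated from an aggregate score.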
The framework demonstrates strong criterion validity by predicting performance on unseen items, that is, benchmark questions not used to estimate abilities. Evaluated on 41 models, it yields consistent ability estimates across benchmarks and accurate predictions for unseen items. The area under the curve (AUC) for these predictions ranges from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, significantly outperforming trivial baselines.
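For readers unfamiliar with the metric, the sketch below computes AUC for predicted correctness on held-out items using the pairwise (Mann-Whitney) definition. The function name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the paper.

```python
import numpy as np


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen correct response
    receives a higher predicted score than a randomly chosen incorrect
    one, with ties counted as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one correct and one incorrect response")
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Made-up held-out outcomes (1 = answered correctly) and predicted probabilities.
held_out_correct = [0, 0, 1, 1]
predicted_prob = [0.1, 0.4, 0.35, 0.8]
print(auc(held_out_correct, predicted_prob))

# A trivial baseline that predicts the same score for every item
# cannot rank items, so its AUC is 0.5 by this definition.
print(auc(held_out_correct, [0.5, 0.5, 0.5, 0.5]))
```

Because a constant predictor scores 0.5, the reported ranges of 0.77 to 0.89 indicate that the estimated abilities carry real information about which unseen items a model will answer correctly.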
Generalization Across Scientific Domains
The diagnostic framework also generalizes beyond mathematics. It has been applied to several other scientific domains, including:
- Physics: a 27-dimensional ability taxonomy.
- Chemistry: a 58-dimensional ability taxonomy.
- Computer Science: a 12-dimensional ability taxonomy.
This versatility underscores the framework’s robustness and its potential for broad application in educational and research settings.
Potential Applications
This principled framework for fine-grained ability assessment supports several practical applications, including:
- Targeted Training: Tailoring training programs to address specific weaknesses in model abilities.
- Ability-Guided Model Selection: Choosing models based on their strengths in particular tasks.
- Ability-Aware Benchmark Design: Creating benchmarks that accurately reflect the fine-grained abilities of models.
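As a rough illustration of the first two applications, the sketch below ranks models for a task by the abilities that task involves and lists a model's weakest abilities as candidates for targeted training. The scoring rule (a weighted average of ability estimates), the function names, the ability names, and all numbers are hypothetical; the summary does not describe how the authors would perform these steps.

```python
import numpy as np


def rank_models_for_task(ability_estimates, task_weights):
    """Rank models by a weighted average of their ability estimates.

    ability_estimates: dict mapping model name -> 1-D array of abilities
    task_weights:      non-negative importance of each ability for the task
    """
    weights = np.asarray(task_weights, dtype=float)
    if weights.sum() <= 0:
        raise ValueError("task_weights must contain at least one positive weight")
    weights = weights / weights.sum()
    scores = {
        name: float(np.dot(np.asarray(theta, dtype=float), weights))
        for name, theta in ability_estimates.items()
    }
    # Highest task-specific score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def weakest_abilities(theta, ability_names, k=2):
    """Return the names of the k lowest-estimated abilities for one model."""
    order = np.argsort(np.asarray(theta, dtype=float))[:k]
    return [ability_names[i] for i in order]


# Hypothetical ability dimensions and estimates for two made-up models.
ability_names = ["algebra", "geometry", "probability"]
ability_estimates = {
    "model_a": [1.2, -0.3, 0.1],
    "model_b": [0.2, 0.9, 0.4],
}

# A task that involves geometry and probability but not algebra.
ranking = rank_models_for_task(ability_estimates, task_weights=[0, 1, 1])
print(ranking)

# Candidate focus area for targeted training of model_a.
print(weakest_abilities(ability_estimates["model_a"], ability_names, k=1))
```

Here model_a would win on a single aggregate score dominated by algebra, yet model_b is the better choice for the geometry and probability task, which is the kind of distinction a single score hides.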
In conclusion, this cognitive diagnostic framework moves LLM evaluation beyond single aggregate scores. Its fine-grained ability estimates support a more detailed and actionable analysis of model capabilities, which can inform how models are developed and deployed across domains.
