Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
arXiv:2604.03742v1
Abstract: Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores.
Introduction
Evaluating large language models has become more important as they are deployed across more domains. The conventional approach, in which a judge assigns a score directly, often yields inconsistent and opaque judgments that fail to capture the nuances of model performance.
Methodology
To address these challenges, the authors adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation. AHP breaks an assessment down into explicit criteria that are weighed against one another through pairwise comparisons, which makes the evaluation process more transparent and consistent.
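As an illustration of the general AHP mechanics (not the authors' implementation), the sketch below derives criterion weights from a pairwise comparison matrix using the row geometric-mean approximation and then computes Saaty's consistency ratio. The three criteria and the matrix values are hypothetical, chosen only to make the example concrete.

```python
import math

# Saaty's random consistency index for matrix sizes 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Approximate the AHP priority vector with row geometric means."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Saaty's consistency ratio (CR); values below about 0.1 are usually accepted."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical criteria and judgments: matrix[i][j] states how much more
# important criterion i is than criterion j on Saaty's 1-9 scale.
criteria = ["correctness", "reasoning", "clarity"]
matrix = [
    [1.0, 2.0, 4.0],
    [1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 2, 1.0],
]

weights = ahp_weights(matrix)
cr = consistency_ratio(matrix, weights)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
print(f"consistency ratio: {cr:.3f}")
```

The consistency ratio is what makes AHP judgments auditable: a high value signals that the pairwise comparisons contradict each other.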
Furthermore, the proposed confidence-aware Fuzzy AHP (FAHP) extension accounts for epistemic uncertainty in the judge's comparisons. Each judgment is represented as a triangular fuzzy number modulated by an LLM-generated confidence score rather than as a single crisp value, which leads to more robust evaluations.
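The summary does not give the exact formula that links confidence to the fuzzy numbers, so the following is only a sketch under stated assumptions. It assumes a confidence score in [0, 1] that linearly narrows the spread of a triangular fuzzy number (l, m, u), and it derives weights with a simplified variant of Buckley's geometric-mean method. The function names and the max_spread parameter are hypothetical.

```python
import math


def confidence_tfn(value, confidence, max_spread=2.0):
    """Triangular fuzzy number (l, m, u) for a judgment value >= 1.

    Assumption: the spread shrinks linearly from max_spread (confidence 0)
    to zero (confidence 1). The paper's actual mapping may differ.
    """
    spread = max_spread * (1.0 - confidence)
    return (max(1.0, value - spread), value, min(9.0, value + spread))


def reciprocal(tfn):
    """Reciprocal of a triangular fuzzy number: bounds swap and invert."""
    l, m, u = tfn
    return (1.0 / u, 1.0 / m, 1.0 / l)


def build_fuzzy_matrix(n, judgments):
    """judgments maps (i, j) to (value, confidence), with i preferred over j."""
    one = (1.0, 1.0, 1.0)
    matrix = [[one] * n for _ in range(n)]
    for (i, j), (value, confidence) in judgments.items():
        tfn = confidence_tfn(value, confidence)
        matrix[i][j] = tfn
        matrix[j][i] = reciprocal(tfn)
    return matrix


def fuzzy_weights(fuzzy_matrix):
    """Simplified Buckley-style weights.

    Takes the geometric mean of each row separately for l, m and u,
    defuzzifies by the centroid (l + m + u) / 3, then normalises.
    """
    n = len(fuzzy_matrix)
    crisp = []
    for row in fuzzy_matrix:
        geo = [math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3)]
        crisp.append(sum(geo) / 3.0)
    total = sum(crisp)
    return [c / total for c in crisp]


# Hypothetical judgments over three criteria: (value, confidence).
judgments = {(0, 1): (2.0, 0.9), (0, 2): (4.0, 0.5), (1, 2): (2.0, 0.7)}
weights = fuzzy_weights(build_fuzzy_matrix(3, judgments))
print([round(w, 3) for w in weights])
```

With every confidence set to 1.0 the fuzzy numbers collapse to crisp values and the result matches ordinary geometric-mean AHP, which is a useful sanity check for any confidence mapping of this kind.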
Validation and Results
The framework was systematically validated on JudgeBench, a benchmark for evaluating LLM-based judges. Extensive experiments indicate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits. Notably, FAHP is more stable when comparisons are uncertain.
DualJudge Framework
Building on the insights gathered from these experiments, the authors propose DualJudge, a hybrid evaluation framework inspired by Dual-Process Theory. DualJudge combines holistic direct scores with structured AHP outputs through consistency-aware weighting, enabling a more nuanced evaluation process.
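The summary does not specify how the consistency-aware weighting is computed, so the sketch below shows one plausible form rather than the authors' method. It assumes both scores lie in [0, 1] and that trust in the AHP output decays as its consistency ratio grows. The function name and the cr_scale and max_ahp_weight parameters are hypothetical.

```python
def dual_judge_score(direct_score, ahp_score, consistency_ratio,
                     cr_scale=0.1, max_ahp_weight=0.5):
    """Blend a holistic direct score with a structured AHP score.

    Assumption: the AHP weight is max_ahp_weight when the pairwise
    comparisons are perfectly consistent (CR = 0) and is halved each
    time CR / cr_scale grows by one from zero, i.e. it follows
    max_ahp_weight / (1 + CR / cr_scale).
    """
    trust = 1.0 / (1.0 + max(consistency_ratio, 0.0) / cr_scale)
    alpha = max_ahp_weight * trust
    return alpha * ahp_score + (1.0 - alpha) * direct_score


# Hypothetical scores for one response under increasing inconsistency.
for cr in (0.0, 0.1, 0.3):
    fused = dual_judge_score(direct_score=0.8, ahp_score=0.4, consistency_ratio=cr)
    print(f"CR={cr:.1f} -> fused score {fused:.3f}")
```

The design choice illustrated here is that an inconsistent structured judgment should defer to the holistic score, while a consistent one is allowed to pull the final score toward the deliberative result.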
By fusing intuitive and deliberative evaluation paradigms, DualJudge achieves state-of-the-art performance in LLM assessment, which highlights the complementary strengths of the two approaches.
Conclusion
The results of this study underscore the importance of uncertainty-aware structured reasoning in the evaluation of large language models. By adopting a confidence-aware approach and integrating established methodologies like AHP, the proposed frameworks pave the way for more reliable assessments of LLM performance.
Resources
For readers who want to explore the implementation, the authors have made their code publicly available.
