Evaluating Large Language Models with Fuzzy AHP & DualJudge

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

Summary: arXiv:2604.03742v1

Abstract: Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores.

Introduction

Evaluating large language models (LLMs) matters more as they are deployed across more domains. The conventional approach, in which a judge assigns a single direct score, often yields inconsistent and opaque judgments, so the resulting numbers are hard to trust or interpret.

Methodology

To address this, the authors adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation. AHP breaks an overall judgment into explicit criteria and derives their relative weights from pairwise comparisons, which makes the evaluation more structured and lets its consistency be checked.
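To make the structured breakdown concrete, here is a minimal sketch of standard crisp AHP in plain Python. It is not the authors' implementation: the criteria names and the pairwise judgments are hypothetical, and the weights are computed with the row geometric-mean method, a common approximation to the principal-eigenvector method.

```python
import math

# Saaty's random consistency index (RI) for matrix sizes 1 to 9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Return normalized priority weights using row geometric means."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Return Saaty's consistency ratio (CR); values below 0.1 are usually accepted."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate lambda_max as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical criteria and judgments on Saaty's 1-9 scale:
# entry [i][j] says how much more important criterion i is than criterion j.
criteria = ["correctness", "reasoning", "clarity"]
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

weights = ahp_weights(pairwise)
cr = consistency_ratio(pairwise, weights)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
print(f"consistency ratio: {cr:.4f}")
```

The consistency ratio is what distinguishes AHP from direct scoring: a high value signals that the pairwise judgments contradict each other and that the derived weights should be treated with caution.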

The main contribution is a confidence-aware Fuzzy AHP (FAHP) extension that accounts for uncertainty in the judgments themselves. Each comparison is represented as a triangular fuzzy number whose shape is modulated by an LLM-generated confidence score, so FAHP models the epistemic uncertainty behind each judgment instead of treating every comparison as equally certain.
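The idea of confidence-modulated triangular fuzzy numbers can be sketched as follows. The source does not give the exact formula, so everything specific here is an assumption for illustration: the `fuzzify` rule (a multiplicative spread that widens as confidence drops), the `max_spread` parameter, the example judgments, and the choice of Buckley's geometric-mean method with centroid defuzzification.

```python
import math


def fuzzify(value, confidence, max_spread=1.0):
    """Map a crisp judgment and a confidence in [0, 1] to a triangular
    fuzzy number (low, mid, high). Assumed rule: lower confidence gives
    a wider support; confidence 1.0 collapses to the crisp value."""
    spread = max_spread * (1.0 - confidence)
    return (value / (1.0 + spread), value, value * (1.0 + spread))


def fuzzy_ahp_weights(judgments, n):
    """judgments maps (i, j) with i < j to (value, confidence).
    Returns (fuzzy_weights, defuzzified_weights)."""
    one = (1.0, 1.0, 1.0)
    matrix = [[one] * n for _ in range(n)]
    for (i, j), (value, confidence) in judgments.items():
        low, mid, high = fuzzify(value, confidence)
        matrix[i][j] = (low, mid, high)
        # The reciprocal of a triangular fuzzy number swaps its bounds.
        matrix[j][i] = (1.0 / high, 1.0 / mid, 1.0 / low)

    # Buckley's method: component-wise geometric mean of each row.
    geo = [
        tuple(math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    sum_low, sum_mid, sum_high = (sum(g[k] for g in geo) for k in range(3))
    # Fuzzy division: the lower bound divides by the largest total.
    fuzzy = [(g[0] / sum_high, g[1] / sum_mid, g[2] / sum_low) for g in geo]

    # Centroid defuzzification, then renormalize so the weights sum to 1.
    centroids = [(low + mid + high) / 3.0 for low, mid, high in fuzzy]
    total = sum(centroids)
    return fuzzy, [c / total for c in centroids]


# Hypothetical judgments over three criteria, each with a confidence score.
judgments = {(0, 1): (3.0, 0.9), (0, 2): (5.0, 0.6), (1, 2): (2.0, 0.3)}
fuzzy, crisp = fuzzy_ahp_weights(judgments, 3)
for (low, mid, high), weight in zip(fuzzy, crisp):
    print(f"fuzzy=({low:.3f}, {mid:.3f}, {high:.3f})  defuzzified={weight:.3f}")
```

With every confidence set to 1.0 this sketch reduces to crisp AHP, which is a useful sanity check for any confidence-to-spread mapping.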

Validation and Results

The framework was validated on JudgeBench, a benchmark for evaluating LLM-based judges. Across model scales and dataset splits, both crisp AHP and fuzzy AHP consistently outperformed direct scoring. FAHP in particular showed superior stability when comparisons were uncertain.

DualJudge Framework

Building on these results, the authors propose DualJudge, a hybrid evaluation framework inspired by Dual-Process Theory. DualJudge fuses a holistic direct score with the structured AHP output through consistency-aware weighting.

The direct score plays the role of the intuitive judgment and the AHP output the deliberative one. According to the authors, combining the two achieves state-of-the-art performance, which highlights that the approaches have complementary strengths.
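The source does not spell out the weighting rule, so the following is only a sketch of what consistency-aware fusion could look like. The function name `dual_judge_score`, the linear trust rule, and the parameters `cr_limit` and `max_ahp_weight` are all hypothetical; both scores are assumed to be on the same 0 to 1 scale.

```python
def dual_judge_score(direct_score, ahp_score, consistency_ratio,
                     cr_limit=0.1, max_ahp_weight=0.5):
    """Blend a holistic direct score with a structured AHP score.

    Assumed rule: trust in the AHP score falls linearly from 1 to 0 as
    the consistency ratio rises from 0 to cr_limit, so inconsistent
    pairwise judgments shift the result toward the direct score."""
    trust = 1.0 - consistency_ratio / cr_limit
    trust = min(max(trust, 0.0), 1.0)  # clamp to [0, 1]
    ahp_weight = max_ahp_weight * trust
    return ahp_weight * ahp_score + (1.0 - ahp_weight) * direct_score


# Hypothetical scores for one response under two consistency levels.
consistent_case = dual_judge_score(0.6, 0.9, consistency_ratio=0.02)
inconsistent_case = dual_judge_score(0.6, 0.9, consistency_ratio=0.50)
print(f"consistent AHP judgments:   {consistent_case:.3f}")
print(f"inconsistent AHP judgments: {inconsistent_case:.3f}")
```

In this sketch an inconsistent comparison matrix makes the fused score fall back entirely to the direct score, while a consistent one lets the structured score pull the result toward its own value.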

Conclusion

The study's results underscore the value of uncertainty-aware structured reasoning in LLM evaluation. Adapting an established method such as AHP and making it confidence-aware yields assessments that are more consistent than direct scoring, and fusing both views in DualJudge improves on either one alone.

Resources

The authors have made their code available at the following link:

GitHub Repository

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!