Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean
Source: arXiv:2604.08595v1 (cross-listed)
Abstract
Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and Natural Language Inference (NLI), often struggle to align with human assessments due to their inability to adapt to the strictness required in different application domains. This article introduces a novel method called Temperature-Controlled Verdict Aggregation (TCVA). This method integrates a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter, ranging from 0.1 to 1.0, that allows control over the evaluation rigor.
Key Features of TCVA
TCVA offers several distinctive features that enhance the evaluation of AI systems:
- Five-Level Verdict-Scoring System: The method employs a detailed scoring approach that allows for nuanced evaluations of AI outputs.
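- Generalized Power-Mean Aggregation: Individual verdict scores are combined with a generalized power mean. The temperature determines whether the aggregate is pulled toward the lowest scores (strict settings) or toward the highest scores (lenient settings).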
- Flexible Temperature Parameter: The temperature parameter can be adjusted to modify the strictness of the evaluation, enabling users to tailor assessments to specific domains.
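For reference, the standard (unweighted) generalized power mean of verdict scores can be written as below. This is the textbook definition, not a formula quoted from the paper; the symbols s_i (individual verdict scores), n (number of verdicts), and p (the exponent) are notation chosen here, and this summary does not state how the temperature is mapped to p.

```latex
M_p(s_1, \dots, s_n) = \left( \frac{1}{n} \sum_{i=1}^{n} s_i^{\,p} \right)^{1/p}, \qquad p \neq 0,\ s_i > 0

\lim_{p \to -\infty} M_p = \min_i s_i, \qquad
\lim_{p \to 0} M_p = \left( \prod_{i=1}^{n} s_i \right)^{1/n}, \qquad
M_1 = \frac{1}{n} \sum_{i=1}^{n} s_i, \qquad
\lim_{p \to +\infty} M_p = \max_i s_i
```

Because M_p never decreases as p grows, a small or negative exponent gives a pessimistic aggregate dominated by the weakest verdicts, while a large exponent gives a lenient one dominated by the strongest.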
Temperature Control in Evaluation
The temperature parameter plays a crucial role in determining the nature of the evaluation:
- Low Temperatures (e.g., < 0.5): Yield conservative, pessimistic scores that are particularly suited for safety-critical domains, ensuring that potential risks are thoroughly assessed.
- High Temperatures (e.g., > 0.5): Result in more lenient scores, which suit applications such as conversational AI where a more forgiving assessment is acceptable.
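The behavior described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the verdict labels and their numeric values, the linear mapping in `temperature_to_exponent`, and all function names are hypothetical choices made here only to show how one set of verdicts can yield stricter or more lenient scores.

```python
import math

# Hypothetical five-level verdict scale; the labels and numeric values are
# illustrative assumptions, not the scheme used in the paper.
VERDICT_SCORES = {
    "fully_supported": 1.0,
    "mostly_supported": 0.75,
    "partially_supported": 0.5,
    "weakly_supported": 0.25,
    "unsupported": 0.0,
}


def temperature_to_exponent(temperature: float) -> float:
    """Map a temperature in [0.1, 1.0] to a power-mean exponent p.

    Assumed linear mapping (not from the paper): T = 0.5 gives p = 1
    (arithmetic mean); lower T gives a smaller p, higher T a larger p.
    """
    if not 0.1 <= temperature <= 1.0:
        raise ValueError("temperature must lie in [0.1, 1.0]")
    return 1.0 + 10.0 * (temperature - 0.5)


def power_mean(scores: list[float], p: float) -> float:
    """Unweighted generalized power mean of scores in [0, 1]."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if p <= 1e-9 and min(scores) == 0.0:
        # For p <= 0 the power mean tends to 0 as soon as any score is 0.
        return 0.0
    if abs(p) < 1e-9:
        # p -> 0 limit: geometric mean, computed in log space for stability.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


def aggregate(verdicts: list[str], temperature: float) -> float:
    """Aggregate stored verdict labels at the requested strictness."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    return power_mean(scores, temperature_to_exponent(temperature))


if __name__ == "__main__":
    verdicts = ["fully_supported", "mostly_supported",
                "weakly_supported", "fully_supported"]
    for t in (0.1, 0.5, 1.0):
        print(f"T={t:.1f} -> score={aggregate(verdicts, t):.3f}")
```

In this sketch the verdict list is produced once and only the aggregation step depends on the temperature, so sweeping the temperature is a purely arithmetic operation over stored verdicts.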
Experimental Evaluation
To validate TCVA, experiments were conducted on three benchmark datasets with human Likert-scale annotations, including SummEval and USR. On faithfulness, TCVA correlates with human judgments at a level comparable to, though slightly below, RAGAS: a Spearman correlation coefficient of 0.667 versus 0.676 for RAGAS. TCVA also consistently outperformed DeepEval.
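As a reminder of what the reported metric measures, the following sketch computes a Spearman rank correlation between metric scores and human Likert ratings using only the standard library. The function names and the sample numbers are hypothetical and serve only to illustrate the calculation; they are not data from the paper.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical example values, not results from the paper.
    human_likert = [1, 2, 2, 4, 5, 3]
    metric_scores = [0.20, 0.35, 0.50, 0.80, 0.90, 0.45]
    print(f"Spearman rho = {spearman(human_likert, metric_scores):.3f}")
```

Because the coefficient depends only on rank order, it rewards a metric for ordering outputs the way human annotators do, regardless of the absolute scale of its scores.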
Conclusion
Temperature-Controlled Verdict Aggregation gives evaluators of LLM-based AI systems direct control over evaluation rigor. This addresses the inability of existing methods to adapt to the strictness a domain requires and provides a more human-aligned assessment framework. Adjusting the temperature requires no additional calls to the language model, which makes TCVA a practical option for developers and researchers.
