Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
arXiv:2510.18196v2
Abstract
Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. Focusing on summarization, we first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.7% relative improvement on average in Spearman correlation with human judgments across different score ranges.
Introduction
In recent years, the deployment of Large Language Models (LLMs) as evaluators has gained traction across various fields, including education, content creation, and automated reviews. Their ability to process and analyze large volumes of text makes them attractive candidates for judging quality, coherence, and relevance. However, their reliability as judges is often limited by inherent biases, particularly score range bias.
Understanding Score Range Bias
Score range bias occurs when the outputs generated by LLM judges are disproportionately influenced by the predefined scoring ranges. This bias can lead to inconsistencies in assessments, making it difficult to trust LLM evaluations. Key factors contributing to this issue include:
- Pre-defined Score Ranges: LLMs may anchor their responses to these ranges, impacting the variability and accuracy of their scores.
- Model Family Bias: Similar biases can be observed among different models within the same family, indicating a systemic issue.
- Context Sensitivity: The context provided to LLMs influences their scoring, leading to variations based on how questions are framed.
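One way to make score range bias concrete is to ask the same judge for scores under several ranges and rescale the answers to a common [0, 1] interval. A judge without range bias would give similar rescaled scores for the same summary. The sketch below is a minimal illustration and not necessarily the setup used in the experiments: the prompt wording, the function names `build_prompt`, `rescale`, and `mean_rescaled`, and the example scores are all assumptions made up for illustration.

```python
from statistics import mean


def build_prompt(source, summary, lo, hi):
    """Hypothetical direct-assessment prompt: no reference summary is given."""
    return (
        f"Source text:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Rate the summary on a scale from {lo} to {hi}. "
        "Answer with a single integer."
    )


def rescale(score, lo, hi):
    """Map a score from the range [lo, hi] onto [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (score - lo) / (hi - lo)


def mean_rescaled(scores, lo, hi):
    """Average of the rescaled scores for one score range."""
    return mean(rescale(s, lo, hi) for s in scores)


# Made-up judge scores for the same four summaries under two ranges.
scores_by_range = {
    (1, 5): [3, 4, 3, 4],
    (0, 10): [7, 8, 7, 8],
}

for (lo, hi), scores in scores_by_range.items():
    value = mean_rescaled(scores, lo, hi)
    print(f"range {lo}-{hi}: mean rescaled score = {value:.3f}")
```

A gap between the rescaled means for the same set of summaries is one rough sign that the range itself is shifting the judge's scores.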
Mitigating Bias with Contrastive Decoding
To address these challenges, we apply contrastive decoding. In its standard form, contrastive decoding scores each candidate token by the gap between its log-probability under a stronger model and its log-probability under a weaker one, so that preferences shared by both models cancel out. Since models from the same family exhibit similar score range biases, contrasting the judge with a related model can remove part of the bias tied to the predefined range and leave a score that tracks the quality of the summary more closely.
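As a rough sketch of how contrastive decoding can be applied to the score token, the example below follows the commonly used formulation: choose the token that maximizes the difference between the log-probability of a stronger "expert" model and that of a weaker "amateur" model, restricted to tokens the expert itself finds plausible. Whether the experiments use exactly this variant is an assumption. The function name `contrastive_pick`, the threshold `alpha`, and the probability values are hypothetical and made up for illustration.

```python
import math


def contrastive_pick(expert_logprobs, amateur_logprobs, alpha=0.1):
    """Pick the score token maximizing log p_expert - log p_amateur.

    Only tokens whose expert probability is at least `alpha` times the
    expert's highest probability are eligible (plausibility constraint).
    """
    cutoff = max(expert_logprobs.values()) + math.log(alpha)
    eligible = [tok for tok, lp in expert_logprobs.items() if lp >= cutoff]
    return max(
        eligible,
        key=lambda tok: expert_logprobs[tok] - amateur_logprobs[tok],
    )


# Made-up next-token probabilities over the score tokens "1".."5".
expert_probs = {"1": 0.05, "2": 0.10, "3": 0.40, "4": 0.35, "5": 0.10}
amateur_probs = {"1": 0.05, "2": 0.15, "3": 0.55, "4": 0.15, "5": 0.10}

expert_lp = {tok: math.log(p) for tok, p in expert_probs.items()}
amateur_lp = {tok: math.log(p) for tok, p in amateur_probs.items()}

# Greedy decoding takes the expert's most likely score token.
greedy_score = max(expert_lp, key=expert_lp.get)
# Contrastive decoding discounts scores that the amateur also favors.
contrastive_score = contrastive_pick(expert_lp, amateur_lp)

print("greedy:", greedy_score)
print("contrastive:", contrastive_score)
```

In this toy case both models lean toward the middle of the range, so the contrast moves the decision to the score on which the expert is notably more confident than the amateur.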
Results
Our experiments show that contrastive decoding improves the performance of LLMs as judges. Specifically, we observed:
- Up to 11.7% relative improvement on average in Spearman correlation with human judgments across different score ranges.
- A reduction in score range bias, giving more consistent evaluations across ranges.
- More reliable LLM assessments, which supports their use in more critical evaluation scenarios.
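For readers who want to reproduce this kind of measurement, the sketch below computes Spearman correlation (the Pearson correlation of the ranks, with tied values sharing their average rank) and a relative improvement between two correlation values. The function names and all numbers are illustrative assumptions and are not taken from the experiments.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Made-up human ratings and judge scores for five summaries.
human = [2, 4, 3, 5, 1]
judge = [3, 4, 3, 5, 2]
print(f"Spearman correlation: {spearman(human, judge):.3f}")

# Made-up correlations: moving from 0.50 to 0.55 is a 10% relative gain.
print(f"relative improvement: {relative_improvement(0.50, 0.55):.1%}")
```

Note that a relative improvement is measured against the baseline correlation, so it is not the same as an absolute gain in correlation points.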
Conclusion
The findings underscore the importance of addressing score range bias in LLMs used as judges. By leveraging contrastive decoding, we can enhance the reliability of LLM evaluations, paving the way for more effective applications in content assessment, automated grading, and beyond. Future work will focus on further refining this methodology and exploring its implications across different evaluation contexts.
