Contrastive Decoding Reduces Score Bias in LLM Evaluations

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

Summary: arXiv:2510.18196v2

Abstract

Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. Focusing on summarization, we first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.7% relative improvement on average in Spearman correlation with human judgments across different score ranges.

Introduction

In recent years, the deployment of Large Language Models (LLMs) as evaluators has gained traction across various fields, including education, content creation, and automated reviews. Their ability to process and analyze large amounts of text makes them attractive candidates for judging quality, coherence, and relevance. However, the reliability of these models as judges is often undermined by inherent biases, and score range bias is one of them.

Understanding Score Range Bias

Score range bias occurs when the scores an LLM judge assigns are highly sensitive to the pre-defined score range it is asked to use (for example, 1 to 5 versus 1 to 10) rather than depending only on the quality of the text being judged. This bias leads to inconsistent assessments and makes LLM evaluations harder to trust. Key factors contributing to this issue include:

Pre-defined Score Ranges: LLMs may anchor their responses to these ranges, impacting the variability and accuracy of their scores.
Model Family Bias: Similar biases can be observed among different models within the same family, indicating a systemic issue.
Context Sensitivity: The context provided to LLMs influences their scoring, leading to variations based on how questions are framed.
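
One way to make this sensitivity concrete is to rescale the scores a judge gives under each range onto [0, 1] and compare the averages. The sketch below is only an illustration and is not taken from the paper: the function names `normalize` and `range_shift` are hypothetical, and the scores are made-up toy values.

```python
from statistics import mean


def normalize(score, low, high):
    """Map a score from the range [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (score - low) / (high - low)


def range_shift(scores_by_range):
    """Largest gap between mean normalized scores across score ranges.

    A judge that is insensitive to the range would give a gap near zero
    when the same items are scored under every range.
    """
    means = [
        mean(normalize(s, low, high) for s in scores)
        for (low, high), scores in scores_by_range.items()
    ]
    return max(means) - min(means)


# Toy (made-up) judge scores for the same three summaries under two ranges.
toy_scores = {
    (1, 5): [4, 4, 3],
    (1, 10): [7, 6, 5],
}
shift = range_shift(toy_scores)
print(f"shift in mean normalized score: {shift:.3f}")
```

A non-zero shift on identical inputs is a rough symptom of range sensitivity. The paper itself measures the effect through Spearman correlation with human judgments, which this simple check does not replace.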

Mitigating Bias with Contrastive Decoding

To address these challenges, we propose the use of contrastive decoding. In its standard form, contrastive decoding scores each candidate token by the difference between its log-probability under a stronger model and its log-probability under a weaker one, so that preferences shared by both models are discounted. Because models from the same family show similar score range biases, contrasting the judge with a related model can cancel part of that shared bias, leaving a score that reflects the judged text more than the range it was asked to use.
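
To illustrate the general idea, here is a minimal sketch of the standard contrastive decoding rule applied to a single score token: the judge ("expert") and a weaker model ("amateur") each give log-probabilities for the candidate scores, and the chosen score maximizes the difference among candidates the expert considers plausible. This is not necessarily the exact formulation used in the paper. The function name `contrastive_pick`, the threshold `alpha`, and all probabilities are assumptions made for illustration.

```python
import math


def contrastive_pick(expert_logp, amateur_logp, alpha=0.1):
    """Pick the candidate with the largest expert-minus-amateur log-prob gap.

    Only candidates whose expert probability is at least `alpha` times the
    expert's top probability are considered (a plausibility constraint).
    `alpha` is an assumed hyperparameter, not a value from the paper.
    """
    cutoff = math.log(alpha) + max(expert_logp.values())
    plausible = [tok for tok, lp in expert_logp.items() if lp >= cutoff]
    return max(plausible, key=lambda tok: expert_logp[tok] - amateur_logp[tok])


# Made-up probabilities over the score tokens "1".."5" for one summary.
expert_p = {"1": 0.05, "2": 0.10, "3": 0.30, "4": 0.40, "5": 0.15}
amateur_p = {"1": 0.05, "2": 0.10, "3": 0.20, "4": 0.50, "5": 0.15}

expert_logp = {tok: math.log(p) for tok, p in expert_p.items()}
amateur_logp = {tok: math.log(p) for tok, p in amateur_p.items()}

greedy = max(expert_logp, key=expert_logp.get)
contrastive = contrastive_pick(expert_logp, amateur_logp)
print("greedy:", greedy, "contrastive:", contrastive)
```

In this toy case both models lean toward "4", but the amateur leans harder, so the contrast discounts that shared pull and selects "3" instead of the greedy "4".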

Results

Our experiments on summarization show that contrastive decoding improves the agreement between LLM judges and human annotators. Specifically, we observed:

Up to 11.7% relative improvement on average in Spearman correlation with human judgments across different score ranges.
A reduction in score range bias, allowing for more consistent evaluations.
More dependable LLM assessments, which may support their use in more critical evaluation scenarios.
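
For readers unfamiliar with the metric, the sketch below shows how Spearman correlation and a relative improvement can be computed. It is a plain-Python illustration under stated assumptions: the helper names are hypothetical, and every number is made up rather than taken from the paper.

```python
def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def relative_improvement(baseline, improved):
    """Fractional gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Made-up human and judge scores for five summaries.
human = [1, 2, 3, 4, 5]
judge = [2, 1, 3, 5, 4]
rho = spearman(human, judge)

# Made-up correlations before and after a mitigation step.
gain = relative_improvement(0.50, 0.55)
print(f"rho={rho:.2f}, relative gain={gain:.1%}")
```

Note that a relative improvement is measured against the baseline correlation, so a gain from 0.50 to 0.55 is a 10% relative improvement, not an increase of 10 correlation points.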

Conclusion

The findings underscore the importance of addressing score range bias in LLMs used as judges. By leveraging contrastive decoding, we can enhance the reliability of LLM evaluations, paving the way for more effective applications in content assessment, automated grading, and beyond. Future work will focus on further refining this methodology and exploring its implications across different evaluation contexts.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
