LLM Essay Scoring: Bias and Prompt Effects in Rubrics

LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

Large language models (LLMs) are increasingly used in educational assessment, but how closely they align with human scoring remains debated. A new study on arXiv (2604.00259v1) systematically evaluates instruction-tuned LLMs on three open essay-scoring datasets: ASAP 2.0, ELLIPSE, and DREsS. It covers both holistic scoring (one overall score per essay) and analytic scoring (a separate score per trait), and compares model output with human raters in each setting.

The study measures two things: agreement between LLM scores and human consensus scores, and the presence and stability of directional bias, meaning whether a model scores systematically higher or lower than human raters. Strong open-weight models reach moderate to high agreement with human raters on holistic scoring, with a Quadratic Weighted Kappa (QWK) of approximately 0.6. That level of agreement does not carry over uniformly to analytic scoring, where the picture is more complex.
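
For readers unfamiliar with the metric, the following is a minimal sketch of how Quadratic Weighted Kappa can be computed for two lists of integer scores. The function name, its parameters, and the example scores are illustrative assumptions, not taken from the study; in practice, scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same quantity.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Cohen's kappa with quadratic weights for two integer score lists.

    Hypothetical helper for illustration; scores must lie in
    [min_score, max_score].
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if any(not (min_score <= s <= max_score) for s in list(human) + list(model)):
        raise ValueError("score outside the declared scale")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the scale needs at least two score points")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model)
    hist_h = Counter(human)                # marginal counts, human
    hist_m = Counter(model)                # marginal counts, model

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at the extremes.
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += w * observed[(i, j)] / n
            weighted_expected += w * (hist_h[i] / n) * (hist_m[j] / n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one score")
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 1-4 scale, purely for demonstration.
human_scores = [1, 2, 3, 4, 3, 2]
model_scores = [1, 2, 3, 3, 3, 1]
qwk = quadratic_weighted_kappa(human_scores, model_scores, 1, 4)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Because the penalty grows with the square of the score difference, a model that is off by two points is penalized four times as much as one that is off by one.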

Key Findings

  • Holistic vs. Analytic Scoring: Agreement with human raters is moderate to high for holistic scores, but it does not transfer uniformly to analytic (per-trait) scores.
  • Directional Bias: The models show a large and stable negative directional bias on Lower-Order Concern (LOC) traits such as Grammar and Conventions, meaning they score these traits more harshly than human raters do.
  • Prompt Efficiency: Concise, keyword-based prompts tend to outperform longer, rubric-style prompts for multi-trait analytic scoring.
  • Sample Size Considerations: To gauge how much data is needed to detect systematic deviations in scoring, the authors compute the minimum sample size at which a 95% bootstrap confidence interval for mean bias excludes zero.
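
The directional-bias and sample-size ideas in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: bias is taken as model score minus human score (so negative means the model is harsher), the interval is a simple percentile bootstrap, and each candidate size is checked on a single random subsample. All function names and parameters are hypothetical.

```python
import random
from statistics import mean


def mean_bias(human, model):
    """Mean signed difference, model minus human (assumed sign convention)."""
    return mean(m - h for h, m in zip(human, model))


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the mean of diffs."""
    rng = rng or random.Random(0)
    n = len(diffs)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def min_sample_size(diffs, sizes, n_boot=2000, seed=0):
    """Smallest candidate size whose 95% CI for mean bias excludes zero.

    Returns None if no candidate size qualifies. One subsample per size
    is drawn here for brevity; averaging over several subsamples would
    give a more stable answer.
    """
    rng = random.Random(seed)
    for n in sorted(sizes):
        if n > len(diffs):
            break
        sample = rng.sample(diffs, n)
        lower, upper = bootstrap_ci(sample, n_boot=n_boot, rng=rng)
        if lower > 0 or upper < 0:
            return n
    return None
```

Given per-essay differences for one trait, `min_sample_size(diffs, sizes=[10, 20, 50, 100])` would report the first size at which the bias is distinguishable from zero. A large, consistent bias is detectable with few essays, while a small or noisy one needs many more.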

Implications for Educational Assessment

These findings matter for deploying LLMs in educational settings. The results point to a bias-correction-first strategy: rather than relying on raw zero-shot scores, estimate the systematic score offsets on a small human-labeled bias-estimation set and correct for them. This does not require large-scale fine-tuning, which makes it a practical option for educators and institutions.
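
One simple way to picture this correction is an additive per-trait offset. The sketch below assumes that form; the study may estimate and apply offsets differently. The function names, the `calibration` structure, and the example numbers are all hypothetical.

```python
from statistics import mean


def estimate_offsets(calibration):
    """Estimate a per-trait offset from a small human-labeled set.

    calibration maps a trait name to a list of (human, model) score pairs.
    The offset is the mean of model minus human (assumed convention).
    """
    return {
        trait: mean(m - h for h, m in pairs)
        for trait, pairs in calibration.items()
    }


def correct_score(raw, trait, offsets, min_score, max_score):
    """Subtract the trait's estimated offset and clip to the score scale."""
    adjusted = raw - offsets.get(trait, 0.0)  # unknown traits stay unchanged
    return min(max_score, max(min_score, adjusted))


# Made-up calibration pairs: the model is one point harsher on grammar.
calibration = {"grammar": [(4, 3), (3, 2), (5, 4)]}
offsets = estimate_offsets(calibration)
corrected = correct_score(2, "grammar", offsets, min_score=1, max_score=5)
```

Clipping keeps corrected scores on the original scale. Whether to round the result back to an integer depends on how the scores are used downstream.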

In conclusion, LLMs show promise for supporting essay scoring, but the study stresses that their biases need careful handling, particularly on LOC traits. Prioritizing bias correction can make AI-assisted assessment more reliable and fair for students and educators alike.


Lazarus Omolua (https://richlyai.com/blog)