LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Large language models (LLMs) are increasingly used in educational assessment, but how closely their scores align with human scoring remains debated. A new study on arXiv (2604.00259v1) systematically evaluates instruction-tuned LLMs on three open essay-scoring datasets: ASAP 2.0, ELLIPSE, and DREsS. It covers both holistic scoring (one overall score per essay) and analytic scoring (a separate score per trait), and compares model output with that of human raters.
The study measures two things: agreement between LLM scores and human consensus scores, and the presence and stability of directional bias, that is, whether a model scores systematically higher or lower than human raters. Strong open-weight models reach moderate to high agreement with human raters on holistic scoring, with a Quadratic Weighted Kappa (QWK) of approximately 0.6. That level of agreement does not carry over uniformly to analytic scoring.
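For readers unfamiliar with the agreement metric, the following is a minimal sketch of Quadratic Weighted Kappa for integer scores, written with the Python standard library only. The function name, its arguments, and the example scores are illustrative assumptions and are not taken from the study.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores.

    Hypothetical helper for illustration; not the study's implementation.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    if max_score <= min_score:
        raise ValueError("the score scale needs at least two levels")

    n = len(human)
    k = max_score - min_score + 1          # number of score levels
    observed = Counter(zip(human, model))  # observed (human, model) pair counts
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Disagreement penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters were independent.
            weighted_expected += weight * human_hist[i] * model_hist[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1.0 - weighted_observed / weighted_expected


# Made-up example scores on a 1-4 scale.
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 3], 1, 4))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.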
Key Findings
- Holistic vs. Analytic Scoring: The models agree well with human raters on holistic scoring, but agreement is weaker and less uniform on analytic (per-trait) scoring.
- Directional Bias: The models show a large and stable negative directional bias on Lower-Order Concern (LOC) traits such as Grammar and Conventions. In other words, LLMs tend to score these traits more harshly than human raters do.
- Prompt Efficiency: Concise, keyword-based prompts tend to outperform longer, rubric-style prompts for multi-trait analytic scoring.
- Sample Size Considerations: To gauge how much labeled data is needed to detect a systematic deviation, the authors compute the minimum sample size at which a 95% bootstrap confidence interval for the mean bias excludes zero.
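The directional-bias and sample-size ideas in the list above can be sketched as follows. This is one plausible reading, not the authors' exact procedure: it assumes bias is defined as model score minus human score (so a negative mean indicates harsher model scoring), uses a percentile bootstrap, and draws a single random subsample per candidate size. All function names and parameters are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the mean of `diffs`."""
    rng = rng or random.Random(0)
    n = len(diffs)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def min_n_excluding_zero(diffs, sizes, n_boot=2000, seed=0):
    """Smallest size in `sizes` whose subsample gives a 95% CI excluding zero.

    Assumption: one random subsample per size; averaging over many
    subsamples would give a more stable estimate.
    """
    rng = random.Random(seed)
    for n in sizes:
        if n > len(diffs):
            break  # not enough labeled essays for this candidate size
        sample = rng.sample(diffs, n)
        lower, upper = bootstrap_ci(sample, n_boot=n_boot, rng=rng)
        if lower > 0 or upper < 0:
            return n
    return None


# Made-up per-essay differences (model score minus human score) for one trait.
diffs = [-1, -1, 0, -2, -1, 0, -1, -1, -2, 0] * 5
print(mean(diffs))  # mean directional bias on this toy data
print(min_n_excluding_zero(diffs, sizes=[5, 10, 20, 40]))
```

The returned size depends on the random seed and on the spread of the differences, so in practice it is better read as a rough guide to labeling effort than as a fixed threshold.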
Implications for Educational Assessment
These findings matter for deploying LLMs in educational settings. The results suggest that a bias-correction-first strategy may be most effective: rather than relying on raw zero-shot scores alone, estimate the systematic score offsets on a small human-labeled bias-estimation set and correct for them. This requires no large-scale fine-tuning, which makes it practical for educators and institutions.
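A bias-correction-first workflow might look like the sketch below. It assumes the simplest possible correction, a constant additive offset per trait estimated as the mean of model score minus human score on the bias-estimation set; the study may use a different estimator. The function names and example scores are hypothetical.

```python
from statistics import mean


def estimate_offset(model_scores, human_scores):
    """Mean signed difference (model minus human) on a small labeled set."""
    if len(model_scores) != len(human_scores) or not model_scores:
        raise ValueError("score lists must be non-empty and the same length")
    return mean(m - h for m, h in zip(model_scores, human_scores))


def correct_scores(raw_scores, offset, min_score, max_score):
    """Subtract the estimated offset and clip to the valid score range."""
    return [min(max_score, max(min_score, s - offset)) for s in raw_scores]


# Made-up bias-estimation set for one trait on a 1-5 scale.
model_labeled = [2, 3, 2, 4]
human_labeled = [3, 4, 3, 5]
offset = estimate_offset(model_labeled, human_labeled)

# Apply the correction to new, unlabeled zero-shot scores.
print(correct_scores([2, 3, 5], offset, min_score=1, max_score=5))
```

Corrected scores are left unrounded here; whether to round them back onto the rubric's integer scale depends on how the scores are reported.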
In conclusion, LLMs show promise for supporting essay scoring, but the study stresses that their biases, particularly on LOC traits, need careful attention. Prioritizing bias correction can make AI-assisted assessment more reliable and fair for students and educators.
