LLMs Do Not Grade Essays Like Humans
arXiv:2603.23714v1
Abstract
Large language models (LLMs) have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics.
Key Findings
Our analysis reveals several important trends in the grading behavior of LLMs:
- Score Discrepancies: Compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays.
- Impact of Essay Length: Conversely, LLMs tend to assign lower scores than human raters to longer essays that contain minor grammatical or spelling errors.
- Feedback Consistency: The scores generated by LLMs are generally consistent with the feedback they provide; essays receiving more praise tend to receive higher scores, while those receiving more criticism tend to score lower.
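The length-related trends above can be checked on any set of paired scores by comparing the mean gap between LLM and human scores for short and long essays. The following is a minimal sketch, not the procedure used in the study: the function name, the 150-word threshold, and the toy data are all hypothetical.

```python
from statistics import mean


def mean_gap_by_length(records, threshold_words=150):
    """Return the mean (LLM score - human score) for short and long essays.

    records: iterable of (word_count, human_score, llm_score) tuples.
    threshold_words: arbitrary cut-off between "short" and "long" (an assumption).
    """
    short_gaps = [llm - human for words, human, llm in records if words < threshold_words]
    long_gaps = [llm - human for words, human, llm in records if words >= threshold_words]
    return {
        "short": mean(short_gaps) if short_gaps else None,
        "long": mean(long_gaps) if long_gaps else None,
    }


# Made-up toy data: (word_count, human_score, llm_score)
toy_records = [
    (80, 2, 3),
    (120, 2, 4),
    (400, 5, 4),
    (520, 6, 4),
]

gaps = mean_gap_by_length(toy_records)
print(gaps)
```

A positive gap for short essays together with a negative gap for long essays would match the pattern described in the findings. Real data would also call for a significance test and a less arbitrary split.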
Analysis of Grading Behavior
Our study compared the grading outputs of several LLMs from the GPT and Llama families with human grades. The models were evaluated out of the box, without any task-specific training, so the results reflect their default grading behavior.
The results indicate that while LLM scores and feedback follow coherent patterns, the signals the models rely on differ from those used by human raters. This divergence limits alignment with human grading practices and, in turn, the reliability of LLM-generated scores as a stand-in for human grades.
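This summary does not state which agreement statistic the study used. As an illustration only, quadratic weighted kappa (QWK) is a common choice in automated essay scoring for measuring agreement between two raters on an ordinal scale. The sketch below assumes integer scores within a known range; the function name and example scores are hypothetical.

```python
def quadratic_weighted_kappa(human, llm, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means chance-level agreement,
    and negative values mean agreement worse than chance.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")

    k = max_score - min_score + 1
    n = len(human)

    # Observed confusion matrix: rows are human scores, columns are LLM scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, llm):
        observed[h - min_score][m - min_score] += 1

    # Marginal score histograms for each rater.
    hist_human = [sum(row) for row in observed]
    hist_llm = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty; its normalising constant cancels in the ratio.
            weight = (i - j) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_human[i] * hist_llm[j] / n

    if weighted_expected == 0:
        # Degenerate case: both raters gave every essay the same single score.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up example scores on a 1-4 scale.
perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_order = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
```

Because the penalty grows with the square of the score difference, QWK punishes large disagreements far more than off-by-one disagreements, which suits ordinal essay scales.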
Implications for Automated Essay Scoring
The findings of this study carry important implications for the future of automated essay scoring systems:
- Limited Agreement: Educators and institutions should be cautious when relying solely on LLMs for essay grading, as the discrepancies in scoring may misrepresent student performance.
- Need for Calibration: Further research, and possibly calibration of LLMs, is needed to align their scoring more closely with human assessments.
- Supporting Role: Despite their limitations, LLMs can still serve as valuable tools in supporting the essay scoring process, especially in providing constructive feedback.
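The study does not prescribe a calibration method. One simple possibility, sketched here under that assumption, is a linear rescaling of LLM scores fitted against human scores on a held-out set. The function names and data are hypothetical.

```python
def fit_linear_calibration(llm_scores, human_scores):
    """Least-squares fit of human_score ~ slope * llm_score + intercept."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need at least two paired scores")

    n = len(llm_scores)
    mean_x = sum(llm_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in llm_scores)
    if var_x == 0:
        raise ValueError("LLM scores are constant; slope is undefined")

    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(llm_scores, human_scores))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(score, slope, intercept, min_score, max_score):
    """Map an LLM score onto the human scale, rounded and clipped to the valid range."""
    adjusted = round(slope * score + intercept)
    return min(max_score, max(min_score, adjusted))


# Made-up held-out pairs in which the LLM is one point more generous.
slope, intercept = fit_linear_calibration([3, 4, 4, 5], [2, 3, 3, 4])
calibrated = calibrate(5, slope, intercept, min_score=1, max_score=6)
```

A global rescaling like this can only correct a uniform offset or spread. It would not remove the length-dependent discrepancies described above, which depend on essay characteristics rather than on the score alone.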
Conclusion
Our work highlights the complexities of using large language models for essay grading and emphasizes the necessity of understanding their limitations. While LLMs can provide feedback that is coherent and consistent with their grading, the lack of agreement with human scores suggests that they should be used carefully. Future developments in LLM technology may improve their alignment with human grading practices, but for now, they should be viewed as supplementary tools rather than replacements for human evaluators.
