Why LLMs Struggle to Grade Essays Like Humans


LLMs Do Not Grade Essays Like Humans

Summary: arXiv:2603.23714v1

Abstract

Large language models (LLMs) have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics.
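To make "agreement" concrete, here is a minimal sketch of quadratic weighted kappa (QWK), a metric commonly used in automated essay scoring. The summary above does not say which agreement metric the study used, so treating QWK as the measure is an assumption, and the function name and the sample scores are invented for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, llm, min_score, max_score):
    """Agreement between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means chance-level agreement,
    and negative values mean agreement worse than chance.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("the score scale needs at least two points")

    n = len(human)
    observed = Counter(zip(human, llm))  # joint counts of (human, llm) pairs
    human_hist = Counter(human)          # marginal counts per rater
    llm_hist = Counter(llm)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Penalty grows with the squared distance between the two scores.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            expected = human_hist[i] * llm_hist[j] / n
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Made-up scores on a 1-6 scale, for illustration only.
    human_scores = [2, 3, 4, 4, 5, 1]
    llm_scores = [4, 3, 4, 3, 4, 3]
    print(quadratic_weighted_kappa(human_scores, llm_scores, 1, 6))
```

Because the penalty is quadratic, a two-point disagreement costs four times as much as a one-point disagreement, which suits ordinal grading scales.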

Key Findings

Our analysis reveals several important trends in the grading behavior of LLMs:

  • Score Discrepancies: Compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays.
  • Impact of Essay Length: Conversely, LLMs tend to score longer essays that contain minor grammatical or spelling errors lower than human raters do.
  • Feedback Consistency: The scores generated by LLMs are generally consistent with the feedback they provide; essays receiving more praise tend to receive higher scores, while those receiving more criticism tend to score lower.
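One way to check for the length-related pattern described above on your own data is to compare the signed gap (LLM score minus human score) for short and long essays. This is a rough sketch, not the study's procedure: the function name, the record layout, and the 150-word threshold are all assumptions chosen for illustration.

```python
from statistics import mean


def score_gap_by_length(records, word_threshold=150):
    """Mean signed gap (LLM score minus human score) for short and long essays.

    records: iterable of (essay_text, human_score, llm_score) tuples.
    word_threshold: hypothetical cutoff between "short" and "long" essays.
    A positive mean gap means the LLM scored that group higher than humans did.
    """
    gaps = {"short": [], "long": []}
    for text, human_score, llm_score in records:
        # Whitespace-separated tokens serve as a crude word count.
        bucket = "short" if len(text.split()) < word_threshold else "long"
        gaps[bucket].append(llm_score - human_score)
    # Report None for a bucket that received no essays.
    return {bucket: (mean(vals) if vals else None) for bucket, vals in gaps.items()}
```

If the pattern holds in a given dataset, the "short" gap would come out positive and the "long" gap negative. A single threshold is a blunt instrument, so a real analysis would also need to look at error counts and essay quality rather than length alone.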

Analysis of Grading Behavior

Our study compared the grading outputs of several LLMs from the GPT and Llama families. The models were evaluated out of the box, without any task-specific training.

The results indicate that while LLMs produce feedback that is coherent with their own scores, the signals they rely on differ from those used by human raters. This divergence limits their alignment with human grading practices and, in turn, the extent to which LLM-generated scores can stand in for human grades.

Implications for Automated Essay Scoring

The findings of this study carry important implications for the future of automated essay scoring systems:

  • Limited Agreement: Educators and institutions should be cautious when relying solely on LLMs for essay grading, as the discrepancies in scoring may misrepresent student performance.
  • Need for Calibration: Further research, and possibly calibration of LLM scores, is needed to bring them closer to human assessments.
  • Supporting Role: Despite their limitations, LLMs can still serve as valuable tools in supporting the essay scoring process, especially in providing constructive feedback.
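As one illustration of what calibration could look like, the sketch below fits a least-squares line that maps LLM scores onto the human scale using a small set of essays graded by both. The source names calibration only as a research direction and does not describe a method, so linear rescaling, the function names, and the clamping to the score range are assumptions.

```python
from statistics import mean


def fit_linear_calibration(llm_scores, human_scores):
    """Least-squares slope and intercept mapping LLM scores to human scores."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need at least two paired scores")
    mean_llm = mean(llm_scores)
    mean_human = mean(human_scores)
    variance = sum((x - mean_llm) ** 2 for x in llm_scores)
    if variance == 0:
        raise ValueError("LLM scores are constant, so no line can be fitted")
    covariance = sum(
        (x - mean_llm) * (y - mean_human)
        for x, y in zip(llm_scores, human_scores)
    )
    slope = covariance / variance
    intercept = mean_human - slope * mean_llm
    return slope, intercept


def calibrate(llm_score, slope, intercept, min_score, max_score):
    """Apply the fitted line, round, and clamp to the valid score range."""
    # Python's round() sends exact halves to the nearest even integer.
    adjusted = round(slope * llm_score + intercept)
    return min(max_score, max(min_score, adjusted))
```

A linear map can only correct a systematic offset or difference in scale. It leaves the rank order of essays unchanged, so it cannot repair disagreements that arise because the model and the human raters respond to different signals.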

Conclusion

Our work highlights the complexities of using large language models for essay grading and emphasizes the necessity of understanding their limitations. While LLMs can provide feedback that is coherent and consistent with their grading, the lack of agreement with human scores suggests that they should be used carefully. Future developments in LLM technology may improve their alignment with human grading practices, but for now, they should be viewed as supplementary tools rather than replacements for human evaluators.



Lazarus Omolua (https://richlyai.com/blog)
