Reliable LLM-Assisted Rubric Scoring for Physics Exams

Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

Source: arXiv:2604.12227v1

Abstract: Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students’ reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required.

Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity.

Research Methodology

Prompting format and temperature settings were systematically varied. Key findings include:

  • Overall, human-AI agreement on total scores was comparable to human inter-rater reliability.
  • Agreement was highest for high- and low-performing responses but declined for mid-level responses involving partial or ambiguous reasoning.
  • Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments.
  • A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring.
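The agreement comparisons in the list above can be made concrete with a small sketch. The source does not say which agreement statistic the study used, so this example assumes quadratic weighted Cohen's kappa for total scores and a mean absolute difference per performance band. The function names, the band cutoffs, and the scores are hypothetical and illustrative; they are not data from the study.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two raters on an integer scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score levels")
    if any(not (min_score <= s <= max_score) for s in list(rater_a) + list(rater_b)):
        raise ValueError("score outside the declared scale")

    n = len(rater_a)
    span = max_score - min_score
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) ** 2) / (span ** 2)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)

    if expected_disagreement == 0:
        # Both raters gave one identical score to every response.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


def mean_abs_diff_by_band(human, ai, low_cut, high_cut):
    """Mean |human - ai| for low, mid, and high bands defined on the human score."""
    bands = {"low": [], "mid": [], "high": []}
    for h, a in zip(human, ai):
        if h < low_cut:
            band = "low"
        elif h > high_cut:
            band = "high"
        else:
            band = "mid"
        bands[band].append(abs(h - a))
    return {name: (sum(d) / len(d) if d else None) for name, d in bands.items()}


# Made-up total scores on a 0-10 scale, for illustration only.
human_scores = [1, 2, 5, 6, 9, 10]
ai_scores = [1, 2, 7, 4, 9, 10]

print(quadratic_weighted_kappa(human_scores, ai_scores, 0, 10))
print(mean_abs_diff_by_band(human_scores, ai_scores, low_cut=4, high_cut=7))
```

Running the same kappa function on pairs of human raters gives a baseline against which the human-AI value can be compared, which mirrors the comparison described in the first finding.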

Findings and Implications

These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics. The study provides several recommendations for educators and developers:

  • Use skill-based rubrics that define each evaluation criterion explicitly.
  • Prefer checklist-based scoring over holistic scoring to improve consistency.
  • Treat prompting format and temperature settings as secondary; they influence scoring reliability less than rubric design does.
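As a minimal sketch of what a checklist-based, skill-tagged rubric might look like in code, the example below stores each criterion as a separately scored item and totals points from yes/no judgments. The item wording, point values, and all names (`ChecklistItem`, `render_checklist`, `score_from_judgments`) are assumptions made for illustration; the source does not reproduce the rubric used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    item_id: str
    skill: str        # skill category the item belongs to
    description: str  # what the rater looks for in the response
    points: int       # credit awarded when the item is judged present


# Hypothetical items for illustration only; not the rubric used in the study.
CHECKLIST = [
    ChecklistItem("E1", "conceptual", "States that mechanical energy is conserved", 2),
    ChecklistItem("E2", "procedural", "Writes a correct energy balance equation", 2),
    ChecklistItem("E3", "procedural", "Solves for the final speed with correct units", 1),
]


def render_checklist(items):
    """Turn the checklist into plain text that a human or an LLM rater can follow."""
    lines = ["For each item, answer yes or no based only on the student's work."]
    for item in items:
        lines.append(f"[{item.item_id}] ({item.points} pt) {item.description}")
    return "\n".join(lines)


def score_from_judgments(items, judgments):
    """Sum points for items judged present and report a subtotal per skill."""
    missing = [item.item_id for item in items if item.item_id not in judgments]
    if missing:
        raise ValueError(f"no judgment for items: {missing}")
    total = 0
    by_skill = {}
    for item in items:
        earned = item.points if judgments[item.item_id] else 0
        by_skill[item.skill] = by_skill.get(item.skill, 0) + earned
        total += earned
    return total, by_skill


print(render_checklist(CHECKLIST))
print(score_from_judgments(CHECKLIST, {"E1": True, "E2": True, "E3": False}))
```

Because every item is a separate yes/no decision, partial credit is the sum of small judgments instead of one holistic estimate, and the per-skill subtotals support the kind of criterion-level comparison described earlier.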

Conclusion

Overall, the study offers transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts. As the role of AI in education continues to grow, understanding the interaction between rubric design and AI scoring mechanisms is crucial for improving assessment practices.

Future research should explore the scalability of these findings across different subjects and educational levels, as well as the potential for integrating AI scoring systems into existing assessment frameworks. The ultimate goal remains to enhance the accuracy and efficiency of scoring in educational settings, thereby supporting the learning outcomes of students in STEM disciplines.


Lazarus Omolua (https://richlyai.com/blog)