Benchmarking LLMs for Automated Math Competency Assessment


Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

The paper, posted on arXiv (2604.26607v1), addresses a pressing issue in education: the transition from traditional marks-based assessment to competency-based education (CBE). This shift is demanding for educators, who must adapt to a more qualitative approach to student evaluation.

The authors propose a “Human-in-the-Loop” benchmarking framework to evaluate how well several large language models (LLMs) can automate competency assessment in secondary-level mathematics, using the Grade 10 Optional Mathematics curriculum in Nepal as the test case. The aim is to reduce the manual burden on educators while keeping competency assessments reliable and accurate.

Key Components of the Study

The study developed a multi-dimensional rubric that covers four main topics and four cross-cutting competencies. The four competencies are:

  • Comprehension
  • Knowledge
  • Operational Fluency
  • Behavior and Correlation

This rubric is the common yardstick against which the competency assessments produced by each LLM are evaluated.

Multi-Provider Ensemble Evaluation

The evaluation included a multi-provider ensemble comprising both open-weight and proprietary models. The models assessed were:

  • Eagle (Llama 3.1-8B)
  • Orion (Llama 3.3-70B)
  • Nova (Gemini 2.5 Flash)
  • Lyra (Gemini 3 Pro)

These models were benchmarked against a ground truth established by two senior mathematics faculty members. The two raters reached high inter-rater reliability, with a kappa score of 0.8652. This strong agreement between the faculty members provides a solid basis for comparison with the LLM outputs.
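To make the kappa figure concrete, here is a minimal sketch of Cohen's kappa for two raters. It assumes the paper's kappa is the unweighted Cohen's variant, which this summary does not confirm; the function name `cohen_kappa` and the sample ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical rubric levels (1-4) for eight student responses.
faculty_1 = [3, 2, 4, 1, 3, 2, 4, 3]
faculty_2 = [3, 2, 4, 1, 3, 3, 4, 3]
print(round(cohen_kappa(faculty_1, faculty_2), 4))
```

Kappa corrects raw percentage agreement for the agreement two raters would reach by chance, which is why it can fall to zero or below even when some labels match.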

Findings and Implications

The benchmarking revealed a significant “architecture-compatibility gap.” The Gemini-based sparse Mixture-of-Experts (MoE) models reached “Fair Agreement” with the faculty ground truth, with a kappa of approximately 0.38. The 70B-parameter Orion model, despite its scale, showed “No Agreement,” with a kappa of -0.0261. This outcome suggests that, for tasks governed by a specific rubric, adherence to instructional constraints matters more than parameter count.
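The verbal labels attached to these kappa values can be reproduced with a small lookup. The sketch below assumes the commonly cited Landis and Koch (1977) bands; the summary does not state which scale the paper uses, and Landis and Koch call values below zero “poor” where the paper's wording is “No Agreement.” The function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a Landis and Koch (1977) agreement label."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0:
        return "poor"  # agreement worse than chance
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


# The three kappa values reported in this article.
for value in (0.8652, 0.38, -0.0261):
    print(value, interpret_kappa(value))
```

Under these bands, the faculty ground truth falls in the top band, the Gemini-based models in the “fair” band, and Orion below chance level.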

The authors conclude that while current LLMs are not yet ready for autonomous certification of student competencies, they can offer substantial assistive support within a “Human-in-the-Loop” framework. This framework allows educators to leverage the strengths of LLMs in preliminary evidence extraction while maintaining oversight and judgment in final assessments.
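One way to picture this division of labour is a record in which the model's suggestion and the teacher's decision are kept separate. This is an illustrative sketch only, not the authors' implementation; the class `DraftAssessment`, its fields, and the `review` helper are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftAssessment:
    """An LLM-drafted competency rating awaiting teacher review (hypothetical)."""
    student_id: str
    competency: str
    llm_level: int       # rubric level suggested by the model
    llm_evidence: str    # excerpt of student work the model cites
    teacher_level: Optional[int] = None  # set only by a human reviewer

    @property
    def final_level(self) -> Optional[int]:
        # Only the teacher's decision counts as final; the model's
        # suggestion alone never certifies a competency.
        return self.teacher_level


def review(draft: DraftAssessment, teacher_level: int) -> DraftAssessment:
    """Record the teacher's judgment, which may confirm or override the model."""
    draft.teacher_level = teacher_level
    return draft


draft = DraftAssessment("S-001", "Operational Fluency", 3, "Steps 2-4 of the solution")
print(draft.final_level)             # no final level before human review
print(review(draft, 2).final_level)  # teacher overrides the model's suggestion
```

Keeping the two levels in separate fields also preserves the data needed to measure model-versus-teacher agreement later.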

Future Directions

This research underscores the need for ongoing development and refinement of LLMs to align better with educational assessment requirements. As competency-based education continues to evolve, it is essential for technology to adapt accordingly, ensuring that tools used in the classroom enhance, rather than hinder, the educational experience.

In summary, the study highlights both the potential and the limitations of LLMs in the realm of educational assessments, paving the way for further exploration into their integration within traditional education systems.

Lazarus Omolua (https://richlyai.com/blog)