Benchmarking LLMs for Automated Math Competency Assessment

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

The recent paper “Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics,” posted on arXiv (2604.26607v1), addresses a pressing issue in education: the transition from traditional marks-based assessment to competency-based education (CBE). The shift is demanding for educators, who must adopt a more qualitative approach to evaluating students.

The authors propose a novel “Human-in-the-Loop” benchmarking framework to evaluate the effectiveness of various large language models (LLMs) in automating assessments for secondary-level mathematics, specifically focusing on the Grade 10 Optional Mathematics curriculum in Nepal. This approach aims to alleviate the manual burden placed on educators while ensuring the reliability and accuracy of competency assessments.

Key Components of the Study

The study developed a multi-dimensional rubric that spans four main topics and four cross-cutting competencies. The four competencies are:

Comprehension
Knowledge
Operational Fluency
Behavior and Correlation

This rubric serves as a foundational tool for evaluating the performance of different LLMs in the context of competency assessment.

Multi-Provider Ensemble Evaluation

The evaluation included a multi-provider ensemble comprising both open-weight and proprietary models. The models assessed were:

Eagle (Llama 3.1-8B)
Orion (Llama 3.3-70B)
Nova (Gemini 2.5 Flash)
Lyra (Gemini 3 Pro)

These models were benchmarked against a ground truth established by two senior mathematics faculty members. The two raters reached high inter-rater reliability, with a kappa score of 0.8652. This score indicates strong agreement between the faculty members on the competency assessments, which gives a solid basis for comparison with the LLM outputs.
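For readers new to the statistic, kappa measures how much two raters agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below computes plain (unweighted) Cohen's kappa. This summary does not say which kappa variant the paper uses, so the unweighted form, the function name `cohen_kappa`, and the sample scores are all assumptions for illustration, not the authors' data or code.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")

    n = len(rater_a)

    # p_o: fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: chance agreement, from each rater's own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is
        # undefined by the formula, so report full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical rubric levels (1 to 3) for eight answer scripts.
teacher_scores = [3, 2, 3, 1, 2, 3, 1, 2]
model_scores = [3, 2, 2, 1, 2, 3, 2, 2]

kappa = cohen_kappa(teacher_scores, model_scores)
print(f"kappa = {kappa:.4f}")
```

A value of 0 means agreement no better than chance, which is why a raw percentage of matching scores can look respectable while kappa stays low.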

Findings and Implications

The benchmarking revealed a pronounced “architecture-compatibility gap.” The Gemini-based Mixture-of-Experts (Sparse MoE) models reached “Fair Agreement” with the faculty ground truth, at a kappa of approximately 0.38. The 70B-parameter Orion model, despite its scale, showed “No Agreement,” with a kappa of -0.0261, which is essentially chance level. This outcome suggests that, for rubric-governed tasks, how well a model adheres to instructional constraints matters more than its parameter count.
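The labels “Fair Agreement” and “No Agreement” match the widely used Landis and Koch (1977) bands for interpreting kappa. This post does not confirm which scale the authors applied, so the mapping below is an assumption; the function name `agreement_band` is also illustrative. It is a minimal sketch that applies those bands to the three kappa values reported above.

```python
def agreement_band(kappa):
    """Map a kappa value to its Landis and Koch (1977) agreement label."""
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values quoted in this article.
reported = {
    "Faculty vs. faculty": 0.8652,
    "Gemini-based models vs. faculty": 0.38,
    "Orion vs. faculty": -0.0261,
}

for pairing, kappa in reported.items():
    print(f"{pairing}: kappa={kappa:+.4f} ({agreement_band(kappa)})")
```

Under these bands the faculty raters fall in the top category, while the best-performing models sit two categories lower, which is the gap the human-in-the-loop design is meant to cover.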

The authors conclude that current LLMs are not yet ready to certify student competencies autonomously, but that they can offer substantial assistive support within a “Human-in-the-Loop” framework. In that framework, the LLM handles preliminary evidence extraction, and the educator keeps oversight and makes the final judgment.

Future Directions

This research underscores the need for ongoing development and refinement of LLMs to align better with educational assessment requirements. As competency-based education continues to evolve, it is essential for technology to adapt accordingly, ensuring that tools used in the classroom enhance, rather than hinder, the educational experience.

In summary, the study highlights both the potential and the limitations of LLMs in the realm of educational assessments, paving the way for further exploration into their integration within traditional education systems.
