Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
The recent paper titled “Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics”, published on arXiv (2604.26607v1), addresses a pressing issue in education: the transition from traditional marks-based assessment to competency-based education (CBE). This shift is challenging for educators, who must adapt to a more qualitative approach to student evaluation.
The authors propose a novel “Human-in-the-Loop” benchmarking framework to evaluate the effectiveness of various large language models (LLMs) in automating assessments for secondary-level mathematics, specifically focusing on the Grade 10 Optional Mathematics curriculum in Nepal. This approach aims to alleviate the manual burden placed on educators while ensuring the reliability and accuracy of competency assessments.
Key Components of the Study
The study developed a multi-dimensional rubric that spans four main topics and four cross-cutting competencies. The four competencies are:
- Comprehension
- Knowledge
- Operational Fluency
- Behavior and Correlation
This rubric serves as a foundational tool for evaluating the performance of different LLMs in the context of competency assessment.
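One way to picture such a rubric is as a grid with one cell per (topic, competency) pair. The sketch below is an illustrative assumption, not the authors' implementation: it assumes every competency is assessed within every topic, the four topic names are placeholders because this summary does not list them, and the names `RubricCell` and `build_rubric` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

# The four cross-cutting competencies named in the summary.
COMPETENCIES = (
    "Comprehension",
    "Knowledge",
    "Operational Fluency",
    "Behavior and Correlation",
)


@dataclass(frozen=True)
class RubricCell:
    """One assessable cell of the rubric (hypothetical structure)."""
    topic: str
    competency: str


def build_rubric(topics):
    """Return one cell per (topic, competency) pair."""
    return [RubricCell(t, c) for t, c in product(topics, COMPETENCIES)]


# Placeholder labels: the real topic names are not given in this summary.
placeholder_topics = ["Topic A", "Topic B", "Topic C", "Topic D"]
rubric = build_rubric(placeholder_topics)
print(len(rubric))  # 4 topics x 4 competencies = 16 cells
```

Under this assumption, each student response would be scored against up to 16 cells, which is the unit on which human and model ratings can later be compared.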
Multi-Provider Ensemble Evaluation
The evaluation included a multi-provider ensemble comprising both open-weight and proprietary models. The models assessed were:
- Eagle (Llama 3.1-8B)
- Orion (Llama 3.3-70B)
- Nova (Gemini 2.5 Flash)
- Lyra (Gemini 3 Pro)
These models were benchmarked against a ground truth established by two senior mathematics faculty members, whose ratings reached high inter-rater reliability (kappa = 0.8652). This score indicates strong agreement between the two raters on the competency assessments, providing a solid basis for comparison with the LLM outputs.
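For readers unfamiliar with the statistic, the following is a minimal sketch of how a kappa score for two raters can be computed. It assumes the unweighted Cohen's kappa; the summary does not say which variant the authors used. The function name `cohen_kappa` and the sample ratings are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("Kappa is undefined when chance agreement is 1.")
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up ratings for illustration only (not data from the paper).
faculty_1 = ["met", "met", "met", "not_met"]
faculty_2 = ["met", "met", "not_met", "not_met"]
print(cohen_kappa(faculty_1, faculty_2))  # 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be much lower than the simple percentage of matching ratings, and can even be negative.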
Findings and Implications
The results of the benchmarking revealed a significant “Architecture-compatibility gap.” Notably, while the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved a “Fair Agreement” with a kappa of approximately 0.38, the larger Orion model, despite its greater scale, displayed “No Agreement” with a kappa of -0.0261. This outcome suggests that adherence to instructional constraints is more critical than the sheer scale of model parameters when it comes to performing tasks governed by specific rubrics.
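The labels “Fair Agreement” and “No Agreement” resemble the widely cited Landis and Koch interpretation bands for kappa. The sketch below maps a kappa value to those bands; that the paper uses exactly these thresholds is an assumption, and the function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a Landis-and-Koch-style agreement label."""
    if kappa < 0:
        return "No agreement"
    if kappa <= 0.20:
        return "Slight agreement"
    if kappa <= 0.40:
        return "Fair agreement"
    if kappa <= 0.60:
        return "Moderate agreement"
    if kappa <= 0.80:
        return "Substantial agreement"
    return "Almost perfect agreement"


# Kappa values quoted in the summary above.
for name, k in [("Gemini-based models (approx.)", 0.38), ("Orion", -0.0261)]:
    print(f"{name}: kappa={k} -> {interpret_kappa(k)}")
```

On this scale, the reported 0.38 falls near the top of the “fair” band, while a slightly negative value such as -0.0261 means agreement no better than chance.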
The authors conclude that while current LLMs are not yet ready for autonomous certification of student competencies, they can offer substantial assistive support within a “Human-in-the-Loop” framework. This framework allows educators to leverage the strengths of LLMs in preliminary evidence extraction while maintaining oversight and judgment in final assessments.
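The division of labour described here can be sketched as a small data flow in which the model only proposes and a teacher always decides. This is an illustrative assumption about how such a workflow might be wired, not the authors' system; all class names, field names, and level labels below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMSuggestion:
    """Preliminary output from a model (hypothetical structure)."""
    competency: str
    proposed_level: str
    evidence: str  # excerpt of student work the model cites


@dataclass
class FinalAssessment:
    """The certified result, always attributed to a human reviewer."""
    competency: str
    level: str
    decided_by: str


def finalize(suggestion: LLMSuggestion, teacher_level: Optional[str]) -> FinalAssessment:
    """Issue the final level from the teacher's decision, never the model's."""
    if teacher_level is None:
        # No human review yet: refuse to certify on the model's word alone.
        raise ValueError("A teacher decision is required before certification.")
    return FinalAssessment(suggestion.competency, teacher_level, decided_by="teacher")


# Illustrative values only; the level labels are made up.
suggestion = LLMSuggestion(
    competency="Comprehension",
    proposed_level="developing",
    evidence="Student restates the problem correctly.",
)
final = finalize(suggestion, teacher_level="proficient")
print(final)
```

The design choice worth noting is that the model's proposed level never flows into the final record automatically; it serves only as extracted evidence for the teacher to accept or override.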
Future Directions
This research underscores the need for ongoing development and refinement of LLMs to align better with educational assessment requirements. As competency-based education continues to evolve, it is essential for technology to adapt accordingly, ensuring that tools used in the classroom enhance, rather than hinder, the educational experience.
In summary, the study highlights both the potential and the limitations of LLMs in the realm of educational assessments, paving the way for further exploration into their integration within traditional education systems.
