Evaluating LLMs for Answering Student Questions in Introductory Programming Courses
arXiv:2603.28295v1
Abstract
The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education.
While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete
solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when
providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist
educators in answering student questions within a CS1 programming course.
Introduction
In recent years, AI technologies have become increasingly prevalent in education.
As students turn to generative AI tools for assistance, concern is growing about how these tools
affect learning outcomes, particularly in introductory programming courses.
Research Methodology
To achieve our objectives, we established a rigorous, reproducible evaluation process.
This involved curating a benchmark dataset consisting of 170 authentic student questions sourced from a learning management system.
Each question was paired with ground-truth responses authored by subject matter experts to ensure accuracy and relevance.
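One way to picture such a benchmark is as a list of question/answer records that are validated before any model is evaluated. The following is a minimal sketch, not the study's actual data format: the class `BenchmarkItem`, the function `build_benchmark`, and the field names `question_id`, `question`, and `expert_answer` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark entry (hypothetical schema, assumed for illustration)."""
    question_id: str
    question: str        # student question as exported from the LMS
    expert_answer: str   # ground-truth response written by an expert


def build_benchmark(rows):
    """Validate raw dict rows and return a list of BenchmarkItem.

    Rejects duplicate ids and empty fields so that every question is
    paired with exactly one non-empty expert answer.
    """
    items = []
    seen = set()
    for row in rows:
        qid = row["question_id"]
        question = row["question"].strip()
        answer = row["expert_answer"].strip()
        if qid in seen:
            raise ValueError(f"duplicate question_id: {qid}")
        if not question or not answer:
            raise ValueError(f"empty field in item {qid}")
        seen.add(qid)
        items.append(BenchmarkItem(qid, question, answer))
    return items
```

The rows passed in would come from an LMS export. Validating up front keeps the evaluation reproducible, since a malformed item fails loudly instead of silently skewing the scores.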
Evaluation Metrics
Traditional text-matching metrics are often insufficient for evaluating open-ended educational responses.
To address this limitation, we developed and validated a custom LLM-as-a-Judge metric designed specifically for assessing
pedagogical accuracy. This metric allows for a more nuanced evaluation of LLM responses in the context of educational effectiveness.
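The general shape of an LLM-as-a-Judge metric can be sketched as follows. This is an illustration under stated assumptions, not the metric validated in the study: the prompt wording, the 1 to 5 scale, the `SCORE: <n>` reply format, and the names `JUDGE_PROMPT`, `parse_score`, `judge_answer`, and `mean_score` are all hypothetical. The judge model is passed in as a plain callable (`call_judge`), so no particular provider API is assumed.

```python
import re
from statistics import mean
from typing import Callable, Sequence

# Hypothetical grading prompt; the study's actual rubric is not reproduced here.
JUDGE_PROMPT = (
    "You are grading an answer to a student question in an introductory "
    "programming course.\n"
    "Question: {question}\n"
    "Expert answer: {expert}\n"
    "Candidate answer: {candidate}\n"
    "Rate the candidate's pedagogical accuracy from 1 (poor) to 5 (excellent). "
    "Reply in the form 'SCORE: <n>'."
)


def parse_score(reply: str) -> int:
    """Extract an integer score in 1..5 from the judge's reply."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))


def judge_answer(question: str, expert: str, candidate: str,
                 call_judge: Callable[[str], str]) -> int:
    """Ask the judge model to grade one candidate answer against the expert answer."""
    prompt = JUDGE_PROMPT.format(question=question, expert=expert, candidate=candidate)
    return parse_score(call_judge(prompt))


def mean_score(scores: Sequence[int]) -> float:
    """Aggregate per-question judge scores into a single number."""
    return mean(scores)
```

Unlike a text-matching metric, the judge sees the question and the expert answer together, so a correct paraphrase can still score well. Failing on unparseable replies, rather than defaulting to a score, keeps judge errors visible.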
Findings
Our findings indicate that advanced models, such as Gemini 3 Flash, can surpass the quality baseline of typical educator responses.
These models achieved high alignment with expert pedagogical standards, demonstrating their potential as effective educational tools.
Challenges and Recommendations
Despite these promising results, our study also highlights persistent risks associated with LLMs, such as hallucination, where the model generates incorrect or nonsensical answers.
To mitigate these risks and ensure alignment with course-specific contexts, we advocate for a “teacher-in-the-loop” implementation strategy.
This approach leverages human oversight to enhance the reliability and accuracy of AI-generated responses.
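A teacher-in-the-loop workflow can be pictured as a review queue in which no LLM draft reaches a student until an educator has approved or edited it. The sketch below is one possible arrangement, assumed for illustration; the names `Status`, `Draft`, `review`, and `releasable` are hypothetical and do not describe the study's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    """An LLM-generated answer awaiting teacher review."""
    question: str
    llm_answer: str
    status: Status = Status.PENDING
    final_answer: Optional[str] = None


def review(draft: Draft, approve: bool, edited_answer: Optional[str] = None) -> Draft:
    """Record the teacher's decision; an edited answer replaces the LLM text."""
    if draft.status is not Status.PENDING:
        raise ValueError("draft has already been reviewed")
    if approve:
        draft.status = Status.APPROVED
        draft.final_answer = edited_answer if edited_answer is not None else draft.llm_answer
    else:
        draft.status = Status.REJECTED
    return draft


def releasable(drafts: List[Draft]) -> List[str]:
    """Only approved answers are ever shown to students."""
    return [d.final_answer for d in drafts if d.status is Status.APPROVED]
```

The key design choice is that release depends on the teacher's decision rather than on model confidence, so a hallucinated draft is either corrected or rejected before a student sees it.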
Conclusion
Our research underscores the importance of a structured evaluation framework for educational LLM tools.
By abstracting our methodology into a task-agnostic evaluation framework, we advocate for a shift in how these tools are developed:
from ad-hoc, post-deployment testing to quantifiable, pre-deployment validation.
Such a transition should help the integration of AI in education maximize benefits while minimizing potential drawbacks.
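Pre-deployment validation can be reduced to an explicit gate: a candidate model is deployed only if its benchmark scores meet a stated criterion. The sketch below assumes one simple criterion, namely that the candidate's mean judge score must be at least the mean score of educator responses on the same benchmark. The decision rule, the `min_items` parameter, and the function name `passes_validation` are assumptions for illustration, not the study's procedure.

```python
from statistics import mean
from typing import Sequence


def passes_validation(candidate_scores: Sequence[float],
                      baseline_scores: Sequence[float],
                      min_items: int = 1) -> bool:
    """Return True if the candidate meets or exceeds the educator baseline.

    candidate_scores: per-question judge scores for the candidate model.
    baseline_scores:  per-question judge scores for educator responses.
    min_items:        refuse to decide on fewer scored questions than this.
    """
    if len(candidate_scores) < min_items or len(baseline_scores) < min_items:
        raise ValueError("not enough scored questions to validate")
    return mean(candidate_scores) >= mean(baseline_scores)
```

Running this check before release, with `min_items` set to the benchmark size, makes the deployment decision a recorded, repeatable comparison instead of an impression formed after students have already used the tool.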
Future Directions
As the landscape of educational technology continues to evolve, ongoing research and development will be essential.
Future work should focus on refining evaluation metrics and exploring diverse applications of LLMs in various educational contexts.
