Evaluating LLMs for Student Q&A in Intro Programming

Evaluating LLMs for Answering Student Questions in Introductory Programming Courses

Summary: arXiv:2603.28295v1

Abstract

The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education.
While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete
solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when
providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist
educators in answering student questions within a CS1 programming course.

Introduction

AI technologies have become increasingly common in education in recent years.
As students turn to generative AI tools for help, concern is growing about how these tools affect
learning outcomes, particularly in introductory programming courses.

Research Methodology

To achieve our objectives, we established a rigorous, reproducible evaluation process.
This involved curating a benchmark dataset consisting of 170 authentic student questions sourced from a learning management system.
Each question was paired with ground-truth responses authored by subject matter experts to ensure accuracy and relevance.
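
To make the shape of such a benchmark concrete, here is a minimal sketch of how question and reference-answer pairs could be stored and loaded. The field names (`question_id`, `question`, `expert_answer`), the JSON Lines layout, and the sample records are assumptions made for illustration; the study does not describe its storage format.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark entry: a student question and its expert reference answer."""
    question_id: str    # hypothetical identifier, not taken from the study
    question: str       # the student's question as posted
    expert_answer: str  # ground-truth response written by a subject matter expert


def load_benchmark(jsonl_text: str) -> list[BenchmarkItem]:
    """Parse JSON Lines text into benchmark items, skipping blank lines."""
    items = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        items.append(
            BenchmarkItem(
                question_id=record["question_id"],
                question=record["question"],
                expert_answer=record["expert_answer"],
            )
        )
    return items


# Invented sample data, for illustration only.
SAMPLE = """
{"question_id": "q001", "question": "Why does my loop never stop?", "expert_answer": "Check whether the loop variable changes inside the loop body."}
{"question_id": "q002", "question": "What does IndexError mean?", "expert_answer": "Your index is outside the list; compare it with len(my_list)."}
"""

benchmark = load_benchmark(SAMPLE)
print(len(benchmark), benchmark[0].question_id)
```

Keeping each item immutable (`frozen=True`) is one way to guard against accidental edits to the reference answers between evaluation runs.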

Evaluation Metrics

Traditional text-matching metrics are often insufficient for evaluating open-ended educational responses.
To address this limitation, we developed and validated a custom LLM-as-a-Judge metric designed specifically for assessing
pedagogical accuracy. This metric allows for a more nuanced evaluation of LLM responses in the context of educational effectiveness.
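
The contrast between the two kinds of metric can be sketched as follows. This is not the metric from the study: the prompt wording, the 1 to 5 scale, the `SCORE:` reply format, and the `call_model` parameter are all assumptions chosen for illustration. The model call is injected as a plain function so the sketch runs without any particular LLM provider.

```python
import re
from collections import Counter
from typing import Callable


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1, a typical text-matching metric."""
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical judge prompt; the study's actual prompt is not given in this summary.
JUDGE_PROMPT = """You are grading an answer to a student question in an introductory programming course.
Question: {question}
Expert reference answer: {reference}
Candidate answer: {candidate}
Rate how well the candidate matches the reference pedagogically, from 1 (poor) to 5 (excellent).
Reply with a single line of the form "SCORE: <n>"."""


def judge_score(question: str, reference: str, candidate: str,
                call_model: Callable[[str], str]) -> int:
    """Ask a judge model for a 1-5 score and parse it from the reply."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    reply = call_model(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"judge reply had no parsable score: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the example runs offline."""
    return "SCORE: 4"


question = "Why does my loop never stop?"
reference = "Check whether the loop variable changes inside the loop body."
paraphrase = "Make sure you update your counter on every iteration."

# A reasonable paraphrase shares no tokens with the reference, so overlap is zero.
print(token_f1(paraphrase, reference))
print(judge_score(question, reference, paraphrase, fake_judge))
```

The paraphrase gives the same advice as the reference yet scores 0.0 on token overlap, which is the kind of failure that motivates a judge-based metric. A real judge would replace `fake_judge` with a call to an actual model and, as the text notes, would itself need validation.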

Findings

Our findings indicate that advanced models, such as Gemini 3 Flash, can surpass the quality baseline of typical educator responses.
These models achieved high alignment with expert pedagogical standards, demonstrating their potential as effective educational tools.

Challenges and Recommendations

Despite these promising results, our study also highlights persistent risks associated with LLMs, such as hallucination, where the model generates incorrect or nonsensical answers.
To mitigate these risks and ensure alignment with course-specific contexts, we advocate for a “teacher-in-the-loop” implementation strategy.
This approach leverages human oversight to enhance the reliability and accuracy of AI-generated responses.
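
One way to picture a teacher-in-the-loop workflow is a review step that every LLM draft must pass before it reaches a student. The sketch below is a hypothetical illustration, not the implementation from the study; the `Draft`, `Status`, `review`, and `releasable` names and the sample answers are invented here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    """An LLM-generated answer waiting for a teacher's decision."""
    question: str
    llm_answer: str
    status: Status = Status.PENDING
    final_answer: Optional[str] = None


def review(draft: Draft, approve: bool, edited_answer: Optional[str] = None) -> Draft:
    """Record the teacher's decision; only approved drafts get a final answer."""
    if draft.status is not Status.PENDING:
        raise ValueError("draft was already reviewed")
    if approve:
        draft.status = Status.APPROVED
        # The teacher may correct the draft, e.g. to match course-specific context.
        draft.final_answer = edited_answer if edited_answer is not None else draft.llm_answer
    else:
        draft.status = Status.REJECTED
    return draft


def releasable(drafts: list[Draft]) -> list[str]:
    """Answers that may be sent to students: approved drafts only."""
    return [d.final_answer for d in drafts if d.status is Status.APPROVED]


# Invented examples, for illustration only.
drafts = [
    Draft("Why does my loop never stop?", "Check that the loop variable changes."),
    Draft("Which editor do we use?", "Any editor works."),
    Draft("What does IndexError mean?", "It means your file is missing."),
]
review(drafts[0], approve=True)
review(drafts[1], approve=True, edited_answer="Use the editor named in the course syllabus.")
review(drafts[2], approve=False)  # incorrect draft is blocked

print(releasable(drafts))
```

Pending and rejected drafts never appear in `releasable`, so an unreviewed or incorrect answer cannot reach a student by default.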

Conclusion

In conclusion, our research underscores the importance of a structured evaluation framework for educational LLM tools.
We abstract our methodology into a task-agnostic evaluation framework and advocate for a shift in how educational LLM tools are developed:
from ad-hoc, post-deployment testing to quantifiable, pre-deployment validation.
Such a transition should help the integration of AI in education maximize benefits while minimizing potential drawbacks.
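
A pre-deployment validation step could take the form of a simple gate: score a candidate model on the benchmark, compare it with a baseline, and deploy only if it clears a threshold. The sketch below assumes per-question scores on a 1 to 5 scale and a mean-plus-margin rule; the function name `passes_gate`, the margin, and the scores are all invented for illustration and are not from the study.

```python
from statistics import mean


def passes_gate(candidate_scores: list[float],
                baseline_scores: list[float],
                min_margin: float = 0.0) -> bool:
    """Return True when the candidate's mean score meets the baseline mean plus a margin."""
    if not candidate_scores or not baseline_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(candidate_scores) >= mean(baseline_scores) + min_margin


# Invented per-question judge scores, for illustration only.
candidate = [4, 5, 4, 3, 5]  # mean 4.2
baseline = [4, 3, 4, 4, 3]   # mean 3.6, e.g. scores of typical educator responses

print(passes_gate(candidate, baseline, min_margin=0.5))  # 4.2 >= 4.1
print(passes_gate(candidate, baseline, min_margin=1.0))  # 4.2 >= 4.6
```

A mean comparison is the simplest possible rule. With a small benchmark, a stricter gate might also look at the worst-scoring answers, since a single harmful response can matter more than the average.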

Future Directions

As the landscape of educational technology continues to evolve, ongoing research and development will be essential.
Future work should focus on refining evaluation metrics and exploring diverse applications of LLMs in various educational contexts.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
