Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Summary: arXiv:2604.10720v1 Announce Type: new
Abstract: Artificial models that simulate how learners act and respond within educational systems are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, many existing approaches in programming education rely on prompting large, proprietary language models, raising concerns around privacy, cost, and dependence. In this work, we propose a method for training open-weight artificial programming learners using authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student’s problem-solving process as a dialogue between the learner and their automated assessment system.
Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior.
Methodology
Our innovative framework leverages a large-scale dataset of real student submissions to Python programming assignments. The training process encompasses several key elements:
- Conversational Serialization: We convert temporal log traces of student submissions into a structured dialogue format. This dialogue mimics the interaction between a learner and an automated assessment system, capturing the nuances of student thought processes.
- Feedback Integration: The alternating conversational turns include not just student submissions but also environmental feedback, such as error messages and grading outcomes. This integration allows models to simulate a realistic debugging environment.
- Model Training: Our training pipeline utilizes a combination of supervised fine-tuning and preference optimization. This dual approach ensures that our models not only learn to code but also align closely with how actual students debug their code.
Results
We evaluated our framework by training Qwen models at both 4B and 8B scales. The results were promising:
- Incorporating environment feedback significantly enhanced the models’ ability to replicate student debugging behavior.
- Our models demonstrated improved functional alignment and code similarity compared to previous code-only approaches.
- They also outperformed prompted large language models, providing a more authentic simulation of student coding practices.
Conclusion
Our research indicates that using authentic student process data can lead to significant advancements in how language models are trained for programming education. By utilizing conversational serialization, we can create models that not only understand coding but also mimic the iterative process students undergo while debugging. This approach addresses critical concerns regarding privacy and dependence on proprietary models, paving the way for more accessible and effective educational tools.
We are committed to supporting reproducibility in our research by releasing our code, enabling other researchers and educators to build upon our findings and contribute to the evolving field of artificial intelligence in education.
