Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback
arXiv:2511.08225v2
Abstract: As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback.
Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions:
- Implicit cues: lexicon-based swaps of gendered terms within essays.
- Explicit cues: gendered author background in the prompt.
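The two counterfactual dimensions above can be illustrated with a minimal sketch. The word list, the function names `swap_gendered_terms` and `build_explicit_prompt`, and the prompt wording are all hypothetical and chosen for illustration; the lexicon and prompts actually used in the study are not given here.

```python
import re

# Hypothetical, deliberately tiny lexicon (assumption: the study's real
# word list is larger and is not reproduced in this abstract).
MALE_TO_FEMALE = {
    "he": "she",
    "him": "her",
    "his": "her",
    "boy": "girl",
    "father": "mother",
    "brother": "sister",
}


def swap_gendered_terms(text, lexicon):
    """Implicit cue: replace whole-word lexicon matches in an essay,
    preserving simple capitalization (lower, Capitalized, UPPER)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in lexicon) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        target = lexicon[word.lower()]
        if len(word) > 1 and word.isupper():
            return target.upper()
        if word[0].isupper():
            return target.capitalize()
        return target

    return pattern.sub(replace, text)


def build_explicit_prompt(essay, gender):
    """Explicit cue: state the author's gender in the prompt itself.
    The template wording is an assumption, not the study's prompt."""
    return (
        f"The following essay was written by a {gender} student. "
        f"Please provide formative feedback.\n\n{essay}"
    )


if __name__ == "__main__":
    essay = "He thanked his brother."
    print(swap_gendered_terms(essay, MALE_TO_FEMALE))
    print(build_explicit_prompt(essay, "female"))
```

A plain word-for-word lexicon is not symmetric: in the female-to-male direction, "her" can map to either "him" or "his", so a reverse swap would need part-of-speech information or manual review.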
We investigated six representative LLMs:
- GPT-5 mini
- GPT-4o mini
- DeepSeek-R1
- DeepSeek-R1-Qwen
- Gemini 2.5 Pro
- Llama-3-8B
We first quantified response divergence using cosine and Euclidean distances over sentence embeddings, then assessed significance with permutation tests, and finally visualized the structure of the embedding space using dimensionality reduction techniques.
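The analysis steps described above can be sketched with NumPy, assuming the sentence embeddings have already been computed and stored as arrays of shape (n_responses, n_dimensions). The function names, the choice of a two-sided difference-in-means statistic for the permutation test, and the use of PCA for the projection are assumptions made for illustration; the abstract does not specify the exact test statistic or reduction technique.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity) between two
    (n, d) arrays of original and counterfactual embeddings."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - dot / norms


def euclidean_distance(a, b):
    """Row-wise Euclidean distance between two (n, d) arrays."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.linalg.norm(diff, axis=1)


def permutation_test(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean distance
    between two groups (for example, male-to-female versus
    female-to-male shifts). Returns (observed_difference, p_value)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[: len(x)].mean() - shuffled[len(x):].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def pca_2d(embeddings):
    """Project embeddings onto their first two principal components
    (PCA is an assumed stand-in for the unspecified technique)."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


if __name__ == "__main__":
    # Synthetic stand-in data only; these are not results from the study.
    rng = np.random.default_rng(42)
    original = rng.normal(size=(50, 16))
    counterfactual = original + rng.normal(scale=0.1, size=(50, 16))
    shifts = cosine_distance(original, counterfactual)
    print(permutation_test(shifts[:25], shifts[25:], n_permutations=2000))
    print(pca_2d(counterfactual).shape)
```

In this setup each distance measures how far a model's feedback moves when only the gender cue changes, and the permutation test asks whether the mean shift differs between two sets of counterfactuals more than random relabeling would produce.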
Across all models, implicit manipulations reliably induced larger semantic shifts for male-to-female counterfactuals than for female-to-male ones. Only the GPT and Llama models were sensitive to explicit gender cues. These results suggest that even state-of-the-art LLMs respond asymmetrically to gender substitutions, indicating persistent gender bias in the feedback provided to learners.
Qualitative analyses further revealed notable linguistic differences in feedback. For example, feedback under male cues tended to be more autonomy-supportive, while feedback under female cues was often more controlling. This raises critical questions about the implications of using LLMs in educational contexts, especially regarding fairness and equity.
In light of these findings, we discuss several implications for fairness auditing of pedagogical GenAI:
- Establishing comprehensive reporting standards for counterfactual evaluation in learning analytics.
- Developing practical guidance for prompt design and deployment to ensure equitable feedback.
- Encouraging educators to critically assess the outputs of LLMs in their practice and to be aware of potential biases.
As AI becomes more prevalent in education, educators and researchers need to collaborate on frameworks that benchmark LLMs not only for effectiveness but also for fairness and equity in student learning experiences. The findings of this study call for continued research and development in educational AI and for sustained vigilance against embedded biases that could affect student outcomes.
