Benchmarking Educational LLMs for Gender Bias in Feedback

Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback

Source: arXiv:2511.08225v2

Abstract: As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback.

Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions:

  • Implicit cues: lexicon-based swaps of gendered terms within essays.
  • Explicit cues: gendered author background in the prompt.
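The two cue types above can be illustrated with a minimal sketch. It is not the study's implementation: the mini-lexicon, the function names `swap_gendered_terms` and `build_prompt`, and the prompt wording are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon for illustration only; the study's actual
# word list is not reproduced here.
MALE_TO_FEMALE = {
    "he": "she", "him": "her", "his": "her",
    "boy": "girl", "man": "woman", "father": "mother",
}

def swap_gendered_terms(text, lexicon):
    """Implicit cue: replace whole-word lexicon matches in the essay text.

    The first letter's capitalisation is preserved. The reverse direction
    is harder, because "her" can map to either "him" or "his".
    """
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, lexicon)) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        swapped = lexicon[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return pattern.sub(replace, text)

def build_prompt(essay, author_gender=None):
    """Explicit cue: optionally state the author's gender in the prompt.

    The prompt wording is an assumption, not the wording used in the study.
    """
    background = f"The author is a {author_gender} student. " if author_gender else ""
    return f"{background}Give formative feedback on this essay:\n\n{essay}"

if __name__ == "__main__":
    essay = "He lost his book, so the boy asked his father for help."
    print(swap_gendered_terms(essay, MALE_TO_FEMALE))
    print(build_prompt(essay, author_gender="female"))
```

Matching on word boundaries keeps the swap from touching substrings such as the "he" inside "the". A real lexicon would also need to handle names, titles, and ambiguous pronouns.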

We investigated six representative LLMs:

  • GPT-5 mini
  • GPT-4o mini
  • DeepSeek-R1
  • DeepSeek-R1-Qwen
  • Gemini 2.5 Pro
  • Llama-3-8B

We first quantified response divergence using cosine and Euclidean distances over sentence embeddings, then assessed significance with permutation tests, and finally visualized the structure of the responses using dimensionality reduction techniques.
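The distance and significance steps can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the abstract does not specify the permutation scheme, so the example uses a common two-sample label-shuffling test on mean distances, and it assumes the embeddings are already available as plain lists of floats. The function names are hypothetical.

```python
import math
import random

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for the difference in mean distance between groups.

    Assumed scheme: pool both groups, reshuffle the group labels, and count
    how often the shuffled mean difference is at least as large as the
    observed one. The +1 terms avoid reporting a p-value of exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

if __name__ == "__main__":
    # Toy distances, invented for illustration; not results from the study.
    shifts_one_direction = [0.31, 0.28, 0.35, 0.30, 0.33]
    shifts_other_direction = [0.22, 0.25, 0.21, 0.24, 0.23]
    print(permutation_test(shifts_one_direction, shifts_other_direction))
```

In practice, each distance would be computed between the embedding of the feedback for an original essay and the embedding of the feedback for its counterfactual, and the two groups passed to the test would be the distances from the two conditions being compared.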

Across all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male ones. Only the GPT and Llama models were sensitive to explicit gender cues. These results suggest that even state-of-the-art LLMs respond asymmetrically to gender substitutions, indicating persistent gender bias in the feedback provided to learners.

Qualitative analyses further revealed linguistic differences in the feedback. For example, feedback under male cues tended to be more autonomy-supportive, while feedback under female cues was often more controlling. These differences raise questions about the fairness and equity of LLM-generated feedback in educational contexts.

In light of these findings, we discuss several implications for fairness auditing of pedagogical GenAI:

  • Establishing comprehensive reporting standards for counterfactual evaluation in learning analytics.
  • Developing practical guidance for prompt design and deployment to ensure equitable feedback.
  • Encouraging educators to critically assess the outputs of LLMs in their practice and to be aware of potential biases.

As AI becomes more prevalent in education, educators and researchers need to collaborate on frameworks that benchmark LLMs not only for effectiveness but also for fairness and equity in student learning experiences. The findings of this study call for continued research on educational AI and for vigilance against embedded biases that could affect student outcomes.


Lazarus Omolua