Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
In recent years, the rapid adoption of large language models (LLMs) in educational settings has introduced profound challenges for assessment design. As educators increasingly integrate LLM-based tools, assessments must be adapted to the capabilities and limitations of these models. Current evaluations of LLMs rely primarily on descriptive statistics derived from benchmark tests, which raises concerns about how well those evaluations can inform assessment design. This article discusses an approach that combines educational data mining with psychometric theory to address these challenges.
Understanding the Challenges
As LLMs become more prevalent, it is vital to characterize their strengths and weaknesses in a way that is generalizable, valid, and reliable. However, existing research has largely overlooked theory-grounded measurement methods for comparing LLM capabilities with those of human learners. This gap highlights the need for a systematic way to identify where LLMs outperform or underperform human learners.
A Statistically Principled Approach
To fill this gap, researchers have introduced a method based on Differential Item Functioning (DIF) analysis, which is traditionally used to detect item bias, that is, items on which test takers from different demographic groups perform differently even when their overall ability is matched. The new approach combines DIF analysis with negative control analysis and item-total correlation discrimination analysis. With this method, researchers can identify specific assessment items on which humans and LLMs exhibit systematic response differences. This identification process serves two primary purposes:
- To pinpoint areas where assessments may be most vulnerable to AI misuse.
- To determine which task dimensions render problems particularly easy or difficult for generative AI.
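To make the idea concrete, the following is a minimal sketch of two of the ingredients named above: a DIF statistic and an item-total correlation. The source does not say which DIF procedure the researchers used, so the Mantel-Haenszel statistic here is an assumption chosen because it is a standard DIF method; the function names, the choice of humans as the reference group and chatbots as the focal group, and the toy data are all hypothetical. The negative control analysis is not shown, since the source gives no detail on it.

```python
import numpy as np

def mantel_haenszel_dif(item, total, is_focal):
    """Mantel-Haenszel DIF for one right/wrong item (illustrative sketch).

    item     : 0/1 responses to the item under study
    total    : matching score (typically the total test score)
    is_focal : True for focal-group respondents (assumed here: chatbots)
    Returns (alpha_MH, delta_MH). delta_MH = -2.35 * ln(alpha_MH) is the
    ETS delta scale; negative values mean the item favors the reference group.
    """
    item = np.asarray(item, dtype=int)
    total = np.asarray(total)
    is_focal = np.asarray(is_focal, dtype=bool)
    num, den = 0.0, 0.0
    for score in np.unique(total):
        stratum = total == score
        ref = item[stratum & ~is_focal]
        foc = item[stratum & is_focal]
        if ref.size == 0 or foc.size == 0:
            continue  # a stratum with only one group carries no information
        a, b = np.sum(ref == 1), np.sum(ref == 0)  # reference right / wrong
        c, d = np.sum(foc == 1), np.sum(foc == 0)  # focal right / wrong
        n = ref.size + foc.size
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        return float("nan"), float("nan")  # odds ratio undefined
    alpha = num / den
    return float(alpha), float(-2.35 * np.log(alpha))

def corrected_item_total(responses):
    """Correlation of each item with the total score of the remaining items.

    responses : 2-D array, rows are respondents, columns are 0/1 items.
    Items with no variance yield nan.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    return np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(responses.shape[1])
    ])

# Toy data (invented): two score strata, 4 humans and 4 chatbots in each.
item = [1, 1, 1, 0, 1, 0, 0, 0] * 2
is_focal = ([False] * 4 + [True] * 4) * 2
total = [1] * 8 + [2] * 8
alpha, delta = mantel_haenszel_dif(item, total, is_focal)
print(alpha, delta)
```

In this toy example the humans answer the item correctly more often than the chatbots at the same matching score, so the common odds ratio is above 1 and the delta value is negative. Real analyses would also need a significance test and a flagging threshold, which this sketch leaves out.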
Evaluation of the Method
The effectiveness of this method was evaluated using responses from human learners and six leading chatbots: ChatGPT-4o, ChatGPT-5.2, Gemini 1.5, Gemini 3 Pro, Claude 3.5, and Claude 4.5 Sonnet. The analysis was conducted on two distinct instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts subsequently reviewed the DIF-flagged items to characterize the task dimensions associated with chatbot over- or under-performance.
Key Findings
The results demonstrated that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge. This understanding is crucial for designing assessments that remain valid, reliable, and fair when AI tools are in use. The study’s findings underline the importance of adapting educational assessments to the specific characteristics of LLMs so that evaluations remain effective and equitable.
Conclusion
As the integration of AI tools in education continues to evolve, the proposed method represents a significant step toward developing assessments that are resilient to the challenges posed by LLMs. By leveraging DIF analysis and other statistical methods, educators can design assessments that not only measure learning outcomes effectively but also maintain fairness and integrity in the AI era.
