Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom
arXiv:2604.19754v1
Abstract
Automated scoring of students’ scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories, particularly those capturing advanced reasoning, remains a challenge. This study investigates augmentation strategies to improve transformer-based text classification of student responses to a physical science assessment based on an NGSS-aligned learning progression. The dataset consists of 1,466 high school responses scored on 11 binary-coded analytic categories. The rubric identifies six important components, comprising the scientific ideas needed for a complete explanation, along with five common incomplete or inaccurate ideas.
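To make the notion of class imbalance concrete, the following minimal sketch computes the share of responses coded 1 in each binary category. The toy label matrix and the function name `positive_rates` are illustrative assumptions, not the study's data or code.

```python
from typing import List


def positive_rates(labels: List[List[int]]) -> List[float]:
    """Return the fraction of responses coded 1 for each rubric category.

    `labels` holds one row per response; each row is a list of 0/1 codes,
    one per category.
    """
    if not labels:
        return []
    n_responses = len(labels)
    n_categories = len(labels[0])
    return [
        sum(row[c] for row in labels) / n_responses
        for c in range(n_categories)
    ]


# Hypothetical toy data: 4 responses x 3 categories
# (the study's dataset is 1,466 responses x 11 categories).
toy_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
rates = positive_rates(toy_labels)
```

A category whose rate sits close to 0 is one where a classifier sees very few positive examples, which is the situation the augmentation strategies target.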
Methodology
Using SciBERT as a baseline, we applied fine-tuning and tested various augmentation strategies:
- GPT-4 generated synthetic responses
- EASE, a word-level extraction and filtering approach
- ALP (Augmentation using Lexicalized Probabilistic context-free grammar) for phrase-level extraction
Results
Fine-tuning SciBERT alone improved recall over the baseline, but the augmentation strategies substantially enhanced performance. Notably, GPT-4 data boosted both precision and recall, while ALP achieved perfect precision, recall, and F1 scores across the most severely imbalanced categories (5, 6, 7, and 9). Across all rubric categories, EASE augmentation significantly increased alignment with human scoring for both scientific ideas (Categories 1-6) and inaccurate ideas (Categories 7-11).
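For reference, precision, recall, and F1 for a single binary category can be computed as in the sketch below. The function name and the toy labels are assumptions made for illustration. The example shows why recall is the metric that suffers on rare categories: a model that misses one of two positives keeps perfect precision but loses half its recall.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one binary-coded category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Return 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical rare category: 2 positives among 6 responses, 1 of them missed.
human_codes = [1, 0, 0, 0, 1, 0]
model_codes = [1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(human_codes, model_codes)
```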
Comparison with Traditional Methods
We compared the augmentation strategies with SMOTE, a traditional oversampling method, in an effort to avoid overfitting and retain the novice-level data critical for learning progression alignment. Findings demonstrate that targeted augmentation can effectively address severe imbalance while preserving conceptual coverage, offering a scalable solution for automated learning progression-aligned scoring in science education.
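As background, SMOTE creates synthetic minority examples by interpolating between a minority sample and one of its nearest minority neighbours in feature space. The sketch below is a simplified, standard-library illustration of that idea, not the study's implementation. It assumes responses have already been mapped to numeric feature vectors, and the function name `smote_like` and all parameter values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def smote_like(
    minority: List[Sequence[float]], n_new: int, k: int = 2, seed: int = 0
) -> List[List[float]]:
    """Generate `n_new` synthetic points from minority-class feature vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D feature vectors for three minority-class responses.
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority_vectors, n_new=5)
```

Because every synthetic point is a blend of existing vectors, this kind of oversampling adds no new wording or ideas, which is one reason text-level augmentation is a natural point of comparison.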
Conclusion
This study underscores the importance of innovative data augmentation techniques in enhancing the performance of transformer-based models in educational contexts. By addressing class imbalance, these strategies facilitate more accurate assessments of student learning, particularly in the realm of scientific reasoning. As educational institutions increasingly rely on automated systems for feedback, the findings presented here pave the way for more equitable and effective educational assessments.
