Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models
arXiv:2604.04482v1
Abstract
Learners’ use of video controls in educational videos provides implicit signals of cognitive processing
and instructional design quality. However, the lack of scalable and explainable predictive models limits
instructors’ ability to anticipate such behavior before deployment. To address this challenge, we propose
a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and
rewinding behavior as proxies for cognitive load based solely on video content.
Introduction
Understanding how learners interact with video content is central to online education.
These interactions indicate cognitive engagement and instructional effectiveness.
Traditional analyses often fall short because they do not provide timely, interpretable insights
into learner behavior. This article presents an approach that uses multimodal large language models
(MLLMs) to improve prediction accuracy and to explain the interactions observed in educational videos.
Methodology
Our approach leverages multimodal large language models to compute embeddings of short video segments
and trains a neural classifier to identify temporally fine-grained interaction peaks. The methodology
is built on the following components:
- Video Segmentation: Short segments of educational videos are isolated for analysis.
- Embedding Computation: MLLMs generate embeddings that capture the semantic and visual features of each segment.
- Neural Classification: A classifier is trained to predict learner interactions based on the embeddings obtained.
- Feature Coding: Features of the video segments are coded using GPT-5, facilitating model interpretation through concept activation vectors.
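The classification and interpretation steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional toy embeddings, the peak labels, and the coded feature ("dense on-screen text") are all hypothetical, and a plain logistic regression stands in for the neural classifier. The concept activation vector is approximated here by the difference of class means, a simpler stand-in for the linear-classifier normal commonly used for this purpose. For a linear model, the sensitivity of the logit to the concept reduces to the dot product of the weights with that vector.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent for logistic regression; returns (weights, bias)."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        # Prediction errors are computed once per epoch, before any update.
        errs = [sigmoid(dot(w, x) + b) - t for x, t in zip(X, y)]
        for j in range(dim):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errs, X)) / n
        b -= lr * sum(errs) / n
    return w, b

def predict(w, b, x):
    """Return 1 (interaction peak) if the logit is positive, else 0."""
    return 1 if dot(w, x) + b > 0 else 0

def concept_activation_vector(with_concept, without_concept):
    """Unit vector from the mean 'without' embedding to the mean 'with' embedding."""
    dim = len(with_concept[0])
    mean_pos = [sum(e[j] for e in with_concept) / len(with_concept) for j in range(dim)]
    mean_neg = [sum(e[j] for e in without_concept) / len(without_concept) for j in range(dim)]
    diff = [p - q for p, q in zip(mean_pos, mean_neg)]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def concept_sensitivity(w, cav):
    """Directional derivative of a linear model's logit along the CAV."""
    return dot(w, cav)

# Hypothetical segment embeddings (real MLLM embeddings have far more dimensions).
embeddings = [[0.1, 0.9], [0.2, 0.1], [0.0, 0.5], [2.9, 0.8], [3.1, 0.2], [3.0, 0.5]]
is_peak = [0, 0, 0, 1, 1, 1]  # hypothetical peak labels per segment

w, b = train_logistic(embeddings, is_peak)
predictions = [predict(w, b, x) for x in embeddings]

# Hypothetical coded feature: segments 3-5 show dense on-screen text.
cav = concept_activation_vector(embeddings[3:], embeddings[:3])
sensitivity = concept_sensitivity(w, cav)
print(predictions, sensitivity)
```

A positive sensitivity means that moving an embedding toward the coded concept raises the predicted likelihood of an interaction peak, which is the kind of statement that makes the classifier readable in instructional terms.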
Evaluation
Our pipeline was evaluated on a substantial dataset comprising 77 million video control events from 66
online courses. The findings reveal several key insights:
- Classifiers based on MLLM embeddings consistently predicted interaction peaks with high accuracy.
- The model demonstrated a strong capacity to generalize across previously unseen academic fields.
- Encoded features were interpretable, aligning with instructional concepts rooted in multimedia learning theory.
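Generalization to unseen academic fields is typically checked by holding out every course from one field at a time. The sketch below shows one such split; the source does not describe its exact protocol, so the function name `leave_one_field_out` and the field labels are assumptions made for illustration.

```python
from collections import defaultdict

def leave_one_field_out(fields):
    """Yield (held_out_field, train_indices, test_indices) for each distinct field.

    `fields` holds one academic-field label per video segment (or per course).
    """
    by_field = defaultdict(list)
    for i, field in enumerate(fields):
        by_field[field].append(i)
    for held_out in sorted(by_field):
        test_idx = by_field[held_out]
        # Train on every segment whose field differs from the held-out one.
        train_idx = sorted(
            i for field, idxs in by_field.items() if field != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx

# Hypothetical field labels for five segments.
fields = ["physics", "physics", "biology", "history", "biology"]
for held_out, train_idx, test_idx in leave_one_field_out(fields):
    print(held_out, train_idx, test_idx)
```

Training on `train_idx` and scoring on `test_idx` for each fold gives a per-field estimate of how well the classifier transfers to content from a discipline it has not seen.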
Conclusion
The results of our study show that cost-efficient, interpretable pre-screening of educational
video design is feasible. The approach improves the understanding of learner interactions and
opens new avenues for empirically examining multimedia learning theory at scale.
By connecting instructional design with predictive analytics, we aim to help educators
optimize their video content for learner engagement and cognitive processing.
