Single-agent vs. Multi-agents for Automated Video Analysis of On-Screen Collaborative Learning Behaviors
Summary: arXiv:2604.03631v1 Announce Type: new
Abstract: On-screen learning behavior provides valuable insights into how students seek, use, and create information during learning. Analyzing on-screen behavioral engagement is essential for capturing students’ cognitive and collaborative processes. The recent development of Vision Language Models (VLMs) offers new opportunities to automate the labor-intensive manual coding often required for multimodal video data analysis.
In this study, we compared the performance of both leading closed-source VLMs (Claude-3.7-Sonnet, GPT-4.1) and an open-source VLM (Qwen2.5-VL-72B) in single- and multi-agent settings for automated coding of screen recordings in collaborative learning contexts based on the ICAP framework. In particular, we proposed and compared two multi-agent frameworks:
- Three-agent workflow multi-agent system (MAS): This system segments screen videos by scene and detects on-screen behaviors using cursor-informed VLM prompting with evidence-based verification.
- Autonomous-decision MAS: Inspired by ReAct, this system iteratively interleaves reasoning, tool-like operations (segmentation, classification, validation), and observation-driven self-correction to produce interpretable on-screen behavior labels.
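The difference between the two designs is mainly one of control flow, which a minimal Python sketch can illustrate. Everything in it is an assumption for illustration rather than the authors' implementation: the three tool functions are deterministic stubs standing in for VLM calls, the fixed 10-second windows replace real scene detection, the cursor-event-to-label rule is an arbitrary toy mapping, and the retry limit of two attempts is invented. Only the four ICAP mode names (Interactive, Constructive, Active, Passive) come from the framework itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ICAP(Enum):
    """The four engagement modes of the ICAP framework."""
    INTERACTIVE = "interactive"
    CONSTRUCTIVE = "constructive"
    ACTIVE = "active"
    PASSIVE = "passive"


@dataclass
class Segment:
    start_s: float
    end_s: float
    label: Optional[ICAP] = None
    evidence: str = ""
    verified: bool = False
    attempts: int = 0  # number of classification attempts so far


def segment_tool(duration_s: float, scene_len_s: float = 10.0) -> List[Segment]:
    """Stub segmentation: fixed-length windows stand in for scene detection."""
    segments, start = [], 0.0
    while start < duration_s:
        end = min(start + scene_len_s, duration_s)
        segments.append(Segment(start, end))
        start = end
    return segments


def classify_tool(seg: Segment, cursor_events: Dict[int, str]) -> None:
    """Stub classification: an arbitrary toy rule stands in for a VLM prompt."""
    event = cursor_events.get(int(seg.start_s), "idle")
    toy_rule = {"typing": ICAP.CONSTRUCTIVE, "scrolling": ICAP.ACTIVE, "idle": ICAP.PASSIVE}
    seg.label = toy_rule.get(event)  # None when the event is not covered
    seg.evidence = f"cursor event: {event}"
    seg.attempts += 1


def verify_tool(seg: Segment) -> None:
    """Stub verification: accept only a label that comes with evidence."""
    seg.verified = seg.label is not None and bool(seg.evidence)


def workflow_mas(duration_s: float, cursor_events: Dict[int, str]) -> List[Segment]:
    """Fixed pipeline: segment, then classify, then verify, in that order."""
    segments = segment_tool(duration_s)
    for seg in segments:
        classify_tool(seg, cursor_events)
    for seg in segments:
        verify_tool(seg)
    return segments


def autonomous_mas(
    duration_s: float, cursor_events: Dict[int, str], max_steps: int = 50
) -> Tuple[List[Segment], List[str]]:
    """ReAct-style loop: inspect state, pick a tool, act, observe, repeat."""
    segments: Optional[List[Segment]] = None
    trace: List[str] = []  # record of chosen actions, kept for interpretability
    for _ in range(max_steps):
        # Reason: decide the next action from the current state.
        if segments is None:
            action, target = "segment", None
        else:
            todo = [s for s in segments
                    if not s.verified and (s.evidence or s.attempts < 2)]
            if not todo:
                break  # every segment is verified or out of retries
            target = todo[0]
            action = "verify" if target.evidence else "classify"
        # Act: call the chosen tool.
        if action == "segment":
            segments = segment_tool(duration_s)
        elif action == "classify":
            classify_tool(target, cursor_events)
        else:
            verify_tool(target)
            # Observe: a failed check clears the evidence, so the next
            # iteration re-classifies the segment (self-correction).
            if not target.verified:
                target.evidence = ""
        trace.append(action)
    return segments or [], trace
```

In the workflow version the order of operations is fixed in code, whereas in the autonomous version it emerges from the state at each step, and the recorded trace shows which tool was chosen and when.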
Experimental results demonstrated that the two proposed MAS frameworks achieved viable performance, outperforming the single VLMs in scene and action detection tasks. It is worth noting that:
- The three-agent workflow MAS achieved the best performance in scene detection.
- The autonomous-decision MAS excelled in action detection.
This study highlights the effectiveness of VLM-based multi-agent systems for video analysis and contributes a scalable framework for multimodal data analytics. The findings suggest practical applications in educational technology, collaborative learning environments, and automated assessment tools.
As educational institutions increasingly integrate technology into learning environments, video analysis of student behavior becomes more important. Automating this analysis can reduce the time and resources that manual coding requires, and it may improve the consistency of the coded data. The multi-agent systems demonstrated in this study offer a promising avenue for future research and development in educational analytics.
In conclusion, moving from single-agent to multi-agent systems improved scene and action detection in the reported experiments. Dividing the analysis among collaborating agents allows a more nuanced reading of on-screen behaviors, which may in turn support improved educational practices and learner outcomes.
