Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition
arXiv:2604.09585v1
Abstract
Large Language Models (LLMs) have emerged as foundation models for Internet of Things (IoT) applications such as human activity recognition (HAR). However, feeding high-frequency, multi-dimensional sensor data such as eye-tracking data directly to these models leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms eye-tracking sensor signals into data visualization images, which then serve as input to multimodal LLMs (MLLMs).
Introduction
Artificial Intelligence (AI) has advanced the capabilities of human activity recognition systems. Among the data types available, eye-tracking data stands out for how precisely it captures human attention and behavior. Using this data with MLLMs, however, raises three challenges: the dimensionality of the data, the token cost of encoding it, and the fidelity of the information that survives encoding. This article explores an approach that presents eye-tracking data to the model as visual prompts, tailored specifically to MLLM-based HAR.
Methodology
We evaluated visual prompting across three publicly available eye-tracking datasets. Raw sensor data was converted into three distinct types of visualization:
- Timeline Visualizations: These representations depict the progression of gaze over time, allowing for an interpretation of attention shifts.
- Heatmaps: By illustrating areas of focus with varying intensity, heatmaps provide insights into where users direct their attention most frequently.
- Scanpaths: This type of visualization traces the paths of gaze movement, offering a comprehensive view of visual exploration behavior.
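As a rough illustration of the heatmap variant, the sketch below bins gaze samples into a coarse grid and writes the result as a grayscale PGM image using only the Python standard library. The sample format (normalized `(x, y)` pairs), the grid size, and the function names `gaze_heatmap` and `write_pgm` are assumptions made for illustration; the source does not describe how its visualizations were rendered.

```python
from typing import List, Sequence, Tuple


def gaze_heatmap(samples: Sequence[Tuple[float, float]],
                 cols: int = 32, rows: int = 18) -> List[List[int]]:
    """Count gaze samples per grid cell.

    Assumes each sample is an (x, y) pair normalized to [0, 1];
    samples outside that range are ignored.
    """
    grid = [[0] * cols for _ in range(rows)]
    for x, y in samples:
        if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            c = min(int(x * cols), cols - 1)  # clamp x == 1.0 into the last column
            r = min(int(y * rows), rows - 1)  # clamp y == 1.0 into the last row
            grid[r][c] += 1
    return grid


def write_pgm(grid: List[List[int]], path: str) -> None:
    """Save the grid as a plain-text (P2) grayscale image scaled to 0..255."""
    peak = max(max(row) for row in grid) or 1  # avoid division by zero on empty input
    with open(path, "w", encoding="ascii") as f:
        f.write(f"P2\n{len(grid[0])} {len(grid)}\n255\n")
        for row in grid:
            f.write(" ".join(str(v * 255 // peak) for v in row) + "\n")
```

Calling `write_pgm(gaze_heatmap(samples), "heatmap.pgm")` on one temporal window of samples would produce one image per window. A timeline or scanpath rendering would start from the same samples but keep their temporal order, which the heatmap discards.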
Findings
The evaluation focused on the effectiveness of these visualizations under different temporal window sizes. Our findings indicate that:
- Visual prompting significantly reduces token costs while preserving critical information from eye-tracking data.
- MLLMs exhibited improved reasoning capabilities when provided with visual representations, as opposed to raw sensor data.
- The choice of visualization type influenced model performance, with heatmaps yielding particularly strong results in recognizing human activities.
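To make the window-size and token-cost comparison more concrete, the following sketch splits a gaze recording into fixed-length temporal windows and roughly estimates the token cost of serializing each window as text. The `(timestamp, x, y)` sample layout and the function names are assumptions, and the four-characters-per-token ratio is a coarse heuristic rather than any model's actual tokenizer; none of these come from the source.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[float, float, float]  # (timestamp in seconds, x, y); assumed layout


def split_windows(samples: List[Sample], window_s: float) -> List[List[Sample]]:
    """Group samples into consecutive, non-overlapping windows of window_s seconds.

    Window boundaries are measured from the first sample's timestamp.
    Windows that contain no samples are skipped.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    if not samples:
        return []
    t0 = samples[0][0]
    buckets: Dict[int, List[Sample]] = {}
    for s in samples:
        idx = int((s[0] - t0) // window_s)
        buckets.setdefault(idx, []).append(s)
    return [buckets[k] for k in sorted(buckets)]


def estimate_text_tokens(window: List[Sample], chars_per_token: float = 4.0) -> int:
    """Rough token estimate for a window serialized as CSV-style text.

    chars_per_token is a heuristic assumption, not a real tokenizer.
    """
    text = "\n".join(f"{t:.3f},{x:.3f},{y:.3f}" for t, x, y in window)
    return math.ceil(len(text) / chars_per_token)
```

Under this estimate the text cost grows linearly with window length and sampling rate, whereas the cost of an image prompt is typically set by image resolution and the specific MLLM, not by the number of samples drawn into it.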
Conclusion
This study demonstrates the potential of visual prompting strategies in enhancing the integration of eye-tracking data within MLLM frameworks for human activity recognition. The findings suggest that transforming high-frequency sensor signals into visual formats not only mitigates information loss but also fosters scalable and efficient representations essential for IoT applications. Future research will aim to further refine these methodologies and explore additional visualization techniques to optimize MLLM performance in diverse contexts.
Implications for Future Research
As AI and IoT continue to evolve, our research highlights the importance of how sensor data is represented to a model. Work at the intersection of visual analytics and machine learning can open new avenues for human-computer interaction and lead to more intelligent and responsive systems.
