GazeQwen: Lightweight Gaze Integration for Video AI

GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding

Multimodal large language models (MLLMs) have advanced video understanding considerably, but they still make poor use of eye-gaze information. The newly introduced GazeQwen targets this gap by adding gaze awareness to an existing MLLM through an approach called hidden-state modulation.

Overview of GazeQwen

Current MLLMs struggle to use eye-gaze data, even when it is supplied as visual overlays or textual descriptions. GazeQwen addresses this limitation with a parameter-efficient mechanism added to an open-source MLLM. At its core is a compact gaze resampler with roughly 1 to 5 million trainable parameters. The resampler encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals, which are injected into selected LLM decoder layers through forward hooks.
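To make the injection mechanism concrete, here is a minimal PyTorch sketch of adding a residual to a layer's output through a forward hook. It is an illustration under stated assumptions, not the GazeQwen implementation: `ToyDecoderLayer`, `GazeResidual`, `attach_gaze_hook`, and `run_layers` are hypothetical names, the toy layer stands in for a real decoder layer, mean pooling stands in for the resampler, and the zero initialization is an assumption chosen so that the hook starts out as a no-op.

```python
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """Stand-in for one LLM decoder layer (hypothetical, not Qwen2.5-VL)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + torch.tanh(self.proj(hidden))


class GazeResidual(nn.Module):
    """Hypothetical module that maps gaze-conditioned features to a residual."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.to_hidden = nn.Linear(feat_dim, hidden_dim)
        # Assumption: zero init, so the injected residual is zero before training.
        nn.init.zeros_(self.to_hidden.weight)
        nn.init.zeros_(self.to_hidden.bias)

    def forward(self, gaze_feats: torch.Tensor) -> torch.Tensor:
        # Pool over gaze tokens to (batch, 1, hidden_dim); this broadcasts
        # over the sequence dimension when added to the hidden states.
        return self.to_hidden(gaze_feats.mean(dim=1, keepdim=True))


def attach_gaze_hook(layer: nn.Module, residual: nn.Module, gaze_feats: torch.Tensor):
    """Register a forward hook that adds the gaze residual to the layer output."""

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        # Real decoder layers often return tuples; this toy returns a tensor.
        return output + residual(gaze_feats)

    return layer.register_forward_hook(hook)


def run_layers(layers: nn.ModuleList, hidden: torch.Tensor) -> torch.Tensor:
    """Run the hidden states through the layers in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden
```

Because the hook is attached from outside, the base model's code is left unchanged, and calling `.remove()` on the returned handle restores the original behavior.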

Training and Integration

An optional second training stage adds low-rank adapters (LoRA) to the LLM, allowing tighter integration between the gaze signal and the model's own processing. This two-stage approach improves performance without a large increase in model size, which marks a shift in how gaze information can be used in MLLMs.
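For readers unfamiliar with LoRA, the following is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer. It shows the general technique, not GazeQwen's configuration: `LoRALinear`, `trainable_params`, and the `rank` and `alpha` defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (generic LoRA sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the original weights; only the adapter matrices are trained.
        for p in self.base.parameters():
            p.requires_grad_(False)
        # A is small random, B is zero, so the update is zero at initialization.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base(x) + scale * x A^T B^T, computed without forming the full matrix.
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + self.scale * update


def trainable_params(module: nn.Module) -> int:
    """Count parameters that will receive gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)
```

For a layer with `in_features` inputs and `out_features` outputs, the adapter trains only `rank * (in_features + out_features)` parameters, which is why this kind of second stage adds little to the model's size.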

Performance Evaluation

GazeQwen was evaluated on all ten tasks of the StreamGaze benchmark, where it reached an accuracy of 63.9%. That is 16.1 points above the baseline Qwen2.5-VL-7B model, which received gaze as visual prompts, and 10.5 points above GPT-4o. It is the highest accuracy among all open-source and proprietary models tested.
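The reported margins also imply the comparison scores. The short calculation below derives them from the three figures quoted above; the resulting baseline numbers are inferred by subtraction and are not stated separately in this article. The function name `implied_baseline` is illustrative.

```python
GAZEQWEN_ACC = 63.9          # reported StreamGaze accuracy (%)
GAIN_OVER_QWEN_PROMPT = 16.1  # points over Qwen2.5-VL-7B with gaze as visual prompts
GAIN_OVER_GPT4O = 10.5        # points over GPT-4o


def implied_baseline(score: float, gain: float) -> float:
    """Return the comparison score implied by a reported point gain."""
    # Round to one decimal to match the precision of the reported figures.
    return round(score - gain, 1)


implied = {
    "Qwen2.5-VL-7B (gaze as visual prompts)": implied_baseline(GAZEQWEN_ACC, GAIN_OVER_QWEN_PROMPT),
    "GPT-4o": implied_baseline(GAZEQWEN_ACC, GAIN_OVER_GPT4O),
}

for name, acc in implied.items():
    print(f"{name}: {acc}%")
```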

Implications and Future Directions

The findings suggest that learning where to inject gaze information within a large language model is more effective than increasing the model's size or refining prompt engineering. For future MLLM development, this indicates that targeted interventions may yield better results in multimodal settings than scale alone.

Conclusion

GazeQwen offers a lightweight, efficient way to add gaze awareness to large language models and improve their video understanding. The methods it introduces could inform future models that better capture the multimodal nature of human interaction.

Additional Resources

For those interested in exploring GazeQwen further, all code and checkpoints are available on GitHub.

Lazarus Omolua (https://richlyai.com/blog)