Hierarchical Pre-Training of Vision Encoders with Large Language Models
arXiv:2604.00086v1
Abstract
The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features.
Introduction to HIVE
In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning.
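To make the idea of hierarchical cross-attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes that text hidden states act as queries and that visual features from successive encoder levels act as keys and values, fused one level at a time with a residual connection. The function names `cross_attend` and `hierarchical_fusion` are hypothetical, and the learned projections and the LLM's own layers are omitted for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text, visual):
    """Scaled dot-product cross-attention: text tokens query visual tokens.

    text:   list of T vectors of size d (queries)
    visual: list of V vectors of size d (keys and values)
    Returns T vectors of size d, with a residual connection added.
    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    d = len(text[0])
    out = []
    for q in text:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in visual
        ]
        weights = softmax(scores)
        # Weighted sum of visual vectors, one output dimension at a time.
        ctx = [sum(w * v[j] for w, v in zip(weights, visual)) for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out


def hierarchical_fusion(text, visual_levels):
    """Fuse visual features level by level instead of one flattened sequence.

    visual_levels: list of visual token lists, ordered shallow to deep
    (hypothetical ordering; the source does not specify the pairing).
    """
    hidden = text
    for visual in visual_levels:
        hidden = cross_attend(hidden, visual)
    return hidden
```

Under these assumptions, each call to `cross_attend` lets the text representation draw on a different level of the visual hierarchy, whereas a flattened design would expose only a single sequence of image embeddings to the LLM.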
Training Strategy
To optimize the interaction between vision encoders and LLMs, we introduce a three-stage training strategy:
- Stage One: Initial alignment of the vision encoder with the LLM using basic cross-attention mechanisms.
- Stage Two: Progressive refinement of feature interactions, focusing on enhancing the hierarchical structure of the visual data.
- Stage Three: Final optimization to ensure stable training and effective multimodal fusion.
This staged schedule aligns the vision encoder with the LLM progressively rather than all at once, which is intended to keep optimization stable while the visual and textual modalities are fused.
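A staged strategy of this kind is often expressed as a schedule that says which parameter groups are updated in each stage. The sketch below is purely illustrative: the source does not state which modules are frozen or trained in any stage, so the group names (`vision_encoder`, `cross_attention`, `llm`) and their assignment to stages are hypothetical assumptions, not the HIVE recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # hypothetical parameter-group names


# Hypothetical schedule: which groups train in which stage is an assumption
# made for illustration; the source does not specify it.
SCHEDULE = (
    Stage("stage_one_alignment", frozenset({"cross_attention"})),
    Stage("stage_two_refinement", frozenset({"cross_attention", "vision_encoder"})),
    Stage(
        "stage_three_optimization",
        frozenset({"cross_attention", "vision_encoder", "llm"}),
    ),
)


def trainable_flags(stage, groups=("vision_encoder", "cross_attention", "llm")):
    """Map each parameter group to True (update) or False (freeze)."""
    return {group: group in stage.trainable for group in groups}
```

In a real training loop, flags like these would typically be used to enable or disable gradient updates for each group before a stage begins.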
Empirical Evaluations
Our empirical evaluations show that HIVE achieves superior performance not only in image classification but also on a range of vision-language tasks. The framework outperforms self-attention-based methods on benchmarks such as:
- MME (a comprehensive evaluation benchmark for multimodal LLMs)
- GQA (real-world visual reasoning and compositional question answering)
- OK-VQA (Outside Knowledge Visual Question Answering)
- ScienceQA (Science Question Answering)
The results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.
Conclusion
HIVE integrates the vision encoder and the LLM more tightly than approaches that treat them as independent modules. By fusing hierarchical visual features through structured cross-attention, our approach improves vision-language alignment and performance on multimodal tasks. These results suggest that layer-wise coupling between vision encoders and LLMs is a promising direction for more efficient and expressive vision-language models.
