Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement
arXiv:2603.24208v1 (cross-listed)
Abstract: Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality.
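For readers new to the baseline setting, here is a minimal sketch of the classic temperature-softened distillation loss: the KL divergence between teacher and student class distributions, scaled by the squared temperature. This is the standard formulation, not the TMKD objective itself. The function names and the default temperature are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Mean KL(teacher || student) on softened distributions, times T^2.

    Both inputs have shape (batch, num_classes).
    The temperature value is an assumed default, not taken from the paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = np.sum(p_teacher * (log_p_teacher - log_p_student), axis=-1)
    return float(temperature ** 2 * kl.mean())

# Toy usage: one batch of two samples over three classes.
teacher = np.array([[5.0, 1.0, 0.0], [0.0, 4.0, 1.0]])
student = np.array([[2.0, 1.5, 0.5], [0.5, 2.0, 1.0]])
print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. A better teacher changes the target distribution, which is the lever this work focuses on.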
In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which uses dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. The visual teacher is enhanced with multi-view inputs that integrate visual priors such as edge and high-frequency features. The text teacher generates semantic weights through prior-aware prompts, and these weights guide adaptive feature fusion.
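The two ideas in this paragraph can be sketched in a few lines, under stated assumptions. Here the edge view is a Sobel gradient magnitude, the high-frequency view is the image minus a 3x3 box blur, and the semantic weights are a softmax over the cosine similarity between each view's feature and a matching prompt embedding. These concrete choices, and all function names, are hypothetical stand-ins rather than the operators used in TMKD. The prompt embeddings are random placeholders where a CLIP text encoder would be used.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
BOX_BLUR = np.full((3, 3), 1.0 / 9.0)

def filter3x3(img, kernel):
    """Apply a 3x3 filter (cross-correlation) with edge-replicated padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def make_views(img):
    """Build three views of a grayscale image: original, edge, high-frequency."""
    img = np.asarray(img, dtype=float)
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_X.T)
    edge = np.hypot(gx, gy)                       # gradient magnitude
    high_freq = img - filter3x3(img, BOX_BLUR)    # remove the low-pass part
    return {"original": img, "edge": edge, "high_frequency": high_freq}

def semantic_weights(view_feats, prompt_embs, temperature=0.1):
    """Softmax over per-view cosine similarity with its prompt embedding.

    view_feats, prompt_embs: arrays of shape (num_views, dim).
    """
    v = np.asarray(view_feats, dtype=float)
    t = np.asarray(prompt_embs, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    scores = np.sum(v * t, axis=1) / temperature
    scores = scores - scores.max()
    w = np.exp(scores)
    return w / w.sum()

def fuse(view_feats, weights):
    """Weighted sum of per-view features, giving one fused feature vector."""
    return np.asarray(weights) @ np.asarray(view_feats, dtype=float)

# Toy usage with random data standing in for real images and encoders.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
views = make_views(image)
feats = np.stack([v.mean(axis=0) for v in views.values()])  # toy features, shape (3, 8)
prompts = rng.normal(size=feats.shape)  # placeholder for CLIP text embeddings
weights = semantic_weights(feats, prompts)
fused = fuse(feats, weights)
print(weights, fused.shape)
```

In a real pipeline the per-view features would come from the visual teacher network, not from column means. The point of the sketch is only the data flow: views in, one weight per view from the text side, one fused feature out.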
Key Components of TMKD
- Dual-modality Teachers: Combining a visual teacher with a text teacher (CLIP) gives the student a richer supervisory signal.
- Multi-view Inputs: The visual teacher receives multi-view inputs that integrate visual priors such as edge and high-frequency features, which enhances the knowledge it provides.
- Prior-aware Prompts: These prompts let the text teacher produce semantic weights, which guide adaptive feature fusion.
- Vision-language Contrastive Regularization: This regularization strengthens the semantic knowledge in the student model so that its learned features align more closely with the intended semantics.
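To make the last component more concrete, the following is a minimal sketch of one common form of vision-language contrastive loss: a CLIP-style cross-entropy over cosine similarities between student image features and per-class text embeddings. The exact regularizer used in TMKD is not specified in this summary, so the formulation, the function names, and the temperature are assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_reg(student_feats, class_text_embs, labels, temperature=0.07):
    """Cross-entropy over image-to-text cosine similarities.

    student_feats:   (batch, dim) features from the student model
    class_text_embs: (num_classes, dim) text embeddings, one per class
    labels:          (batch,) integer class index for each sample
    """
    s = l2_normalize(np.asarray(student_feats, dtype=float))
    t = l2_normalize(np.asarray(class_text_embs, dtype=float))
    labels = np.asarray(labels)
    logits = s @ t.T / temperature                     # (batch, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy usage: random vectors stand in for student features and text embeddings.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(5, 16))
labels = np.array([0, 3, 1, 4])
feats = text_embs[labels] + 0.1 * rng.normal(size=(4, 16))
print(contrastive_reg(feats, text_embs, labels))
```

The loss falls as each student feature moves toward the text embedding of its own class and away from the others, which is the sense in which the student's features become more semantically aligned.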
Experimental Validation
We conducted extensive experiments across five benchmarks to validate our approach. TMKD consistently improves knowledge distillation performance, by up to 4.49%, which supports the effectiveness of the dual-teacher multi-view enhancement strategy.
Conclusion
Our findings highlight the importance of not just the strategies used in knowledge distillation but also the quality of the knowledge being transferred. By enhancing the capabilities of teacher models through multi-view inputs and semantic guidance, we can achieve more efficient and reliable inference in smaller student models.
For those interested in exploring our work further, the code is available at this link.
