Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation
A recent arXiv preprint (arXiv:2603.26330v1) examines a side effect of supervised fine-tuning (SFT) in vision-language models (VLMs). SFT tends to improve perceptual capabilities, but it often degrades reasoning performance, a phenomenon the authors term the “reasoning tax.” This article summarizes the study's findings and the approach it proposes to address the issue.
The Reasoning Tax in Vision-Language Models
The degradation of reasoning performance during fine-tuning has emerged as a significant challenge in the development of VLMs. The researchers hypothesized that this decline is linked to disrupted access to depth-wise representations, meaning the intermediate representations produced at different layers of the model.
Key Findings
- Through their investigation, the researchers discovered that even a fixed cross-depth aggregation technique could significantly restore reasoning capabilities.
- This finding suggests that maintaining access to cross-depth representations is a critical factor that has been overlooked in traditional VLM fine-tuning approaches.
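To make the idea of fixed cross-depth aggregation concrete, here is a minimal sketch in plain Python. It is an illustration only, not the study's implementation: the function name `aggregate_fixed`, the uniform weights, and the toy layer states are all assumptions chosen for the example. The point it shows is that the mixing weights are set once and do not depend on the input.

```python
def aggregate_fixed(layer_states, weights):
    """Mix per-layer hidden vectors with fixed, input-independent weights.

    layer_states: list of L vectors (one per layer), each a list of d floats.
    weights: list of L floats, one per layer (hypothetical choice, not
             taken from the study).
    """
    if len(layer_states) != len(weights):
        raise ValueError("need exactly one weight per layer")
    dim = len(layer_states[0])
    mixed = [0.0] * dim
    for state, weight in zip(layer_states, weights):
        for i, value in enumerate(state):
            mixed[i] += weight * value
    return mixed


# Toy example: 3 layers, hidden size 2, uniform weights over depth.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
uniform = [1.0 / 3.0] * 3
print(aggregate_fixed(states, uniform))
```

Because the weights are constants, every input sees the same blend of layers. That is the limitation the input-adaptive variant is meant to remove.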
Introduction of Input-Adaptive Depth Aggregation (IADA)
Building upon their initial findings, the researchers introduced a novel mechanism known as Input-Adaptive Depth Aggregation (IADA). This approach is designed to enhance the cross-depth retrieval process by making it:
- Input-Adaptive: Adjusts based on the specific input data, allowing for more tailored processing.
- Modality-Aware: Considers the different modalities involved, improving the integration of visual and textual information.
- Efficiently Parameterized: Utilizes a low-rank bottleneck to keep the additional parameters minimal, ensuring computational efficiency.
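The three properties above can be sketched together in a few lines of plain Python. This is a hedged illustration under stated assumptions, not the mechanism as published: the gate layout (hidden size to rank to number of layers), the `tanh` nonlinearity, the softmax over layers, the use of the last layer's state as the query, and every name in the code are hypothetical choices made for the example.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def make_gate(dim, rank, num_layers, rng):
    """Low-rank gate: dim -> rank -> num_layers (hypothetical layout)."""
    down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(rank)]
    up = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(num_layers)]
    return {"down": down, "up": up}


def depth_weights(gate, query):
    """Input-adaptive: the per-layer weights depend on the query vector."""
    hidden = [math.tanh(h) for h in matvec(gate["down"], query)]
    return softmax(matvec(gate["up"], hidden))


def aggregate_adaptive(layer_states, query, modality, gates):
    """Modality-aware: select the gate by token type, then mix the layers."""
    weights = depth_weights(gates[modality], query)
    dim = len(layer_states[0])
    return [sum(w * state[i] for w, state in zip(weights, layer_states))
            for i in range(dim)]


rng = random.Random(0)
dim, rank, num_layers = 4, 2, 3  # toy sizes, not the model's real dimensions
gates = {m: make_gate(dim, rank, num_layers, rng) for m in ("vision", "text")}
layer_states = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(num_layers)]
query = layer_states[-1]  # assumption: the last layer's state drives the gate
mixed = aggregate_adaptive(layer_states, query, "vision", gates)

# Parameters added by the two toy gates: 2 * (rank*dim + num_layers*rank).
extra_params = 2 * (rank * dim + num_layers * rank)
print(len(mixed), extra_params)
```

In this sketch the bottleneck rank controls the added parameter count, which grows with `rank * (dim + num_layers)` per gate. Whether the published method uses this exact layout is not stated in the summary above.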
Performance Improvements with IADA
The researchers evaluated IADA on the Qwen3-VL-2B model and reported the following results:
- Average Reasoning Score Improvement: The average reasoning score increased by 9.5 points.
- Average Perception Score Improvement: The average perception score increased by 3.3 points.
- Parameter Efficiency: These improvements required only 0.14 million additional parameters.
Conclusion
Input-Adaptive Depth Aggregation is a step toward mitigating the reasoning tax associated with fine-tuning vision-language models. By preserving cross-depth access and making that access adaptive to the input, the method improved both reasoning and perception in the reported experiments while adding very few parameters.
