The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment
Summary: arXiv:2604.00279v1 Announce Type: cross
Abstract: Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact.
This article examines the geometric relationship between the image and text modalities in Vision-Language Models, analyzing what the modality gap actually consists of and how it affects cross-modal tasks.
Understanding the Modality Gap
The modality gap can be decomposed into two critical components:
- Centroid Gap: The offset between the centroid of the image embeddings and the centroid of the text embeddings in the shared space.
- Distribution Gap: The mismatch between the underlying distributions of the image and text embeddings, which is the stronger predictor of performance in cross-modal tasks.
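The two components can be illustrated with a minimal NumPy sketch. The centroid gap below follows the definition above directly. The distribution measure is only a hypothetical proxy (the Frobenius distance between the two covariance matrices), chosen because this summary does not state how the Distribution Gap is computed. The synthetic embeddings and all function names are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def centroid_gap(img, txt):
    """Euclidean distance between the image centroid and the text centroid."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def distribution_gap_proxy(img, txt):
    """Hypothetical proxy, NOT the paper's metric: Frobenius distance between
    the covariance matrices of the two centered embedding sets."""
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    cov_img = img_c.T @ img_c / (len(img) - 1)
    cov_txt = txt_c.T @ txt_c / (len(txt) - 1)
    return float(np.linalg.norm(cov_img - cov_txt, ord="fro"))


# Synthetic stand-ins for image and text embeddings (not real model outputs).
rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(512, 16)) + 2.0)
txt = l2_normalize(rng.normal(size=(512, 16)) * np.linspace(0.2, 2.0, 16) - 2.0)

# A pure translation that moves the text centroid onto the image centroid.
txt_shifted = txt + (img.mean(axis=0) - txt.mean(axis=0))

print("centroid gap before shift:", centroid_gap(img, txt))
print("centroid gap after shift: ", centroid_gap(img, txt_shifted))
print("proxy before shift:", distribution_gap_proxy(img, txt))
print("proxy after shift: ", distribution_gap_proxy(img, txt_shifted))
```

Because covariance is invariant to translation, shifting one modality onto the other's centroid closes the centroid gap while leaving this proxy unchanged. That mirrors the abstract's point that centroid-focused post-processing can leave the distributional mismatch intact.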
Our research demonstrates that while existing methods focus on reducing the Centroid Gap, they often overlook the Distribution Gap. This oversight can lead to misleading conclusions regarding the effectiveness of cross-modal alignment. In our analysis, we found that:
- The Distribution Gap is the true predictor of cross-modal task quality, with $R^2 = 0.986$.
- The commonly used Raw Gap tracks task performance far less accurately, with $R^2 = 0.691$.
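For readers unfamiliar with the statistic, the sketch below shows how an $R^2$ value of this kind can be computed, assuming a simple least-squares line relating a gap measure to a task score. The helper name `r_squared` and the toy numbers are assumptions for illustration, not data from the paper, and the paper's exact regression setup is not given in this summary.

```python
import numpy as np


def r_squared(x, y):
    """Coefficient of determination of a least-squares line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial fit
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Toy data only: a gap measure that tracks the score exactly,
# and one that tracks it weakly.
print(r_squared([0, 1, 2, 3], [1, 3, 5, 7]))
print(r_squared([0, 1, 2, 3], [0, 1, 0, 1]))
```

A value near 1 means the gap measure explains almost all of the variation in task quality across models, while a lower value means much of that variation is left unexplained.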
Introducing TPC-CMA
Motivated by these findings, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly targets both the Centroid Gap and the Distribution Gap. Key features of TPC-CMA include:
- Joint Mitigation: The framework concurrently addresses centroid offsets and reshapes the distributional structure of the embeddings.
- Three-Phase Curriculum: A gradient-aware scheduling technique is employed to progressively introduce alignment during training, facilitating stable optimization.
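The idea of introducing alignment progressively can be sketched with a simple three-phase weight schedule. This is a simplified stand-in, not the paper's method: it depends only on training progress and is not gradient-aware, and the phase boundaries `warmup_frac` and `ramp_frac` are hypothetical parameters invented for this example. Only the name $\alpha_{\text{target}}$ comes from the text.

```python
def alignment_weight(step, total_steps, alpha_target,
                     warmup_frac=0.2, ramp_frac=0.5):
    """Weight of the alignment term at a given training step.

    Phase 1 (warm-up): weight is 0, so no alignment pressure is applied.
    Phase 2 (ramp):    weight rises linearly from 0 to alpha_target.
    Phase 3 (hold):    weight stays at alpha_target.
    The phase fractions are hypothetical, not values from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_frac < 0 or ramp_frac <= 0 or warmup_frac + ramp_frac > 1:
        raise ValueError("phase fractions must fit inside [0, 1]")

    progress = min(max(step / total_steps, 0.0), 1.0)
    if progress < warmup_frac:
        return 0.0
    if progress < warmup_frac + ramp_frac:
        return alpha_target * (progress - warmup_frac) / ramp_frac
    return alpha_target


# Print the weight at a few points of a 100-step run with alpha_target = 0.5.
for step in (0, 20, 45, 70, 100):
    print(step, alignment_weight(step, 100, alpha_target=0.5))
```

In a training loop, a weight like this would multiply the alignment loss before it is added to the base objective, so the model starts from its original behavior and is pulled toward alignment gradually.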
Experimental Results
We evaluate TPC-CMA at two alignment strengths. With a target alignment parameter of $\alpha_{\text{target}}{=}0.05$, the modality gap is reduced by 66.6% with only a 4.84% drop in accuracy. With stronger alignment ($\alpha_{\text{target}}{=}0.5$), the gap reduction reaches 82.3% and downstream metrics improve:
- Clustering Adjusted Rand Index (ARI) improved from 0.318 to 0.516.
- Captioning CIDEr scores increased by 57.1% over the original model.
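To make the clustering metric concrete, here is a minimal pure-Python sketch of the standard Adjusted Rand Index, which scores the agreement between two labelings of the same items and corrects for chance. The function name and the toy labels are assumptions for illustration. How the paper assigns cluster labels to embeddings is not described in this summary.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Pairs of items grouped together in both labelings (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together within each labeling on its own.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Same partition with renamed clusters, then a partition that cuts across it.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

An ARI of 1 means identical partitions and values near 0 mean chance-level agreement. In relative terms, the reported rise from 0.318 to 0.516 is a gain of roughly 62%, which puts it on the same footing as the relative CIDEr improvement.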
Our code and pre-trained models will be made publicly available upon acceptance.
