MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network
arXiv:2603.29291v1
Abstract
Composed Image Retrieval (CIR) uses a reference image together with modification text to retrieve target images that reflect the modifications described in the text. However, existing CIR methods have two main limitations:
- Frequency Bias: Models favor frequently occurring modifications, which leads to “Rare Sample Neglect,” where infrequently represented modifications are overlooked.
- Susceptibility to Interference: Similarity scores can be adversely affected by hard negative samples and noise, complicating the retrieval process.
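To make the retrieval setup concrete, the following is a minimal sketch of how a composed query can be scored against a gallery. It is not MELT's method: the element-wise sum used for fusion, the function names, and the toy 3-dimensional embeddings are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compose_query(image_emb, text_emb):
    """Hypothetical fusion: element-wise sum of the two embeddings."""
    return [a + b for a, b in zip(image_emb, text_emb)]

def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending similarity to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

# Toy embeddings, made up for illustration only.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = [
    [1.0, 0.0, 0.0],  # resembles the reference but ignores the text
    [1.0, 1.0, 0.0],  # reference plus the requested modification
    [0.0, 0.0, 1.0],  # unrelated image
]

query = compose_query(reference_image, modification_text)
ranking = rank_gallery(query, gallery)
print(ranking)  # [1, 0, 2]
```

In this toy gallery, the image that merely resembles the reference ranks second. Candidates of that kind, close to the query but not the target, are what the second limitation refers to as hard negatives.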
Challenges in Composed Image Retrieval
To address these limitations, we identify two core challenges:
- Asymmetric Rare Semantic Localization: Rare semantic modifications must be accurately identified and prioritized within the multimodal context.
- Robust Similarity Estimation: Similarity scores must remain reliable, particularly in the presence of hard negative samples.
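As an illustration of why hard negatives complicate similarity estimation, the sketch below flags non-target candidates whose score is within a margin of the target's score or above it. The function name, the margin value, and the scores are hypothetical and do not come from the paper.

```python
def hard_negatives(scores, target_index, margin=0.05):
    """Return indices of non-target candidates scoring within
    `margin` of the target's score, or higher than it."""
    threshold = scores[target_index] - margin
    return [
        i for i, s in enumerate(scores)
        if i != target_index and s >= threshold
    ]

# Made-up similarity scores for five gallery images; index 0 is the target.
scores = [0.91, 0.88, 0.42, 0.93, 0.10]
target_index = 0

hard = hard_negatives(scores, target_index)
print(hard)  # [1, 3]
```

Candidate 3 outscores the true target, so a ranking built on these raw scores would place the wrong image first.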
Introducing MELT
To address these challenges, we introduce the Modification frEquentation-rarity baLance neTwork (MELT). The framework improves CIR through the following mechanisms:
- Increased Attention to Rare Modifications: MELT assigns greater attention to rare modification semantics so that they are not overlooked during retrieval.
- Diffusion-based Denoising: MELT uses diffusion-based denoising to reduce the influence of hard negative samples with high similarity scores, yielding more reliable similarity estimates.
- Enhanced Multimodal Fusion: The image and text modalities are fused more effectively, resulting in more coherent matching.
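The abstract does not specify how rarity is measured or how attention is reassigned. As one simple way to picture "greater attention to rare modifications," the sketch below weights each token of a modification text by a smoothed inverse document frequency computed over a small corpus of modification texts. The IDF formula, the function name, and the corpus are assumptions for illustration, not MELT's mechanism.

```python
import math
from collections import Counter

def rarity_weights(corpus, text):
    """Weight each token of `text` by smoothed inverse document
    frequency over `corpus`, normalized so the weights sum to 1."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each token once per document.
        doc_freq.update(set(doc.split()))
    tokens = text.split()
    idf = [math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in tokens]
    total = sum(idf)
    return [(t, w / total) for t, w in zip(tokens, idf)]

# Hypothetical modification texts; "paisley" appears only once.
corpus = [
    "make it red",
    "make it blue",
    "make it red and longer",
    "add a paisley pattern",
]

weights = rarity_weights(corpus, "make it paisley")
for token, weight in weights:
    print(f"{token}: {weight:.3f}")
```

Here the rare token "paisley" receives a larger share of the weight than the frequent tokens "make" and "it", which is the qualitative behavior the first mechanism describes.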
Experimental Validation
Extensive experiments on two CIR benchmarks show that MELT outperforms existing methods, validating the proposed approach.
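The abstract does not state which metrics are reported. Recall@K is a common choice for retrieval benchmarks and is assumed here purely for illustration; the rankings and targets below are made up and are not results from the paper.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target appears in the top-k
    entries of its ranking."""
    hits = sum(
        1 for ranking, target in zip(rankings, targets)
        if target in ranking[:k]
    )
    return hits / len(targets)

# One ranked list of gallery indices per query (toy data).
rankings = [
    [3, 1, 2],
    [0, 2, 1],
    [2, 0, 1],
]
# Index of the correct target image for each query.
targets = [1, 1, 2]

r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r3 = recall_at_k(rankings, targets, 3)
```

With this toy data, only the third query places its target first, the first query finds it within the top two, and every target appears within the top three.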
Conclusion
MELT addresses two critical limitations of composed image retrieval: frequency bias and the influence of hard negative samples. The source code is available in the project's GitHub repository.
