Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
Source: arXiv:2604.15735v1 (cross-listed)
Abstract: Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. While hand-drawn sketches capture complex structural contours, they lack color and texture, which text effectively provides despite omitting spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By synergizing the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance.
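The complementarity described in the abstract can be illustrated with a toy example. The sketch below is not the STBIR method; it assumes a simple weighted late fusion of two embeddings and cosine-similarity ranking, with hand-made 4-dimensional vectors in which the first two dimensions stand for shape and the last two for color. The names `fuse_query`, `rank_gallery`, and `alpha`, and all of the vectors, are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def fuse_query(sketch_emb, text_emb, alpha=0.5):
    """Assumed late fusion: weighted sum of the normalized sketch and text embeddings."""
    s, t = l2_normalize(sketch_emb), l2_normalize(text_emb)
    return l2_normalize([alpha * a + (1 - alpha) * b for a, b in zip(s, t)])

def rank_gallery(query, gallery):
    """Return gallery ids sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)

# Toy embeddings: dims 0-1 encode shape, dims 2-3 encode color (illustrative only).
gallery = {
    "red_sneaker":  [1.0, 0.0, 1.0, 0.0],
    "blue_sneaker": [1.0, 0.0, 0.0, 1.0],
    "red_boot":     [0.0, 1.0, 1.0, 0.0],
}
sketch_emb = [1.0, 0.0, 0.0, 0.0]  # sneaker outline, carries no color information
text_emb = [0.0, 0.0, 1.0, 0.0]    # the word "red", carries no shape information

ranking = rank_gallery(fuse_query(sketch_emb, text_emb), gallery)
print(ranking[0])
```

In this toy setup the sketch alone scores both sneakers equally and the text alone scores both red items equally; only the fused query singles out the red sneaker.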
Key Innovations of the STBIR Framework
The STBIR framework combines three components, each aimed at a specific difficulty in fine-grained image retrieval:
- Robustness Enhancement Module: A robustness enhancement module driven by curriculum learning improves the model’s performance on queries of varying quality, so that retrieval degrades less when the input sketch or text is of low quality.
- Feature Space Optimization: A feature space optimization module based on category knowledge strengthens the model’s representational power, helping it organize the relationships among image features.
- Cross-Modal Feature Alignment: A multi-stage cross-modal feature alignment mechanism addresses the difficulty of aligning features from sketches and textual descriptions, so that the complementary information in the two modalities is actually used.
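This summary does not say how the modules are implemented. As a rough, non-authoritative sketch of two of the ideas named in the list, the Python below shows a generic easy-to-hard curriculum schedule and a standard InfoNCE-style contrastive loss, which is a common stand-in for cross-modal alignment and not necessarily what the paper uses. The function names, the per-sample difficulty scores, `start_frac`, and `temperature` are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, images, temperature=0.1):
    """Contrastive alignment loss: queries[i] should match images[i] and no other image.

    Inputs are assumed to be L2-normalized embeddings (lists of floats).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, img) / temperature for img in images]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]  # negative log-probability of the true pair
    return total / len(queries)

def curriculum_subset(samples, difficulty, epoch, total_epochs, start_frac=0.3):
    """Easy-to-hard pacing: expose a growing, difficulty-sorted share of the data.

    `difficulty` is a hypothetical per-sample score (lower means easier),
    for example an estimate of how noisy a sketch or description is.
    """
    ordered = [s for _, s in sorted(zip(difficulty, samples), key=lambda p: p[0])]
    progress = epoch / max(1, total_epochs - 1)
    frac = min(1.0, start_frac + (1.0 - start_frac) * progress)
    return ordered[:max(1, math.ceil(frac * len(ordered)))]

# Correctly paired embeddings give a much lower loss than mismatched ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])

# The first epoch sees only the easiest samples; the last epoch sees all of them.
samples = ["a", "b", "c", "d"]
difficulty = [0.9, 0.1, 0.5, 0.3]
first_epoch = curriculum_subset(samples, difficulty, epoch=0, total_epochs=3)
last_epoch = curriculum_subset(samples, difficulty, epoch=2, total_epochs=3)
```

A linear pacing function is only one possible schedule; the point is that low-quality queries enter training gradually instead of all at once.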
Benchmark Dataset
To validate the STBIR framework, the authors constructed a fine-grained STBIR benchmark dataset. It is intended to provide data support for subsequent related studies and to allow the proposed framework to be tested against existing methods.
Experimental Results
According to the authors, extensive experiments show that STBIR significantly outperforms state-of-the-art methods on fine-grained image retrieval. They attribute the gains to the proposed modules and to the combination of sketch and text modalities.
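This summary does not state which metrics the paper reports. Retrieval systems are commonly compared with Recall@K, the fraction of queries whose correct image appears among the top K results, so a minimal sketch of that metric is given here under that assumption. The query ids, image ids, and rankings are made-up toy data, not results from the paper.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries whose correct image appears among the top-k results."""
    hits = sum(1 for qid, ranked in rankings.items() if ground_truth[qid] in ranked[:k])
    return hits / len(rankings)

# Toy data: each query maps to a ranked list of gallery image ids (best first).
rankings = {
    "q1": ["img_a", "img_b", "img_c"],
    "q2": ["img_c", "img_b", "img_a"],
}
ground_truth = {"q1": "img_a", "q2": "img_b"}

r1 = recall_at_k(rankings, ground_truth, k=1)  # only q1 has its match at rank 1
r2 = recall_at_k(rankings, ground_truth, k=2)  # both matches fall within the top 2
```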
Conclusion
Sketches and text each leave out what the other supplies, and STBIR is built around that observation: structural contours come from the sketch, color and texture from the description. The reported results suggest that combining the two improves fine-grained retrieval, with potential applications in digital art, design, and content-based image retrieval systems.
For further reading, the full research paper can be accessed on arXiv under the identifier 2604.15735v1.
