CGC: Enhancing Fine-Grained Multi-Image Understanding

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Multimodal Large Language Models (MLLMs) have advanced rapidly across many AI applications. However, fine-grained multi-image understanding remains a hurdle: MLLMs often struggle with spatial hallucination, attention leakage, and failures of object constancy. A new framework, Compositional Grounded Contrast (CGC), aims to address these challenges while minimizing reliance on expensive human annotations or extensive chain-of-thought data generation.

Overview of Compositional Grounded Contrast (CGC)

CGC is designed to improve the multi-image understanding of MLLMs by building its training framework on existing single-image grounding annotations. Its core components are:

Inter-Image Contrast: Introduces semantically decoupled distractor contexts for cross-image discrimination. Contrasting images against one another helps MLLMs differentiate between similar objects and contexts.
Intra-Image Contrast: Targets object constancy by using correlated cross-view samples. Reinforcing the consistency of an object across different views improves recognition of fine-grained details.
Rule-Based Spatial Reward: Integrated within the GRPO framework, this reward improves source-image attribution and spatial alignment. CGC also adopts a Think-before-Grounding paradigm, which promotes structurally valid outputs.
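
To make the components above more concrete, here is a minimal Python sketch of how a multi-image sample might be composed from a single-image grounding annotation and how a rule-based reward could score a model output. It is an illustration under stated assumptions, not CGC's actual implementation: the `Grounded` record, the `compose_sample` and `spatial_reward` helpers, the `<think>`/`<answer>` output format, and the reward weights (0.2, 0.3, 0.5) are all hypothetical.

```python
import random
import re
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Grounded:
    """A single-image grounding annotation (hypothetical schema)."""
    image_id: str
    phrase: str
    box: Box


def compose_sample(target: Grounded, distractor_ids: list[str],
                   rng: random.Random) -> dict:
    """Place the annotated image among distractor images in random order."""
    image_ids = [target.image_id, *distractor_ids]
    rng.shuffle(image_ids)
    return {
        "images": image_ids,
        "query": target.phrase,
        "source_index": image_ids.index(target.image_id),
        "box": target.box,
    }


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Assumed output format: reasoning first, then image index and box.
PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>\s*image=(\d+)\s+box=\[([^\]]+)\]\s*</answer>$",
    re.DOTALL,
)


def spatial_reward(output: str, sample: dict) -> float:
    """Score format validity, source-image attribution, and box overlap."""
    m = PATTERN.match(output.strip())
    if m is None:
        return 0.0  # malformed output earns nothing
    try:
        coords = [float(v) for v in m.group(2).split(",")]
    except ValueError:
        return 0.0
    if len(coords) != 4:
        return 0.0
    reward = 0.2  # assumed weight for a well-formed answer
    if int(m.group(1)) != sample["source_index"]:
        return reward  # wrong source image: no spatial credit
    return reward + 0.3 + 0.5 * iou(tuple(coords), sample["box"])


sample = compose_sample(
    Grounded("a.jpg", "red mug", (10, 10, 50, 50)),
    ["b.jpg", "c.jpg"],
    random.Random(0),
)
idx = sample["source_index"]
good = f"<think>The mug is in image {idx}.</think><answer>image={idx} box=[10,10,50,50]</answer>"
print(spatial_reward(good, sample))
```

In this sketch, box overlap is only rewarded when the model first attributes the object to the correct source image, so a well-placed box in a distractor image earns no spatial credit. In a GRPO-style setup, scalar rewards like these would typically be compared across a group of sampled outputs for the same prompt.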

Experimental Results

Initial experiments show that CGC achieves state-of-the-art results on several fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The improvements also carry over to broader multimodal understanding and reasoning tasks.

Specifically, CGC yields the following gains over the Qwen3-VL-8B base model:

MathVista: +2.90
MuirBench: +2.88
MMStar: +1.93
MMMU: +1.77
BLINK: +1.69

Conclusion

Compositional Grounded Contrast is an advance in multi-image understanding for MLLMs. By leveraging existing annotations and introducing inter-image and intra-image contrastive methods, CGC addresses key challenges that have limited previous models. The initial benchmark results indicate that CGC improves multi-image understanding and also benefits broader multimodal understanding and reasoning. Frameworks like CGC point toward more robust and accurate models for complex visual and textual information.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
