Discovering Object-Centric Features in Self-Supervised Vision Transformers

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Source: arXiv:2603.26127v1 (cross-listed)

Abstract

Self-supervised Vision Transformers (ViTs) such as DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations that degrade object localization. The cause is that the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information present in the local, patch-level interactions.
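For readers unfamiliar with [CLS] attention maps, here is a minimal sketch of how one head's map can be computed with standard scaled dot-product attention. The function name `cls_attention_map`, the random inputs, and the 4x4 patch grid are illustrative assumptions, not code from the paper or from the DINO release.

```python
import numpy as np

def cls_attention_map(q_cls, keys, grid_shape):
    """Attention from the [CLS] query to every patch, for a single head.

    q_cls: (dim,) query vector of the [CLS] token.
    keys:  (1 + num_patches, dim) key vectors; row 0 is assumed to be [CLS].
    """
    scores = keys @ q_cls / np.sqrt(q_cls.shape[0])  # scaled dot products
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                # softmax over all tokens
    return weights[1:].reshape(grid_shape)           # drop [CLS], keep patches

# Toy inputs standing in for one head of the final layer (assumed shapes).
rng = np.random.default_rng(0)
grid_shape, head_dim = (4, 4), 8
keys = rng.normal(size=(1 + grid_shape[0] * grid_shape[1], head_dim))
q_cls = rng.normal(size=head_dim)
attn_map = cls_attention_map(q_cls, keys, grid_shape)
```

Because the map is a single row of the attention matrix, it reflects only how the image-level token weighs each patch, not how patches relate to one another.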

Key Findings

In our analysis, we computed inter-patch similarity using patch-level attention components (query, key, and value) across all layers. Our findings are as follows:

Object-centric properties: These are encoded in the similarity maps derived from all three attention components (q, k, v), whereas prior work used only key features or the [CLS] token.
Distributed information: The object-centric information is distributed across the network rather than confined to the final layer. This suggests that object discovery in Vision Transformers should draw on multiple layers, not only the last one.
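The inter-patch similarity described above can be sketched as follows. Cosine similarity is an assumption here, since this summary does not state which similarity measure the authors used; `patch_similarity` and the random features are likewise illustrative stand-ins for one head's real q, k, and v features.

```python
import numpy as np

def patch_similarity(features):
    """Cosine similarity between every pair of patches.

    features: (num_patches, dim) array holding one head's q, k, or v vectors.
    Returns a (num_patches, num_patches) similarity map.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

# Random features standing in for one attention head (assumed shapes).
rng = np.random.default_rng(0)
num_patches, head_dim = 16, 8
components = {name: rng.normal(size=(num_patches, head_dim)) for name in "qkv"}

# One similarity map per attention component.
sims = {name: patch_similarity(x) for name, x in components.items()}
```

In a real model, each layer and head would contribute its own three maps, which is what makes the information "distributed" in the sense used above.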

Introduction of Object-DINO

Based on these insights, we introduce Object-DINO, a training-free method designed to extract this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches. It then automatically identifies the object-centric cluster, which corresponds to all objects in the input.
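The head-clustering step can be illustrated with a toy example. Everything here is an assumption made for illustration: this summary does not name the clustering algorithm, so plain k-means with a deterministic farthest-first initialization is swapped in, each head is described by its flattened patch-similarity map, and the head maps are synthetic. How the object-centric cluster is then selected is not described in this summary, so that step is left out.

```python
import numpy as np

def farthest_first_init(points, k):
    """Deterministic init: start at points[0], then keep adding the farthest point."""
    centers = [points[0]]
    for _ in range(k - 1):
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[dist_to_nearest.argmax()])
    return np.stack(centers)

def kmeans(points, k, iters=50):
    """Plain k-means; returns one cluster label per point."""
    centers = farthest_first_init(points, k)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.stack([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic similarity maps for 12 heads over 9 patches (hypothetical data).
rng = np.random.default_rng(0)
num_patches = 9
mask = np.zeros(num_patches)
mask[:4] = 1.0  # pretend the first four patches cover an object
object_like = np.outer(mask, mask) + np.outer(1 - mask, 1 - mask)
other = np.eye(num_patches)  # heads where each patch resembles only itself
head_maps = [object_like + 0.05 * rng.normal(size=(9, 9)) for _ in range(6)]
head_maps += [other + 0.05 * rng.normal(size=(9, 9)) for _ in range(6)]

# Describe each head by its flattened similarity map, then cluster the heads.
descriptors = np.stack([m.ravel() for m in head_maps])
labels = kmeans(descriptors, k=2)
```

In this toy setup the six heads whose maps separate object patches from background fall into one cluster and the remaining six into the other.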

Applications and Effectiveness

We demonstrate the effectiveness of Object-DINO through two significant applications:

Enhancing unsupervised object discovery: Our method improves CorLoc by +3.6 to +12.4, showing the value of using distributed object-centric information.
Mitigating object hallucination: In Multimodal Large Language Models, Object-DINO provides visual grounding, effectively reducing instances of object hallucination and enhancing the models’ reliability.
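For context on the CorLoc metric mentioned above, here is a minimal sketch of how it is commonly computed in the unsupervised object discovery literature: the percentage of images whose predicted box overlaps some ground-truth box with IoU of at least 0.5. The boxes below are made up, and the exact evaluation protocol used by the authors is not given in this summary.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose single predicted box matches any ground-truth box."""
    hits = sum(
        any(iou(pred, gt) >= threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)

# Two hypothetical images: the first prediction is a hit, the second a miss.
predictions = [(10, 10, 50, 50), (0, 0, 20, 20)]
ground_truths = [[(12, 12, 48, 52)], [(30, 30, 60, 60)]]
score = corloc(predictions, ground_truths)  # 50.0
```

Read this way, a gain of +3.6 means the predicted box matches a ground-truth object in 3.6 more images per hundred.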

Conclusion

Our findings indicate that leveraging distributed object-centric information can substantially improve performance on downstream tasks without additional training. Object-DINO also suggests a direction for further research on self-supervised learning and object-centric property extraction in Vision Transformers.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
