Discovering Object-Centric Features in Self-Supervised Vision Transformers

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Source: arXiv:2603.26127v1 (cross-listed)

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations, which leads to poor object localization. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information present in the local, patch-level interactions.

Key Findings

In our analysis, we computed inter-patch similarity using patch-level attention components (query, key, and value) across all layers. Our findings are as follows:

  • Object-centric properties: These are encoded in the similarity maps derived from all three attention components (q, k, v), unlike prior work that used only key features or the [CLS] token.
  • Distributed information: The object-centric information is distributed across the network rather than confined to the final layer, so object discovery can draw on heads from every layer instead of the last one alone.
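The inter-patch similarity described above can be illustrated with a small sketch. The summary does not say which similarity measure is used, so cosine similarity is an assumption here, and the function name `cosine_similarity_map` and the toy feature values are hypothetical. In practice the per-patch vectors would be the query, key, or value features of one attention head.

```python
import math

def cosine_similarity_map(features):
    """Return the N x N cosine-similarity matrix for N patch feature vectors."""
    normed = []
    for vec in features:
        # Guard against zero vectors so the division below is always defined.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

# Toy query features for 3 patches of one head (made-up values):
# patches 0 and 1 point in the same direction, patch 2 is orthogonal to both.
queries = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim_q = cosine_similarity_map(queries)
```

The same function can be applied unchanged to key or value features, giving one similarity map per component, per head, per layer.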

Introduction of Object-DINO

Based on these insights, we introduce Object-DINO, a training-free method designed to extract this distributed object-centric information. Object-DINO clusters attention heads across all layers based on their patch similarities, and in doing so automatically identifies the object-centric cluster corresponding to all objects in the input.
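As a rough illustration of clustering heads by their patch similarities, the sketch below flattens each head's similarity map into a descriptor and groups the descriptors with plain k-means. This is not the authors' procedure: the summary does not specify the clustering algorithm, the number of clusters, or how the object-centric cluster is selected, so `kmeans`, `cluster_heads`, `k=2`, and the toy maps are all assumptions made for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's k-means with a deterministic init (the first k points are the centers)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_heads(head_similarity_maps, k=2):
    """Cluster heads using their flattened patch-similarity maps as descriptors."""
    descriptors = [[v for row in sim for v in row] for sim in head_similarity_maps]
    return kmeans(descriptors, k)

# Toy 2x2 similarity maps for four heads (made-up values).
# Heads 0 and 2 treat the two patches as similar; heads 1 and 3 do not.
head_maps = [
    [[1.0, 0.9], [0.9, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
    [[1.0, 0.8], [0.8, 1.0]],
    [[1.0, 0.2], [0.2, 1.0]],
]
head_labels = cluster_heads(head_maps, k=2)
```

In a real model the list would hold one map per head per layer, so heads from early and late layers can land in the same cluster.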

Applications and Effectiveness

We demonstrate the effectiveness of Object-DINO through two applications:

  • Enhancing unsupervised object discovery: Our method improves CorLoc by +3.6 to +12.4, showing the value of using distributed object-centric information.
  • Mitigating object hallucination: In Multimodal Large Language Models, Object-DINO provides visual grounding, reducing object hallucination and improving the models’ reliability.
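For readers unfamiliar with the CorLoc metric mentioned in the first application, here is a minimal sketch. It assumes the common definition, the percentage of images whose single predicted box overlaps at least one ground-truth box with IoU above 0.5; the summary itself does not state the threshold, and the function names and box values are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box matches any ground-truth box."""
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) > threshold for gt in gts)
    )
    return 100.0 * hits / len(predictions)

# Two toy images (made-up boxes): the first prediction is exact, the second misses.
predicted_boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth_boxes = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
print(corloc(predicted_boxes, ground_truth_boxes))  # 50.0
```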

Conclusion

Our findings indicate that using distributed object-centric information can substantially improve performance on downstream tasks without additional training. Object-DINO shows that this information can be extracted from the attention heads of a self-supervised Vision Transformer across all layers, rather than from the final-layer [CLS] attention alone.


Lazarus Omolua (https://richlyai.com/blog)