Steering Sparse Autoencoder Latents to Control Dynamic Head Pruning in Vision Transformers
arXiv:2603.26743v1 (cross-listed)
Abstract
Dynamic head pruning in Vision Transformers (ViTs) improves model efficiency by removing redundant attention heads. However, current pruning policies are difficult to interpret and control. We introduce a framework that combines Sparse Autoencoders (SAEs) with dynamic pruning. It relies on the ability of SAEs to decompose dense embeddings into sparse latents that are interpretable and can be steered.
Introduction
Vision Transformers have reshaped computer vision by applying the self-attention mechanism to images. However, their large number of attention heads often leads to computational inefficiency. Existing dynamic pruning methods offer little transparency, so researchers and practitioners struggle to understand what drives a given pruning decision. Our research addresses this gap with an approach that improves both the efficiency and the interpretability of pruning strategies.
Methodology
Our approach trains a Sparse Autoencoder on the final-layer residual embedding of the ViT. We then amplify the resulting sparse latents, using different strategies, to influence the pruning decisions made by the model. The key strategies we explored are:
- Per-class Steering: This strategy identifies compact, class-specific subsets of heads that help maintain classification accuracy.
- Latent Amplification: This strategy scales selected sparse latents to shift the model’s decisions about which heads to retain and which to prune.
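The steering idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU encoder, the linear decoder, the threshold-based `head_mask` gate, the toy dimensions, the random weights, and the chosen latent indices are all hypothetical and stand in for a trained SAE and a trained pruning policy.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Sparse latents z = ReLU(W_enc x + b_enc) (assumed SAE form)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the embedding from the latents."""
    return [a + b for a, b in zip(matvec(W_dec, z), b_dec)]

def steer(z, latent_ids, gain):
    """Amplify the selected latents by `gain`; leave the others untouched."""
    return [zi * gain if i in latent_ids else zi for i, zi in enumerate(z)]

def head_mask(embedding, W_gate, threshold=0.0):
    """Hypothetical pruning gate: keep head h if its score exceeds threshold."""
    return [1 if s > threshold else 0 for s in matvec(W_gate, embedding)]

def rand_matrix(rows, cols, scale=0.5):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
D, K, H = 8, 16, 6  # toy sizes: embedding dim, SAE latents, attention heads
W_enc, b_enc = rand_matrix(K, D), [0.0] * K
W_dec, b_dec = rand_matrix(D, K), [0.0] * D
W_gate = rand_matrix(H, D)

x = [random.gauss(0, 1) for _ in range(D)]   # stand-in for the final-layer embedding
z = sae_encode(x, W_enc, b_enc)
steered_ids = {3, 7}                          # hypothetical latents to amplify
z_steered = steer(z, steered_ids, gain=4.0)

mask_before = head_mask(sae_decode(z, W_dec, b_dec), W_gate)
mask_after = head_mask(sae_decode(z_steered, W_dec, b_dec), W_gate)
head_usage = sum(mask_after) / H              # fraction of heads kept active

print("mask before steering:", mask_before)
print("mask after steering: ", mask_after)
print("head usage after steering:", head_usage)
```

Under this reading, per-class steering would amount to choosing `steered_ids` and `gain` separately for each class, then checking which heads remain active and whether accuracy on that class is preserved.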
Results
Our experimental results show that sparse latent features can control dynamic pruning. In one notable example, per-class steering improved accuracy while also reducing head usage: accuracy rose from 76% to 82% and head usage fell from 0.72 to 0.33, using heads h2 and h5. These findings indicate that our approach connects pruning efficiency with mechanistic interpretability in ViTs.
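To put the reported figures in relative terms, the short calculation below derives the accuracy gain in percentage points and the relative reduction in head usage. It uses only the four numbers quoted above; the variable names are illustrative.

```python
# Reported figures for the per-class steering example.
acc_before, acc_after = 0.76, 0.82
usage_before, usage_after = 0.72, 0.33

# Absolute accuracy gain, in percentage points.
acc_gain_pp = (acc_after - acc_before) * 100

# Relative reduction in head usage.
usage_reduction = (usage_before - usage_after) / usage_before

print(f"accuracy gain: {acc_gain_pp:.1f} percentage points")
print(f"head usage reduction: {usage_reduction:.1%}")
```

In this example, head usage drops by a little more than half while accuracy gains six percentage points.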
Conclusion
Integrating Sparse Autoencoders with dynamic head pruning is a promising way to make Vision Transformers more efficient while keeping their pruning behavior interpretable. By enabling class-specific control of pruning decisions, our framework allows for more efficient models without sacrificing accuracy. Future work will focus on refining these techniques and exploring their applicability across a wider range of computer vision tasks.
Future Directions
As we continue to develop this framework, we aim to:
- Explore additional strategies for latent amplification.
- Test the framework on a broader range of datasets.
- Investigate the implications of this approach in real-world applications.
