Ge$^\text{2}$mS-T: Multi-Dimensional Grouping for Ultra-High Energy Efficiency in Spiking Transformer
arXiv:2604.08894v1 (Announce Type: cross)
Abstract: Spiking Neural Networks (SNNs) offer superior energy efficiency over Artificial Neural Networks (ANNs). However, when applied to Spiking Vision Transformers (S-ViTs), they fall significantly short on both training and inference metrics. Existing paradigms, including ANN-SNN conversion and Spatial-Temporal Backpropagation (STBP), suffer from inherent limitations that preclude concurrent optimization of memory, accuracy, and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture that applies grouped computation across the temporal, spatial, and network-structure dimensions.
Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, which enables lossless conversion with constant training overhead and precise regulation of spike patterns. We also develop Group-wise Spiking Self-Attention (GW-SSA), which reduces computational complexity through multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method achieves superior performance with ultra-high energy efficiency on challenging benchmarks.
To the best of our knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability, and energy budget in S-ViTs.
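The abstract does not give the ExpG-IF equations. As a rough illustration of why exponential spike coding can approach lossless conversion in only a few time steps, the Python sketch below compares a plain rate-coded IF neuron with a hypothetical power-of-two coded IF neuron whose firing threshold halves at every step. The functions `exp_coded_if` and `rate_coded_if`, the halving threshold, and the soft reset are assumptions made for illustration, not the authors' formulation, and the grouping aspect of ExpG-IF is not modeled.

```python
from typing import List, Tuple


def exp_coded_if(x: float, theta: float = 1.0, T: int = 4) -> Tuple[List[int], float]:
    """Hypothetical exponential-coding IF neuron (illustration only).

    The input x is integrated into the membrane once; at step t the neuron
    compares the membrane with a threshold that halves every step
    (theta / 2**(t + 1)), fires a binary spike if it is reached, and subtracts
    that threshold (soft reset). The spike train is the T-bit binary expansion
    of x / theta, so the decoding error is at most theta / 2**T.
    """
    v = min(max(x, 0.0), theta)  # integrate the input; clip to [0, theta]
    spikes: List[int] = []
    decoded = 0.0
    for t in range(T):
        thr = theta / 2 ** (t + 1)   # exponentially decaying threshold
        s = 1 if v >= thr else 0
        v -= s * thr                 # soft reset: subtract instead of zeroing
        spikes.append(s)
        decoded += s * thr           # each spike carries weight theta / 2**(t + 1)
    return spikes, decoded


def rate_coded_if(x: float, theta: float = 1.0, T: int = 4) -> float:
    """Baseline IF neuron with a fixed threshold: the spike count over T steps
    encodes x, so only T + 1 output levels are available."""
    x = min(max(x, 0.0), theta)
    v, count = 0.0, 0
    for _ in range(T):
        v += x                       # integrate the same input every step
        if v >= theta:
            v -= theta               # soft reset
            count += 1
    return count * theta / T         # rate decoding


print(exp_coded_if(0.6875))   # -> ([1, 0, 1, 1], 0.6875)
print(rate_coded_if(0.6875))  # -> 0.5
```

With T = 4, the power-of-two code distinguishes 2^4 = 16 levels, while the rate-coded neuron distinguishes only T + 1 = 5; for x = 0.6875 the former recovers the value exactly and the latter returns 0.5.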
Introduction
As demand for energy-efficient AI models grows, Spiking Neural Networks (SNNs) have emerged as a promising alternative to traditional Artificial Neural Networks (ANNs). Applying them to Spiking Vision Transformers (S-ViTs), however, remains difficult: existing training paradigms such as ANN-SNN conversion and Spatial-Temporal Backpropagation (STBP) cannot optimize memory, accuracy, and energy consumption at the same time.
Overview of Key Innovations
- Grouped-Exponential-Coding-based IF (ExpG-IF) Model: Enables lossless conversion with constant training overhead while allowing precise regulation of spike patterns.
- Group-wise Spiking Self-Attention (GW-SSA): Reduces computational complexity through multi-scale token grouping and multiplication-free operations, operating within a hybrid attention-convolution framework.
- Multi-Dimensional Grouped Computation: Grouping across the temporal, spatial, and network-structure dimensions addresses memory overhead, learning capability, and energy budget together; the authors describe this as the first systematic treatment of all three in S-ViTs.
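The summary describes GW-SSA only in terms of its ingredients. A minimal Python sketch of how token grouping and binary spikes can make attention multiplication-free might look as follows; it assumes binary query, key, and value spike matrices, no softmax, and contiguous token groups. The names `grouped_spiking_attention` and `multi_scale_attention` are hypothetical, summing outputs over several group sizes is only one possible reading of "multi-scale token grouping", and the convolution branch of the hybrid framework is omitted.

```python
from typing import List

Matrix = List[List[int]]


def spike_dot(a: List[int], b: List[int]) -> int:
    # For 0/1 spike vectors the dot product is the number of positions where
    # both spikes are 1: a logical AND followed by accumulation, no multiply.
    return sum(1 for x, y in zip(a, b) if x and y)


def grouped_spiking_attention(q: Matrix, k: Matrix, v: Matrix, group_size: int) -> Matrix:
    """Hypothetical group-wise spiking attention (illustration only).

    Tokens are split into contiguous groups of `group_size`; each query attends
    only to keys in its own group, so the cost is O(N * group_size * d) instead
    of O(N^2 * d). No softmax is applied, and because q, k, v are binary, the
    scores and outputs are built from additions only.
    """
    n, d = len(q), len(v[0])
    out: Matrix = [[0] * d for _ in range(n)]
    for start in range(0, n, group_size):
        group = range(start, min(start + group_size, n))
        for i in group:
            for j in group:
                score = spike_dot(q[i], k[j])  # integer co-activation count
                for c in range(d):
                    if v[j][c]:
                        out[i][c] += score      # add the score where the value spikes
    return out


def multi_scale_attention(q: Matrix, k: Matrix, v: Matrix, group_sizes: List[int]) -> Matrix:
    # Hypothetical multi-scale mixing: sum the outputs for several group sizes.
    n, d = len(q), len(v[0])
    total: Matrix = [[0] * d for _ in range(n)]
    for g in group_sizes:
        part = grouped_spiking_attention(q, k, v, g)
        for i in range(n):
            for c in range(d):
                total[i][c] += part[i][c]
    return total


q = [[1, 0], [0, 1], [1, 1], [0, 0]]
k = [[1, 1], [0, 1], [1, 0], [1, 1]]
v = [[1, 0], [0, 1], [1, 1], [1, 0]]
print(grouped_spiking_attention(q, k, v, 2))   # -> [[1, 0], [1, 1], [3, 1], [0, 0]]
print(multi_scale_attention(q, k, v, [2, 4]))  # -> [[4, 1], [3, 2], [8, 3], [0, 0]]
```

Restricting each query to its own group trades global context for lower cost; setting the group size to the full token count (4 in this example) recovers ordinary softmax-free attention over all tokens.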
Experimental Results and Performance
Experiments indicate that the Ge$^\text{2}$mS-T architecture achieves superior performance with ultra-high energy efficiency on challenging benchmarks, supporting the effectiveness of the grouped computation strategy.
Conclusion
In conclusion, the Ge$^\text{2}$mS-T architecture advances the development of Spiking Vision Transformers by applying multi-dimensional grouped computation to jointly address memory overhead, learning capability, and energy budget. These findings may inspire further research on optimizing SNNs and broadening their practical applications across domains.
