Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Summary: arXiv:2604.19147v1
Abstract
Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism’s linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher-dimensional spaces.
This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer’s perplexity using up to 41.5% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.
Introduction
The Transformer architecture underpins modern natural language processing and many other AI domains, but growing an already trained model remains difficult. Standard practice is to train each larger model from scratch, which discards the representations the smaller model has learned and repeats compute that has already been spent.
Nexusformer Overview
Nexusformer addresses this limitation by changing how attention computes its $Q/K/V$ projections. The key innovations are:
- Nexus-Rank Layer: A three-stage nonlinear mapping that replaces the conventional linear $Q/K/V$ projections.
- Dual Activations: Two activations applied in progressively higher-dimensional spaces to enhance feature extraction.
- Lossless Growth: New capacity can be injected along two axes via zero-initialized blocks, so previously learned representations are preserved.
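The zero-initialized growth idea in the list above can be illustrated with a small sketch. The abstract does not specify the layer sizes, the two activations, or which two axes are grown, so everything concrete here is an assumption made for illustration: SiLU and ReLU stand in for the dual activations, the two growth axes are taken to be the two hidden widths, and the names `three_stage` and `grow` are hypothetical. This is not the authors' implementation, only a demonstration that new units with zero outgoing weights leave the output of a three-stage mapping unchanged.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def three_stage(x, W1, W2, W3):
    # Assumed shape of a three-stage nonlinear mapping (not taken from the paper).
    h1 = [silu(v) for v in matvec(W1, x)]   # stage 1: d_in -> r1
    h2 = [relu(v) for v in matvec(W2, h1)]  # stage 2: r1 -> r2, with r2 > r1
    return matvec(W3, h2)                   # stage 3: r2 -> d_out


def grow(W1, W2, W3, extra_r1, extra_r2, rng):
    d_in, r1 = len(W1[0]), len(W1)
    # Axis 1 (assumed): new stage-1 units receive random incoming weights...
    W1_new = [row[:] for row in W1] + rand_matrix(extra_r1, d_in, rng)
    # ...and zero outgoing weights, so existing stage-2 units see no change.
    W2_new = [row[:] + [0.0] * extra_r1 for row in W2]
    # Axis 2 (assumed): new stage-2 units receive random incoming weights...
    W2_new += rand_matrix(extra_r2, r1 + extra_r1, rng)
    # ...and zero outgoing weights, so the final output is preserved.
    W3_new = [row[:] + [0.0] * extra_r2 for row in W3]
    return W1_new, W2_new, W3_new


rng = random.Random(0)
d_in, r1, r2, d_out = 4, 6, 8, 4
W1 = rand_matrix(r1, d_in, rng)
W2 = rand_matrix(r2, r1, rng)
W3 = rand_matrix(d_out, r2, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_in)]

before = three_stage(x, W1, W2, W3)
G1, G2, G3 = grow(W1, W2, W3, extra_r1=2, extra_r2=4, rng=rng)
after = three_stage(x, G1, G2, G3)

# Largest difference between outputs before and after growth.
max_diff = max(abs(a - b) for a, b in zip(before, after))
print("max output difference after growth:", max_diff)
```

Because only the outgoing weights of the new units are zero, their incoming weights can start from a random initialization, and the new units begin contributing to the output once training moves the zero blocks away from zero.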
Performance and Efficiency
In experiments on language modeling and reasoning benchmarks, the abstract reports the following:
- Nexusformer matches Tokenformer’s perplexity while using less training compute.
- During progressive scaling from 240M to 440M parameters, the reduction in training compute reaches up to 41.5%.
- Zero initialization of the added blocks induces a stable convergence trajectory during growth.
Growth Dynamics and Scaling Laws
The authors also analyze how the model behaves as it grows. Building on the stable convergence trajectory that zero initialization induces, they report:
- A geometric scaling law derived from the growth dynamics.
- Accurate prediction of performance across expansion scales using that law.
Conclusion
Nexusformer targets a specific obstacle to Transformer scaling, namely the linear $Q/K/V$ projections inside attention. By replacing them with the nonlinear Nexus-Rank layer and growing the model through zero-initialized blocks, it lets a pretrained model expand without discarding what it has learned, and the reported compute savings suggest this can be cheaper than training each larger model from scratch. As research in this area continues, Nexusformer may serve as a foundation for further work on growable architectures.
