Nexusformer: Efficient Nonlinear Transformer Scaling Method

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

Summary: arXiv:2604.19147v1

Announce Type: cross

Abstract

Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism’s linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces.

This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer’s perplexity using up to 41.5% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.

Introduction

Transformers underpin most current work in natural language processing and many other AI domains, but growing them remains costly. Adding capacity usually means training a larger model from scratch, which discards the representations the smaller model has already learned and repeats compute that has already been spent.

Nexusformer Overview

Nexusformer targets the linear $Q/K/V$ projections in attention, which the authors identify as the main obstacle to growing a Transformer incrementally because they confine feature extraction to fixed-dimensional subspaces. The key ideas are:

Nexus-Rank Layer: A three-stage nonlinear mapping that replaces the conventional linear $Q/K/V$ projections.
Dual Activations: Two activations applied in progressively higher-dimensional spaces, so feature extraction is no longer limited to a fixed linear subspace.
Lossless Growth: New capacity is injected along two axes through zero-initialized blocks, so the expanded model starts from the representations it has already learned.
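The three ideas above can be made concrete with a minimal sketch. It is not the authors' implementation: the abstract does not give the layer's exact form, so the shapes, the choice of GELU, the names `nexus_rank_sketch`, `init`, and `grow`, and the reading of the "two axes" as the two hidden widths `r1` and `r2` are all assumptions made for illustration. The sketch also assumes that only the outgoing weights of new units are zeroed, so the output is unchanged at the moment of growth while the new units can still receive gradient.

```python
import math
import random

def gelu(v):
    # Exact GELU via the error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(W, x):
    # W is a list of rows (out_dim x in_dim); x is a list of length in_dim.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def nexus_rank_sketch(params, x):
    # Hypothetical three-stage mapping standing in for one of Q, K, or V.
    W1, W2, W3 = params
    h1 = [gelu(v) for v in linear(W1, x)]   # stage 1: d_model -> r1, first activation
    h2 = [gelu(v) for v in linear(W2, h1)]  # stage 2: r1 -> r2 (r2 > r1), second activation
    return linear(W3, h2)                   # stage 3: r2 -> d_head, linear read-out

def init(d_model, r1, r2, d_head, seed=0):
    rng = random.Random(seed)
    def mat(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    return mat(r1, d_model), mat(r2, r1), mat(d_head, r2)

def grow(params, extra_r1, extra_r2, seed=1):
    # Widen both hidden stages. Assumption: incoming weights of new units are
    # random, outgoing weights are zero, so the layer's output is preserved.
    rng = random.Random(seed)
    W1, W2, W3 = params
    d_model, r1 = len(W1[0]), len(W1)
    def rand_row(n):
        scale = 1.0 / math.sqrt(n)
        return [rng.gauss(0.0, scale) for _ in range(n)]
    # Axis 1: new stage-1 units, read by zero columns appended to W2.
    W1g = [row[:] for row in W1] + [rand_row(d_model) for _ in range(extra_r1)]
    W2g = [row[:] + [0.0] * extra_r1 for row in W2]
    # Axis 2: new stage-2 units, read by zero columns appended to W3.
    W2g += [rand_row(r1 + extra_r1) for _ in range(extra_r2)]
    W3g = [row[:] + [0.0] * extra_r2 for row in W3]
    return W1g, W2g, W3g

# Usage: the output before and after growth should agree.
params = init(d_model=8, r1=12, r2=16, d_head=4)
x = [0.1 * i for i in range(8)]
before = nexus_rank_sketch(params, x)
after = nexus_rank_sketch(grow(params, extra_r1=4, extra_r2=8), x)
max_diff = max(abs(a - b) for a, b in zip(before, after))
```

In this sketch the new units contribute nothing at first because every weight that reads them is zero, which is one simple way to obtain the function-preserving growth the paper describes. Whether Nexusformer zeroes the same blocks is not stated in the abstract.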

Performance and Efficiency

On the language modeling and reasoning benchmarks reported in the paper:

Nexusformer matches Tokenformer's perplexity while using less training compute.
During progressive scaling from 240M to 440M parameters, the saving reaches up to 41.5% of training compute, so in the best case Nexusformer needs about 58.5% of the compute to reach the same perplexity.
Zero initialization of the newly added blocks induces a stable convergence trajectory after each expansion.

Growth Dynamics and Scaling Laws

The paper also analyzes how the model behaves as it grows. Because newly added blocks start at zero, training after each expansion follows a stable convergence trajectory. According to the authors, this yields:

Predictable performance across different expansion scales.
A geometric scaling law, derived from these growth dynamics, that accurately predicts performance at each scale.

Conclusion

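Nexusformer replaces the linear $Q/K/V$ projections in attention with a nonlinear Nexus-Rank layer and grows that layer through zero-initialized blocks, so a trained model can be expanded without discarding what it has learned. In the reported experiments this matches Tokenformer's perplexity with up to 41.5% less training compute, and the stable growth dynamics support a geometric scaling law for predicting performance at larger scales. If these results hold more broadly, progressive growth could become a practical alternative to retraining larger models from scratch.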

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
