Dynamic Sparsity in Tree-Structured Feed-Forward Layers at Scale
arXiv:2604.08565v1
Abstract
At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer’s compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters.
Introduction
As transformer models have grown, so has interest in efficient computation strategies. The feed-forward (MLP) blocks are a major consumer of compute. This paper investigates replacing them with sparse, tree-structured layers that activate only an input-dependent subset of their units.
Methodology
The proposed method replaces each MLP block with a tree-structured feed-forward layer that performs conditional computation. Hard hierarchical routing activates only a subset of the block’s units for each token, reducing the computational load while maintaining performance. Routing requires no separate router network, which keeps the architecture simple.
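To make the routing idea concrete, here is a minimal sketch of one way a tree-structured feed-forward layer with hard routing could look. It is an illustration under stated assumptions, not the paper's implementation: the complete binary tree, the single hidden unit per node, the ReLU nonlinearity, the sign-based routing rule, and the names make_tree_ffn and tree_ffn_forward are all hypothetical choices made for this example.

```python
import math
import random


def make_tree_ffn(depth, d_model, seed=0):
    """Build hypothetical parameters for a complete binary tree.

    Nodes are stored in heap order: node i has children 2*i+1 and 2*i+2.
    Each node holds one input weight vector and one output weight vector,
    i.e. a single hidden unit per node (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n_nodes = 2 ** (depth + 1) - 1
    scale = 1.0 / math.sqrt(d_model)
    w_in = [[rng.gauss(0.0, scale) for _ in range(d_model)]
            for _ in range(n_nodes)]
    w_out = [[rng.gauss(0.0, scale) for _ in range(d_model)]
             for _ in range(n_nodes)]
    return {"depth": depth, "w_in": w_in, "w_out": w_out}


def tree_ffn_forward(params, x):
    """Evaluate one token vector x along a single root-to-leaf path.

    Returns (output vector, list of visited node indices).
    """
    y = [0.0] * len(x)
    node = 0
    path = []
    for _ in range(params["depth"] + 1):
        path.append(node)
        # Pre-activation of this node's single hidden unit.
        z = sum(w * xi for w, xi in zip(params["w_in"][node], x))
        # Asymmetric nonlinearity (ReLU is an assumption here).
        a = max(z, 0.0)
        for j, w in enumerate(params["w_out"][node]):
            y[j] += a * w
        # Hard routing: the sign of the same pre-activation selects the
        # child, so no separate router network is involved.
        node = 2 * node + (2 if z > 0 else 1)
    return y, path


params = make_tree_ffn(depth=3, d_model=8)
output, visited = tree_ffn_forward(params, [0.1] * 8)
```

In this sketch only the depth + 1 nodes on the visited path are evaluated for a token; every other node in the tree is skipped.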
Results
Our experiments demonstrate that models activating fewer than 5% of the feed-forward block’s units per token match the performance of dense baselines. This holds for autoregressive language modeling and for question answering tasks. The approach has also been validated at scale, on models with over 1 billion parameters.
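A short calculation shows how a tree can reach activation rates of this order. The sketch below assumes a single complete binary tree with one unit per node and exactly one root-to-leaf path evaluated per token; the paper's actual layer configuration is not given in this summary, so the numbers are illustrative only. The function names are hypothetical.

```python
def active_fraction(depth):
    """Fraction of tree nodes touched by one root-to-leaf path."""
    total_nodes = 2 ** (depth + 1) - 1
    active_nodes = depth + 1
    return active_nodes / total_nodes


def min_depth_for_fraction(target):
    """Smallest depth whose single-path active fraction is below target."""
    depth = 0
    while active_fraction(depth) >= target:
        depth += 1
    return depth


shallowest = min_depth_for_fraction(0.05)
```

Under these assumptions, a tree of depth 7 (255 nodes, 8 of them active per token, about 3.1%) is the shallowest that stays below 5%, since the active count grows linearly with depth while the node count grows exponentially.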
Emergent Auto-Pruning Effect
A notable finding of this research is an emergent auto-pruning effect. The interaction between hard routing and asymmetric nonlinearities gradually deactivates unused paths within the network, so that dynamic routing transitions into a form of static structural sparsity.
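One simple way to observe such an effect is to track how evenly tokens are distributed across the leaves of a tree. The helper below is a hypothetical diagnostic, not a procedure described in the source: it assumes that the leaf reached by each token has been recorded as an integer index, and it reports the number of leaves that were never reached together with a normalized usage entropy.

```python
import math
from collections import Counter


def leaf_usage_stats(leaf_ids, n_leaves):
    """Summarize how evenly tokens are spread across leaves.

    leaf_ids: iterable of leaf indices in range(n_leaves), one per token.
    Returns (number of unused leaves, normalized entropy in [0, 1]).
    """
    counts = Counter(leaf_ids)
    total = sum(counts.values())
    unused = n_leaves - len(counts)
    # Shannon entropy of the empirical leaf distribution.
    entropy = sum((c / total) * math.log(total / c)
                  for c in counts.values())
    max_entropy = math.log(n_leaves) if n_leaves > 1 else 1.0
    return unused, entropy / max_entropy


balanced = leaf_usage_stats([0, 1, 2, 3], n_leaves=4)
collapsed = leaf_usage_stats([2, 2, 2, 2], n_leaves=4)
```

A normalized entropy near 1 indicates a balanced tree, while a value near 0 together with many unused leaves indicates that routing has collapsed onto a few fixed paths.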
Architectural Choices
Further analysis reveals that specific architectural choices influence the auto-pruning behavior. By adjusting these choices, it is possible to recover balanced tree structures without additional loss functions. This gives practitioners a way to control the effect through the architecture alone.
Conclusion
In conclusion, tree-structured feed-forward layers are a promising way to reduce the computational demands of large transformer models. The findings indicate that conditional sparsity can maintain performance while improving efficiency, which becomes more relevant as models continue to grow.
Future Work
Future research will explore further optimizations of the routing mechanism and investigate applications in domains beyond language modeling. The insights from this study provide a basis for more efficient transformer models.
