Self-Routing: Parameter-Free Expert Routing from Hidden States
arXiv:2604.00421v1
Abstract
Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged.
Introduction
The Mixture-of-Experts (MoE) model is one of the architectures researchers have explored in pursuit of more efficient models. An MoE layer increases capacity by activating only a small subset of experts per token, and it typically relies on a learned router to map hidden states to expert assignments. This work asks whether that dedicated learned router is strictly necessary in the MoE settings studied.
Proposed Methodology: Self-Routing
Our approach, Self-Routing, is a parameter-free routing mechanism that removes the dedicated learned router. Instead, a designated subspace of the token hidden state is used directly as the expert logits. This eliminates the router projection entirely while leaving the rest of the MoE layer unchanged. The reported properties of Self-Routing are:
- Eliminates the dedicated router projection, reducing the number of parameters.
- Remains competitive with the learned-router baseline in the settings evaluated.
- Yields more balanced expert utilization without an explicit load-balancing loss.
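The routing step described above can be sketched for a single token in plain Python. This is a minimal illustration, not the paper's implementation: the choice of the first `num_experts` hidden dimensions as the designated subspace, the top-k selection, and the softmax over the selected logits are all assumptions made here, and the function name `self_route` is hypothetical.

```python
import math


def self_route(hidden, num_experts, top_k=2):
    """Route one token using a slice of its own hidden state as expert logits.

    Assumption: the designated subspace is the first `num_experts` dimensions.
    Returns a list of (expert_index, gate_weight) pairs, highest logit first.
    """
    if not 0 < top_k <= num_experts <= len(hidden):
        raise ValueError("need 0 < top_k <= num_experts <= len(hidden)")

    # No router projection: the logits are read straight from the hidden state.
    logits = hidden[:num_experts]

    # Keep the top_k experts with the largest logits.
    chosen = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)[:top_k]

    # Softmax over the selected logits only (an assumed gating choice),
    # shifted by the max for numerical stability.
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


if __name__ == "__main__":
    token = [0.1, 2.0, -1.0, 1.5, 0.3, 0.7, -0.2, 0.9]
    print(self_route(token, num_experts=4, top_k=2))
```

For comparison, a learned linear router without bias would compute the same logits as a matrix-vector product and would carry `num_experts * len(hidden)` extra weights per MoE layer. The sketch above carries none.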
Evaluation and Results
We evaluated Self-Routing on two benchmarks: GPT-2-scale language modeling and ImageNet-1K classification. Comparisons included a standard learned router, random-routing baselines, and dense non-MoE baselines. The main results:
- Self-Routing showed competitive performance against the learned-router baseline.
- It achieved 17% higher average normalized routing entropy, indicating more balanced expert utilization.
- No explicit load-balancing loss was necessary for maintaining expert utilization.
- On ImageNet-1K with DeiT-S/16, Self-Routing slightly outperformed the learned-router MoE.
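To make the entropy figure above concrete, here is one common way to compute a normalized routing entropy. The exact definition used in the paper is not given in this summary, so the formula below is an assumption: the Shannon entropy of the empirical expert-assignment distribution divided by its maximum, log of the number of experts. The function name is hypothetical.

```python
import math
from collections import Counter


def normalized_routing_entropy(assignments, num_experts):
    """Entropy of expert usage, scaled to [0, 1].

    `assignments` is a flat list of expert indices, one per routed token slot.
    A value of 1.0 means perfectly uniform usage; 0.0 means a single expert
    received every assignment.
    """
    if not assignments or num_experts < 2:
        raise ValueError("need at least one assignment and two experts")

    total = len(assignments)
    counts = Counter(assignments)

    # Experts that were never chosen contribute zero to the entropy sum.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_experts)


if __name__ == "__main__":
    print(normalized_routing_entropy([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4))
    print(normalized_routing_entropy([0, 0, 0, 1], num_experts=4))
```

Under this definition, a router that spreads tokens more evenly across experts scores closer to 1, which is why a higher value is read as better load balance.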
Conclusions
Our findings suggest that, in the settings studied, effective MoE routing can emerge from the hidden representations themselves, without a separate learned router module. This simplifies the MoE architecture by removing the router projection while keeping performance competitive with a learned router. Because the evaluation covers both language modeling and image classification, the approach may be relevant to future work in natural language processing and computer vision.
