Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
arXiv:2604.06515v1
Abstract
Sparse Mixture-of-Experts (MoE) allows for the efficient scaling of language and vision models by activating only a small subset of experts per input. While this approach reduces computation, the large number of parameters still incurs substantial memory overhead during inference. To address this issue, post-training quantization has been explored. However, uniform quantization suffers from significant accuracy loss at low bit-widths. Recently, mixed-precision methods have been developed, but they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts.
Introduction
Achieving efficiency without sacrificing accuracy has become a central concern for large models. The Sparse Mixture-of-Experts architecture addresses it by activating only a small subset of experts for each input. This selective activation reduces computation per input, but the large number of expert parameters still incurs substantial memory overhead during inference.
Challenges with Quantization
Post-training quantization alleviates the memory burden by reducing the precision of model weights. Uniform quantization, which applies the same bit-width everywhere, suffers notable accuracy degradation at low bit-widths. Mixed-precision methods respond by assigning different bit-widths to different parts of a model. These methods, however, often require substantial computation to decide the bit-width allocation, and they overlook how differently model performance responds to the quantization of individual experts, which leads to suboptimal results.
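To make the accuracy loss at low bit-widths concrete, here is a minimal sketch of symmetric uniform quantization applied to synthetic Gaussian weights. The function names, the max-abs scaling rule, and the synthetic data are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization followed by dequantization.

    Assumes bits >= 2. Weights are rounded onto integer levels in
    [-qmax, qmax] and mapped back to floats with a single shared scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one expert's weights (illustrative only).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Reconstruction error at progressively lower bit-widths.
errors = {bits: mse(weights, quantize_uniform(weights, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.6f}")
```

The reconstruction error grows as the bit-width shrinks, which is the effect that motivates spending more bits only where they matter.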
Proposed Strategy
To address these limitations, the paper proposes a theoretically grounded expert-wise mixed-precision strategy. The method assigns a bit-width to each expert based primarily on the change in the L2 norm of that expert’s router during training. The rationale is that experts whose routers change less tend to capture less frequent but critical features. Model performance is therefore more sensitive to the quantization of these experts, and they are allocated higher precision.
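The allocation rule described above can be sketched as follows. This is an illustrative reading, not the paper's algorithm: it assumes "change in L2 norm" means the trained norm minus the initial norm of each expert's router vector, and the two bit-widths, the 50% split, and all names (`allocate_bits`, `high_fraction`, and so on) are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def allocate_bits(router_init, router_trained,
                  high_bits=4, low_bits=2, high_fraction=0.5):
    """Give high_bits to the experts whose router norm changed least.

    router_init / router_trained: one router weight vector per expert,
    before and after training (hypothetical inputs for this sketch).
    """
    changes = [l2_norm(t) - l2_norm(i)
               for i, t in zip(router_init, router_trained)]
    # Experts sorted from smallest to largest norm change.
    order = sorted(range(len(changes)), key=lambda e: changes[e])
    n_high = round(high_fraction * len(changes))
    bits = [low_bits] * len(changes)
    for expert in order[:n_high]:
        bits[expert] = high_bits
    return bits, changes


# Four toy experts that start from the same router vector.
router_init = [[0.1, 0.1]] * 4
router_trained = [[0.9, 0.9], [0.15, 0.1], [0.5, 0.6], [0.1, 0.12]]

bits, changes = allocate_bits(router_init, router_trained)
print(bits)  # [2, 4, 2, 4]
```

Experts 1 and 3 change least and receive the higher precision. Because the rule only compares norms that are already available after training, a scheme of this shape needs no calibration passes, which is consistent with the low assignment overhead the article reports.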
Key Considerations
In addition to the router-norm criterion, the strategy takes a second factor into account: the maximum intra-neuron variance of each expert. Experts with high variance are allocated higher precision to suppress quantization noise. Combining the change in router norm with the intra-neuron variance yields a bit-width allocation that reflects both how sensitive the model is to each expert and how much noise quantizing that expert would introduce.
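A rough sketch of the variance criterion follows. It assumes that "intra-neuron variance" is the variance of the weights belonging to a single neuron and that an expert is summarized by the maximum over its neurons. The threshold-based promotion rule and all names are hypothetical, since the article does not say how the two criteria are combined.

```python
from statistics import pvariance


def max_intra_neuron_variance(expert_weights):
    """Largest per-neuron weight variance within one expert.

    expert_weights: list of neurons, each a list of that neuron's weights.
    """
    return max(pvariance(neuron) for neuron in expert_weights)


def promote_high_variance(base_bits, variances, threshold, high_bits=4):
    """Raise an expert to high_bits when its variance exceeds threshold.

    Hypothetical combination rule, used here only for illustration.
    """
    return [max(b, high_bits) if v > threshold else b
            for b, v in zip(base_bits, variances)]


experts = [
    [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]],   # flat neurons, zero variance
    [[0.0, 0.0, 0.0], [-1.0, 0.0, 1.0]],  # one widely spread neuron
]
variances = [max_intra_neuron_variance(e) for e in experts]
bits = promote_high_variance([2, 2], variances, threshold=0.1)
print(bits)  # [2, 4]
```

A widely spread neuron forces a coarser quantization step at a fixed bit-width, so giving such experts more bits is one plausible way to limit the added noise.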
Empirical Results
Experiments on large-scale MoE models, including the Switch Transformer and Mixtral, demonstrate the effectiveness of the strategy. The reported results indicate that the method achieves higher accuracy than existing approaches while also reducing inference cost, and that the overhead of bit-width assignment is negligible.
Conclusion
The expert-wise mixed-precision strategy offers a theoretically grounded way to quantize Mixture-of-Experts models efficiently. By allocating precision according to the sensitivity of individual experts and to the quantization noise they would incur, it supports scaling MoE models in both language and vision domains without trading accuracy for efficiency.
