MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Large language models (LLMs) built on Mixture-of-Experts (MoE) architectures deliver strong performance, but at high memory and computational cost. MoBiE is a recently proposed binarization framework designed specifically for MoE-based LLMs to reduce that cost.
Background
Mixture-of-Experts models activate only a subset of their experts for each input, which improves performance while limiting the compute spent per token. However, binarization methods that work well for dense LLMs run into challenges specific to MoE architectures:
- Cross-expert redundancy, which can lead to inefficient use of resources.
- Task-agnostic importance estimation, which scores weights without adapting to the task at hand.
- Routing shifts induced by quantization, which can disrupt the model’s inference quality.
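The routing-shift problem in the last item can be illustrated with a small sketch. It is not MoBiE's method: it assumes a plain sign binarization with one scale per output row (the mean absolute value, which is the least-squares scale for a sign approximation), a toy upstream linear layer, and a full-precision top-2 router. All names (`binarize`, `top_k_experts`, `routing_agreement`) and the random data are hypothetical. The sketch measures how often the selected experts stay the same once the upstream layer is binarized.

```python
import numpy as np

def binarize(w):
    """Sign binarization with one scale per output row: w ~ alpha * sign(w)."""
    alpha = np.abs(w).mean(axis=1, keepdims=True)   # least-squares scale per row
    signs = np.where(w >= 0, 1.0, -1.0)
    return alpha * signs

def top_k_experts(hidden, router, k):
    """Return the sorted indices of the k highest-scoring experts per token."""
    logits = hidden @ router.T                      # (tokens, experts)
    top = np.argsort(-logits, axis=1)[:, :k]
    return np.sort(top, axis=1)

def routing_agreement(x, w_upstream, router, k=2):
    """Fraction of tokens whose top-k expert set survives binarization."""
    h_fp = x @ w_upstream.T                         # full-precision hidden state
    h_q = x @ binarize(w_upstream).T                # hidden state after binarization
    fp = top_k_experts(h_fp, router, k)
    q = top_k_experts(h_q, router, k)
    return float(np.mean(np.all(fp == q, axis=1)))

# Toy data (assumption): 256 tokens, hidden size 64, 8 experts.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))
w_upstream = rng.standard_normal((64, 64))
router = rng.standard_normal((8, 64))

agreement = routing_agreement(x, w_upstream, router)
print(f"tokens with unchanged top-2 experts: {agreement:.1%}")
```

Any value below 100% means some tokens are sent to different experts purely because of quantization error upstream of the router, which is the distortion the third challenge describes.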
Innovations of MoBiE
MoBiE introduces three core innovations to address these issues:
- Joint SVD Decomposition: Decomposing the experts jointly reduces cross-expert redundancy and yields a more efficient representation of the model weights.
- Global Loss Gradients with Local Hessian Metrics: Combining these two signals improves weight importance estimation, so that the most important weights can be prioritized during binarization.
- Error Constraint Guided by Input Null Space: Constraining the quantization error using the null space of the input mitigates the routing distortion caused by quantization.
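Two of these ideas have a simple linear-algebra core that a short sketch can make concrete. The code below is an illustration under stated assumptions, not MoBiE's actual algorithm: it assumes that "joint SVD" means stacking the expert matrices and factoring them over one shared basis, and that the null-space constraint means projecting a weight error onto directions orthogonal to a set of calibration inputs. The function names, shapes, rank, and random data are all hypothetical.

```python
import numpy as np

def joint_svd(expert_weights, rank):
    """Factor stacked experts as W_i ~ A_i @ B with one shared basis B."""
    stacked = np.concatenate(expert_weights, axis=0)      # (n * d_out, d_in)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    shared_basis = vt[:rank]                              # (rank, d_in)
    coeffs = u[:, :rank] * s[:rank]                       # (n * d_out, rank)
    per_expert = np.split(coeffs, len(expert_weights), axis=0)
    return per_expert, shared_basis

def null_space_projector(inputs, tol=1e-8):
    """Projector onto directions orthogonal to every row of `inputs`."""
    _, s, vt = np.linalg.svd(inputs, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    row_space = vt[:rank]                                 # (rank, d_in)
    return np.eye(inputs.shape[1]) - row_space.T @ row_space

rng = np.random.default_rng(0)

# Three toy experts that share a rank-4 basis (assumption made for the demo).
basis = rng.standard_normal((4, 16))
experts = [rng.standard_normal((6, 4)) @ basis for _ in range(3)]
coeffs, shared = joint_svd(experts, rank=4)
recon_err = max(np.abs(a @ shared - w).max() for a, w in zip(coeffs, experts))
print("max reconstruction error:", recon_err)

# A weight error confined to the input null space leaves outputs unchanged
# on the calibration inputs, so downstream routing decisions see no change.
calib = rng.standard_normal((8, 16))                      # 8 tokens, d_in = 16
projector = null_space_projector(calib)
error = rng.standard_normal((6, 16))                      # stand-in quantization error
constrained = error @ projector
print("max output change:", np.abs(calib @ constrained.T).max())
```

In this toy setting the shared basis is stored once rather than once per expert, and the projected error produces a numerically zero output change on the calibration tokens. How MoBiE combines these pieces with binarization is described in the original work and is not reproduced here.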
Performance Evaluation
MoBiE was evaluated across multiple benchmarks and MoE-based LLMs. The reported results include:
- On the Qwen3-30B-A3B model, MoBiE reduced perplexity by 52.2%.
- It improved average zero-shot performance by 43.4%.
- It delivered more than a 2x inference speedup.
- It also significantly reduced quantization time.
Conclusion
MoBiE addresses the efficiency challenges of MoE architectures without incurring additional storage costs. Together, its three techniques preserve model quality while reducing resource usage. The code for MoBiE is publicly available on GitHub, so researchers and practitioners can explore it further.
