OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
Serving Large Language Models (LLMs) at high throughput has driven interest in aggressive quantization, in particular 4-bit weights and activations (W4A4). The main obstacle is activation outliers: values that exceed the restricted dynamic range of low-bit formats and cause considerable degradation in model accuracy. OSC (Outlier Separation in Channel Dimension) is a framework proposed to address this problem and make 4-bit quantization practical on hardware.
Understanding Outlier Behavior
Recent research has systematically explored the spatial distribution of outliers in LLMs. The findings reveal a token-persistent structural clustering effect, indicating that high-magnitude outliers tend to occupy fixed channels consistently across different input tokens. This insight suggests that there is a predictable pattern to outlier occurrences, which can be leveraged to improve quantization strategies.
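To make the notion of token persistence concrete, the following is a minimal sketch of how one might measure it. It is not the paper's analysis procedure: the function name, the top-k criterion, the 0.9 threshold, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def outlier_channel_persistence(acts, k=2):
    """For each channel, return the fraction of tokens in which it ranks
    among that token's k largest-magnitude channels.

    acts: array of shape (num_tokens, num_channels).
    """
    num_tokens, num_channels = acts.shape
    # Indices of the k largest |activation| values in every token (row).
    topk = np.argpartition(np.abs(acts), -k, axis=1)[:, -k:]
    counts = np.bincount(topk.ravel(), minlength=num_channels)
    return counts / num_tokens

# Synthetic activations (illustrative only): 256 tokens, 64 channels,
# with channels 5 and 40 scaled up to mimic fixed outlier channels.
rng = np.random.default_rng(0)
acts = rng.standard_normal((256, 64))
acts[:, [5, 40]] *= 100.0

persistence = outlier_channel_persistence(acts, k=2)
# Channels that are among the top-k for more than 90% of tokens.
persistent_channels = np.flatnonzero(persistence > 0.9)
print(persistent_channels)
```

A channel whose persistence is close to 1 carries outliers for nearly every token, which is the property that allows outlier channels to be fixed ahead of time rather than searched for at run time.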
The OSC Framework
Building on the understanding of outlier behavior, OSC implements a dual-path computation methodology during inference. This approach comprises two distinct paths:
- Low-Precision Path: A 4-bit General Matrix Multiplication (GEMM) path that processes the majority of the data with minimal computational overhead.
- High-Precision Path: A 16-bit GEMM branch that handles the channels where outliers are located, ensuring that the model maintains accuracy where it is most needed.
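The two paths above can be sketched numerically as follows. This is a simulation under stated assumptions, not OSC's kernel: quantization is emulated in floating point ("fake quantization"), the scaling scheme (one symmetric scale per token and per output column) is an assumption, and `fake_quant_int4` and `dual_path_matmul` are hypothetical names.

```python
import numpy as np

def fake_quant_int4(x, axis):
    """Symmetric 4-bit fake quantization with one scale per slice along `axis`."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    return np.clip(np.round(x / scale), -8, 7) * scale

def dual_path_matmul(x, w, outlier_channels):
    """x: (tokens, in_channels), w: (in_channels, out_features).

    Splits the input-channel dimension into outlier and regular channels,
    runs each part at its own precision, and sums the two partial products.
    """
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[np.asarray(outlier_channels, dtype=np.intp)] = True
    # Low-precision path: regular channels, emulated 4-bit operands.
    low = fake_quant_int4(x[:, ~mask], axis=1) @ fake_quant_int4(w[~mask, :], axis=0)
    # High-precision path: outlier channels rounded to 16-bit floats.
    x_hi = x[:, mask].astype(np.float16).astype(np.float32)
    w_hi = w[mask, :].astype(np.float16).astype(np.float32)
    return low + x_hi @ w_hi

# Synthetic data (illustrative only) with outliers in channels 5 and 40.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64)).astype(np.float32)
x[:, [5, 40]] *= 50.0
w = rng.standard_normal((64, 16)).astype(np.float32)

ref = x @ w
err_plain = np.abs(dual_path_matmul(x, w, []) - ref).mean()      # all channels in 4-bit
err_dual = np.abs(dual_path_matmul(x, w, [5, 40]) - ref).mean()  # outliers in 16-bit
print(err_plain, err_dual)
```

Because the outlier channels no longer set the quantization scale of the low-precision operand, the remaining channels use the 4-bit range far more effectively, which is why the dual-path error is expected to be much smaller in this toy setting.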
To identify the channels where outliers are prevalent, OSC employs an offline group-wise strategy. This process involves structured sub-tensor extraction, which consolidates scattered activation channels into a compact dense tensor. By doing so, OSC provides outlier protection through regularized, high-throughput GEMM operations, making it compatible with contemporary 4-bit micro-scaling hardware.
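The offline selection and extraction steps can be illustrated with a short sketch. It shows one plausible reading of "group-wise", namely picking a fixed number of channels from each contiguous channel group; the group size, the peak-magnitude criterion, and the function names are assumptions rather than details taken from the paper.

```python
import numpy as np

def select_outlier_channels(calib_acts, group_size=16, per_group=1):
    """Offline step: inside each contiguous group of channels, pick the
    `per_group` channels with the largest peak magnitude over the
    calibration tokens. Returns sorted global channel indices."""
    num_channels = calib_acts.shape[1]
    if num_channels % group_size != 0:
        raise ValueError("num_channels must be a multiple of group_size")
    peak = np.max(np.abs(calib_acts), axis=0)            # (channels,)
    groups = peak.reshape(-1, group_size)                # (num_groups, group_size)
    local = np.argsort(groups, axis=1)[:, -per_group:]   # index within each group
    offsets = np.arange(groups.shape[0])[:, None] * group_size
    return np.sort((local + offsets).ravel())

def extract_sub_tensor(acts, channels):
    """Gather the scattered channels into one compact, contiguous tensor."""
    return np.ascontiguousarray(acts[:, channels])

# Synthetic calibration data (illustrative only): outliers in channels 5 and 40.
rng = np.random.default_rng(0)
calib = rng.standard_normal((128, 64)).astype(np.float32)
calib[:, [5, 40]] *= 50.0

channels = select_outlier_channels(calib, group_size=16, per_group=1)
dense = extract_sub_tensor(calib, channels)
print(channels, dense.shape)
```

Choosing a fixed number of channels per group keeps the extracted tensor's shape constant and known ahead of time, so the high-precision branch can run as an ordinary dense GEMM rather than as an irregular sparse operation.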
Fallback Strategy and Performance Evaluation
Where outlier clustering is less pronounced, for example with W2 inputs, OSC falls back to FP8. This keeps the framework robust in cases where separating a small set of channels does not capture the outliers.
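The source does not state how the fallback decision is made. Purely as an illustration, the sketch below uses a hypothetical heuristic: measure what share of the activation energy sits in the few highest-energy channels, and fall back when that share is small. The function name, the energy criterion, and the 0.5 threshold are all assumptions.

```python
import numpy as np

def choose_precision(calib_acts, num_outlier_channels=2, min_energy_share=0.5):
    """Hypothetical rule: use the dual-path W4A4 scheme only when a few
    channels hold most of the activation energy; otherwise use FP8."""
    energy = np.sum(calib_acts.astype(np.float64) ** 2, axis=0)  # per channel
    top = np.sort(energy)[-num_outlier_channels:]
    share = float(top.sum() / energy.sum())
    mode = "w4a4_dual_path" if share >= min_energy_share else "fp8_fallback"
    return mode, share

rng = np.random.default_rng(0)

# Case 1: outliers clustered in two fixed channels.
clustered = rng.standard_normal((256, 64))
clustered[:, [5, 40]] *= 50.0

# Case 2: one outlier per token, but in a random channel each time.
scattered = rng.standard_normal((256, 64))
cols = rng.integers(0, 64, size=256)
scattered[np.arange(256), cols] *= 50.0

mode_clustered, share_clustered = choose_precision(clustered)
mode_scattered, share_scattered = choose_precision(scattered)
print(mode_clustered, mode_scattered)
```

In the scattered case no small set of channels captures the outliers, so extracting channels would protect little, which is the situation in which a uniformly higher-precision format is the safer choice.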
OSC was evaluated on the Qwen3-8B and Qwen3-30B models. The average accuracy drop was limited to 2.19 points for Qwen3-8B and 1.12 points for Qwen3-30B, indicating that OSC mitigates the impact of outliers while keeping accuracy loss small.
Conclusion
A notable advantage of OSC lies in its hardware efficiency: it achieves a peak speedup of 1.78× over the conventional W8A8 GEMM baseline on modern AI accelerators. Combined with the limited accuracy drop, this makes OSC a promising technique for deploying low-bit quantized Large Language Models, improving speed while keeping accuracy loss small in real-world applications.
