A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
arXiv:2604.13440v1
As Large Language Models (LLMs) are increasingly deployed on edge devices, compute and memory budgets become the limiting factor, constraining real-time processing and on-device intelligence. Hybrid architectures that combine Structured State Space Models (SSMs) with transformer-based LLMs have emerged as a promising way to balance efficiency and performance.
A central challenge in this domain is aggressive quantization, which can substantially reduce model size and accelerate inference. Because quantization affects model components unevenly, it has to be applied selectively to limit performance degradation.
Proposed Framework
To address these challenges, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework that identifies the components of hybrid SSM-Transformer models most vulnerable to quantization-induced degradation. The method relies only on forward-pass metrics, so it avoids expensive gradient computation and retraining. This makes the framework well suited to settings where in-domain data is limited by proprietary restrictions or privacy concerns.
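A forward-only sensitivity scan of this kind can be sketched in a few lines. The example below is an illustrative toy, not the released implementation: the tanh MLP stands in for a real hybrid model, and `fake_quantize`, `forward`, and `layer_sensitivity` are hypothetical helper names. It assumes symmetric per-tensor round-to-nearest quantization and scores each layer by the mean KL divergence between the full-precision output distribution and the output obtained when only that layer is quantized.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); eps guards against log(0) if a probability underflows.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fake_quantize(matrix, bits):
    """Quantize to signed `bits`-bit integers and dequantize (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in matrix for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
            for row in matrix]


def forward(layers, x):
    """Toy stand-in for a model: tanh MLP whose last layer emits logits."""
    for i, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def layer_sensitivity(layers, inputs, bits=4):
    """Mean output KL when one layer at a time is quantized. Forward passes only."""
    baseline = [softmax(forward(layers, x)) for x in inputs]
    scores = []
    for i in range(len(layers)):
        perturbed = list(layers)
        perturbed[i] = fake_quantize(layers[i], bits)
        kls = [kl_divergence(p, softmax(forward(perturbed, x)))
               for p, x in zip(baseline, inputs)]
        scores.append(sum(kls) / len(kls))
    return scores


rng = random.Random(0)
layers = [[[rng.gauss(0, 1) for _ in range(8)] for _ in range(8)] for _ in range(3)]
inputs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]

scores = layer_sensitivity(layers, inputs, bits=4)
# Layer indices ordered from most to least sensitive.
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print(ranking, scores)
```

No gradients or labels are involved: the scan costs one baseline forward pass plus one forward pass per layer for each calibration input, which is why a small, generic calibration set can be enough for this style of analysis.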
Key Findings
Our formal analysis shows that, for language modeling tasks, the Kullback-Leibler (KL) divergence captures quantization sensitivity more effectively than traditional alternatives such as:
- Mean Squared Error (MSE)
- Signal-to-Quantization-Noise Ratio (SQNR)
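One intuition for why a distribution-level metric can rank errors differently from MSE or SQNR is that softmax is invariant to a uniform shift of the logits. The toy example below is not the formal analysis referred to above, and all values are made up for illustration. It builds two logit perturbations with identical MSE and SQNR: one leaves the predicted distribution unchanged, the other erases the margin between the top two tokens.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mse(ref, noisy):
    return sum((r - n) ** 2 for r, n in zip(ref, noisy)) / len(ref)


def sqnr_db(ref, noisy):
    signal = sum(r * r for r in ref)
    noise = sum((r - n) ** 2 for r, n in zip(ref, noisy))
    return 10 * math.log10(signal / noise)


logits = [2.0, 1.0, 0.0, -1.0]       # illustrative full-precision logits
shifted = [z + 0.5 for z in logits]  # uniform shift: softmax is unchanged
flattened = [1.0, 1.0, 0.0, -1.0]    # top logit pulled down to the runner-up

p = softmax(logits)
for name, noisy in [("shifted", shifted), ("flattened", flattened)]:
    print(name,
          "MSE:", mse(logits, noisy),
          "SQNR(dB):", sqnr_db(logits, noisy),
          "KL:", kl_divergence(p, softmax(noisy)))
```

Both perturbations look equally severe under MSE and SQNR, but only the second one changes what the model would predict, and only KL separates the two.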
In comprehensive experiments and ablation studies on SSM and hybrid architectures, KL-based rankings align with the observed performance drops and outperform rankings produced by the alternative metrics.
Real-World Validation
To validate the approach in practice, we ran on-device profiling on Intel Lunar Lake hardware. KL-guided mixed-precision quantization achieves perplexity close to the FP16 baseline while keeping model size and throughput competitive with uniform INT4 in both CPU and GPU execution modes.
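How a sensitivity ranking might be turned into a mixed-precision plan can be sketched as a greedy allocation. The rule below is an assumption made for illustration, since the summary does not state the allocation policy actually used: every layer starts at INT4, and layers are promoted to INT8 in descending order of KL sensitivity while a total size budget allows. The layer names, scores, and parameter counts are hypothetical.

```python
def assign_precision(sensitivity, n_params, budget_bits, low_bits=4, high_bits=8):
    """Greedy sketch: promote the most sensitive layers first, within a bit budget."""
    bits = {name: low_bits for name in sensitivity}
    used = sum(n_params[name] * low_bits for name in sensitivity)
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = n_params[name] * (high_bits - low_bits)
        if used + extra <= budget_bits:
            bits[name] = high_bits
            used += extra
    return bits, used


# Hypothetical per-layer KL scores and parameter counts.
sensitivity = {"ssm.0": 0.30, "attn.0": 0.05, "mlp.0": 0.12, "ssm.1": 0.20}
n_params = {"ssm.0": 1000, "attn.0": 4000, "mlp.0": 8000, "ssm.1": 1000}

bits, used = assign_precision(sensitivity, n_params, budget_bits=66000)
print(bits, used)
```

With these made-up numbers the two small, highly sensitive layers are kept at 8 bits and the large, less sensitive layers stay at 4 bits, so the total size stays close to the uniform INT4 footprint. A real deployment would also need to account for per-group scale storage and for which precisions the target runtime supports.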
Conclusion
The framework enables practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss, supporting more robust on-device intelligence and real-time processing.
The code for the framework is available at https://github.com/jasonkongie/kl-ssm-quant.
