TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling
Large language models (LLMs) achieve strong results across natural language processing benchmarks, but their computational and memory requirements make them hard to deploy on edge devices and in other resource-constrained environments. TernaryLM addresses this limitation: it is a 132-million-parameter transformer that reduces memory usage through native ternary quantization.
TernaryLM constrains each model weight to one of three values: -1, 0, or +1. A weight with three possible states carries at most log2(3) ≈ 1.58 bits of information, so the representation substantially reduces memory consumption while retaining language modeling capability comparable to that of full-precision models.
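The 1.58-bit figure can be made concrete with a small sketch. The code below is illustrative only and is not taken from the TernaryLM implementation: the function names (`ternarize`, `pack5`, `unpack5`) are hypothetical, and the absmean scaling with round-to-nearest is an assumed quantization rule, since the text above does not specify one. It maps real-valued weights to {-1, 0, +1} and then packs five ternary codes into one byte, which works because 3^5 = 243 fits in the 256 values of a byte and gives 8/5 = 1.6 bits per weight, close to the log2(3) bound.

```python
import math


def ternarize(weights):
    """Map real-valued weights to codes in {-1, 0, +1} plus one scale factor.

    Assumption: absmean scaling with round-to-nearest. This is a common
    choice for ternary models, not necessarily the rule TernaryLM uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Divide by the scale, round, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def pack5(codes):
    """Pack ternary codes five per byte as base-3 digits (3**5 = 243 <= 256)."""
    out = bytearray()
    for i in range(0, len(codes), 5):
        value = 0
        for c in reversed(codes[i:i + 5]):
            value = value * 3 + (c + 1)  # shift each code to {0, 1, 2}
        out.append(value)
    return bytes(out)


def unpack5(data, n):
    """Inverse of pack5; n is the original number of codes."""
    codes = []
    for byte in data:
        for _ in range(5):
            codes.append(byte % 3 - 1)  # lowest base-3 digit first
            byte //= 3
    return codes[:n]


weights = [0.9, -0.05, -1.2, 0.3, 0.02, -0.7]  # made-up example values
codes, scale = ternarize(weights)
packed = pack5(codes)
restored = unpack5(packed, len(codes))
bits_lower_bound = math.log2(3)  # information content of one ternary weight
```

Here six weights occupy two bytes after packing, and `restored` matches `codes` exactly. The packing scheme is one possible storage layout chosen for illustration; the memory figures reported for TernaryLM may rest on a different format.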
Key Features of TernaryLM
- Quantization-Aware Training: Unlike conventional post-training quantization, which compresses an already trained model, TernaryLM learns quantization-aware representations during training itself. It does so using straight-through estimators combined with adaptive per-layer scaling factors.
- Stable Optimization: The model reaches a validation perplexity of 58.42 on the TinyStories dataset, with a cross-seed standard deviation of ±0.17 PPL, indicating that optimization is stable across random seeds.
- Strong Transfer Performance: On downstream tasks, TernaryLM achieves an F1 score of 82.47% on the MRPC benchmark, outperforming DistilBERT while using 55 times less pretraining data.
- Memory Efficiency: The model has a 2.4x smaller memory footprint, using 498 MB compared to 1,197 MB for an FP32 model with the same architecture, with no reported loss in latency.
- Regularization Benefits: The ternary weight constraint acts as an implicit regularizer. TernaryLM shows a train/validation ratio of 1.05x, compared to 3.51x for the FP32 baseline, which suggests that discrete weights may help mitigate overfitting, especially on smaller datasets.
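The quantization-aware training idea from the list above can be sketched for a single linear unit. This is a minimal illustration under stated assumptions, not the TernaryLM training code: the names `ternary_forward` and `ste_step` are hypothetical, the per-layer scale is assumed to be the mean absolute value of the latent weights, the loss is a plain squared error, and the gradient through the scale factor is ignored. The forward pass uses the ternarized weights, while the straight-through estimator treats the quantizer as the identity in the backward pass, so the gradient computed for the quantized weights updates the full-precision latent weights.

```python
def ternary_forward(latent, x):
    """Forward pass of one linear unit using ternarized weights.

    Assumption: the per-layer scale is the mean absolute latent weight.
    """
    scale = sum(abs(w) for w in latent) / len(latent)
    if scale == 0.0:
        codes = [0] * len(latent)
    else:
        codes = [max(-1, min(1, round(w / scale))) for w in latent]
    w_q = [scale * c for c in codes]  # each entry is -scale, 0, or +scale
    y = sum(wq * xi for wq, xi in zip(w_q, x))
    return y, w_q


def ste_step(latent, x, target, lr=0.05):
    """One SGD step on squared error with a straight-through estimator."""
    y, _ = ternary_forward(latent, x)
    loss = (y - target) ** 2
    grad_y = 2.0 * (y - target)
    # dL/dw_q[i] = grad_y * x[i]. The STE approximates d(w_q)/d(latent)
    # as 1, so this gradient is applied directly to the latent weights.
    new_latent = [w - lr * grad_y * xi for w, xi in zip(latent, x)]
    return new_latent, loss


latent = [0.4, -0.1, 0.8]   # made-up full-precision latent weights
x = [1.0, 2.0, -1.0]        # made-up input
y0, w_q0 = ternary_forward(latent, x)
updated, loss0 = ste_step(latent, x, target=0.5)
```

Only the latent weights change during training; the ternary codes are recomputed from them on every forward pass, which is why the model can be said to learn under the quantization constraint rather than being compressed afterwards.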
Layer-wise Analysis and Design Principles
The work also includes a layer-wise sparsity analysis. The middle transformer layers (L5-L9) reach a quantization sparsity of 60-62%, meaning that this share of their weights is quantized to zero, while the boundary layers show 45-55% sparsity. This difference provides actionable design principles for allocating precision non-uniformly across the layers of the model.
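Quantization sparsity, as used above, is the fraction of a layer's ternary weights that equal zero. The sketch below shows how such a per-layer report could be computed. The helper `ternary_sparsity`, the layer names, and the weight values are all hypothetical and made up for illustration; they do not reproduce the reported TernaryLM measurements.

```python
def ternary_sparsity(codes):
    """Fraction of ternary codes that are exactly zero."""
    return sum(1 for c in codes if c == 0) / len(codes)


# Hypothetical per-layer ternary codes, invented for this example only.
layers = {
    "layer_0": [1, 0, -1, 1, 0, -1, 1, 0, 1, -1],
    "layer_6": [0, 0, 1, 0, -1, 0, 0, 1, 0, 0],
}

# Map each layer name to its share of zero-valued weights.
report = {name: ternary_sparsity(codes) for name, codes in layers.items()}
```

A report of this kind, computed over the real quantized layers, is the sort of input a non-uniform precision scheme would start from.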
The implementation and trained models of TernaryLM are publicly available on GitHub for researchers and developers interested in memory-efficient language modeling.
In conclusion, TernaryLM shows that a language model trained natively with ternary weights can combine competitive quality with a substantially smaller memory footprint, which makes deployment in resource-constrained settings more practical.
