HiFloat4 Format for Language Model Pre-training on Ascend NPUs
arXiv:2604.08826v1 (announce type: cross)
Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques.
Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be applied successfully to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines.
Introduction
In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision.
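To make the MXFP4 baseline concrete, here is a minimal sketch of MXFP4-style "fake quantization" (quantize, then dequantize) in plain Python. It assumes the layout described in the OCP Microscaling specification: blocks of 32 elements, FP4 E2M1 element values, and one shared power-of-two (E8M0) scale per block. The function names and the tie-breaking rule are illustrative assumptions, not the implementation used in this work, and HiFloat4 is not modeled because its bit layout is not given here.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2     # exponent of the largest power of two in the grid (4 = 2**2)
BLOCK_SIZE = 32   # number of elements sharing one scale (assumed MXFP4 layout)


def quantize_block(block):
    """Return (shared_exp, codes) for one block; codes are signed grid values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale, clamped to the E8M0 exponent range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6
        # Nearest grid point; ties resolve to the smaller magnitude
        # (a simplification, hardware rounding modes may differ).
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def mxfp4_fake_quant(values, block_size=BLOCK_SIZE):
    """Quantize then dequantize a flat list of floats, block by block."""
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    out = []
    for start in range(0, len(values), block_size):
        shared_exp, codes = quantize_block(values[start:start + block_size])
        out.extend(c * 2.0 ** shared_exp for c in codes)
    return out
```

In this sketch, a block whose largest magnitude is 7.9 gets a shared exponent of 0, so 7.9 saturates to 6.0 and a neighboring 2.4 rounds to 2.0. This kind of saturation and coarse rounding is what any FP4 format has to manage.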
Methodology
We evaluate both dense architectures and mixture-of-experts (MoE) models; in the MoE models, both standard linear layers and expert-specific GEMMs operate in FP4. The dense architectures include Pangu and LLaMA-style models.
Stabilization Techniques
We also explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation. These techniques keep the relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation.
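The 1% criterion can be written as a simple check. The sketch below assumes the relative error is measured on a scalar metric such as training loss, compared point by point against a full-precision run; the text above does not specify the metric, so this choice and the helper names `relative_error` and `within_tolerance` are assumptions for illustration only.

```python
def relative_error(low_precision, baseline):
    """Relative deviation of a low-precision metric from its baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(low_precision - baseline) / abs(baseline)


def within_tolerance(low_precision_series, baseline_series, tol=0.01):
    """True if every paired measurement stays within `tol` (1% by default)."""
    if len(low_precision_series) != len(baseline_series):
        raise ValueError("series must have the same length")
    return all(
        relative_error(lp, base) <= tol
        for lp, base in zip(low_precision_series, baseline_series)
    )
```

Comparing at matched training steps (or matched token counts) keeps the check meaningful, since loss values from different points in training are not directly comparable.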
Results
Our experiments form a comprehensive empirical study of FP4 training on NPUs. Key findings:
- The HiFloat4 format offers notable advantages over MXFP4 in terms of computational efficiency.
- Both dense and MoE architectures showed reduced memory usage while staying close to full-precision performance.
- Stabilization techniques were effective in mitigating numerical issues, making FP4 a viable option for large-scale training.
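The memory finding can be put in rough numbers with a short calculation. The sketch below counts effective storage bits per element for a block-scaled 4-bit format, assuming the MXFP4 layout of 4-bit elements plus one 8-bit shared scale per 32 elements, and compares it with a 16-bit baseline. The helper names are hypothetical, and the overhead of HiFloat4's own scaling metadata is not stated here, so it is not included.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Effective storage cost per element, amortizing the shared block scale."""
    return element_bits + scale_bits / block_size


def compression_ratio(baseline_bits, quantized_bits):
    """How many times smaller the quantized tensor is than the baseline."""
    return baseline_bits / quantized_bits


# Assumed MXFP4 layout: 4-bit elements, 8-bit shared scale, 32-element blocks.
mxfp4_bits = bits_per_element(element_bits=4, scale_bits=8, block_size=32)
ratio_vs_16bit = compression_ratio(16, mxfp4_bits)
print(f"{mxfp4_bits} bits/element, {ratio_vs_16bit:.2f}x smaller than 16-bit")
```

Under these assumptions the cost is 4.25 bits per element, which is why savings relative to a 16-bit baseline fall slightly below 4x. The figure covers only tensors actually stored in FP4; any higher-precision copies kept during training add to the total.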
Conclusion
Our investigation of the HiFloat4 format for training language models on Ascend NPUs shows significant potential for reducing the computational burden of large foundation models. With low-precision formats such as FP4, training efficiency and memory utilization improve while model performance stays close to that of higher-precision counterparts.
These findings contribute to the ongoing efforts in the machine learning community to develop more efficient training paradigms, which are essential as the scale of models continues to grow.
