HiFloat4 FP4 Format Boosts Language Model Training on Ascend NPUs

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Summary: arXiv:2604.08826v1

Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques.

Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines.
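
To make the block-scaled FP4 idea concrete, here is a minimal Python sketch of MXFP4-style quantization. It assumes the publicly documented MXFP4 layout (E2M1 elements, blocks of 32 values sharing one power-of-two scale). The function names are hypothetical, ties round toward the smaller magnitude for simplicity, and the limited range of the shared scale is ignored. It does not model HiFloat4, whose encoding is not described in this post.

```python
import math

# Magnitudes representable by the E2M1 element type used in MXFP4.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # MXFP4 shares one power-of-two scale per 32 elements


def quantize_block(block):
    """Fake-quantize one block: return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Power-of-two scale chosen so the block maximum lands near the top
    # of the E2M1 range (largest magnitude 6.0 = 1.5 * 2**2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest, x) * scale)
    return scale, out


def quantize_mxfp4(values):
    """Fake-quantize a flat list of floats, block by block."""
    result = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[start:start + BLOCK_SIZE])
        result.extend(deq)
    return result
```

For example, `quantize_mxfp4([0.3, 4.0])` returns `[0.5, 4.0]`: the block scale is 1, the value 4.0 is exactly representable, and 0.3 rounds to the nearest grid point, 0.5.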

Introduction

In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision.
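
As a rough illustration of what running a GEMM in FP4 means numerically, the sketch below fake-quantizes both operands along the reduction dimension and accumulates the products in ordinary floats. Everything here is an assumption for illustration: the function names are hypothetical, the grid is the E2M1 value set used by MXFP4, and the single absmax scale per vector is a simplification that matches neither the MXFP4 nor the HiFloat4 scaling scheme. Accumulating in higher precision is likewise an assumption rather than a detail taken from the paper.

```python
# E2M1 magnitudes (the 4-bit element grid used by MXFP4).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant(vec):
    """Round a vector onto a scaled FP4 grid (one absmax scale per vector)."""
    amax = max(abs(x) for x in vec)
    if amax == 0.0:
        return list(vec)
    scale = amax / FP4_GRID[-1]  # simplified: real formats restrict the scale
    return [
        (1 if x >= 0 else -1) * scale
        * min(FP4_GRID, key=lambda g: abs(g - abs(x) / scale))
        for x in vec
    ]


def fp4_matmul(a, b):
    """Multiply a (m x k) by b (k x n) with both operands fake-quantized."""
    a_q = [fake_quant(row) for row in a]             # quantize rows along k
    b_cols_q = [fake_quant(col) for col in zip(*b)]  # quantize columns along k
    # Products are accumulated in ordinary Python floats (higher precision).
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols_q]
            for row in a_q]
```

Comparing `fp4_matmul(a, b)` against an unquantized product of the same matrices gives a direct view of the rounding error that FP4 operands introduce.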

Methodology

We evaluate both dense architectures and mixture-of-experts (MoE) models; in the MoE models, both the standard linear layers and the expert-specific GEMMs operate in FP4. The dense architectures include Pangu and LLaMA-style models, which are widely used in language modeling.

Stabilization Techniques

We also explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation. These techniques keep the relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation.
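
The 1% criterion can be written as a simple check. The sketch below assumes the compared quantity is a scalar such as the final training loss, which this summary does not specify; the function names and the loss values are hypothetical.

```python
def relative_error(candidate, baseline):
    """Return |candidate - baseline| / |baseline|."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(candidate - baseline) / abs(baseline)


def within_tolerance(candidate, baseline, tol=0.01):
    """True when the candidate is within `tol` (default 1%) of the baseline."""
    return relative_error(candidate, baseline) <= tol


# Hypothetical loss values, for illustration only (not results from the paper).
baseline_loss = 2.000
fp4_loss = 2.015
print(within_tolerance(fp4_loss, baseline_loss))
```

With these illustrative numbers the relative error is 0.015 / 2.000 = 0.75%, which falls inside the 1% band.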

Results

Our results provide a comprehensive empirical study of FP4 training on NPUs, highlighting several key findings:

The HiFloat4 format offers notable advantages over MXFP4 in terms of computational efficiency.
Both dense and MoE architectures used less memory while staying close to full-precision performance.
Stabilization techniques were effective in mitigating numerical issues, allowing FP4 to be a viable option for large-scale training.

Conclusion

In conclusion, our investigation into the HiFloat4 format for training language models on Ascend NPUs reveals significant potential for reducing the computational burden associated with large foundation models. By leveraging low-precision formats like FP4, researchers can achieve remarkable improvements in both training efficiency and memory utilization while maintaining model performance close to that of higher-precision counterparts.

These findings contribute to the ongoing efforts in the machine learning community to develop more efficient training paradigms, which are essential as the scale of models continues to grow.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
