HiFloat4 FP4 Format Boosts Language Model Training on Ascend NPUs

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Source: arXiv:2604.08826v1 (cross-listed)

Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques.

Recent work has demonstrated that 4-bit floating-point (FP4) formats—such as MXFP4 and NVFP4—can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines.
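To make the block-scaled FP4 idea concrete, here is a minimal sketch of MXFP4-style "fake quantization" in plain Python. It assumes the OCP Microscaling layout (E2M1 elements, one shared power-of-two scale per 32-element block). The function names are hypothetical, and the sketch does not attempt to model HiFloat4, whose encoding is not described in this article.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block(block):
    """Fake-quantize one block; returns (shared scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale that maps the block maximum near the top
    # of the E2M1 range.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest representable magnitude. Ties resolve toward the smaller
        # value here, which is a simplification of round-to-nearest-even.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return scale, out


def quantize(values):
    """Fake-quantize a flat list of floats block by block."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        _, dequantized = quantize_block(values[i:i + BLOCK_SIZE])
        out.extend(dequantized)
    return out
```

Under this layout each 32-element block carries one 8-bit scale, so storage works out to 4 + 8/32 = 4.25 bits per element, which is where the roughly 4x saving over 16-bit formats comes from.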

Introduction

In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision.

Methodology

We evaluate both dense architectures and mixture-of-experts (MoE) models, with standard linear layers and expert-specific GEMMs all operating in FP4. The dense architectures include Pangu and LLaMA-style models.

Stabilization Techniques

We also explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation. These techniques keep the relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation.
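The 1% criterion can be expressed as a simple relative-error check. The sketch below assumes the compared quantity is a scalar metric such as training loss, which the article does not specify. The function names and the numbers are hypothetical and are not results from the paper.

```python
def relative_error(value, reference):
    """Relative deviation of a low-precision metric from its reference."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(value - reference) / abs(reference)


def within_tolerance(value, reference, tol=0.01):
    """True if value is within tol (default 1%) of the reference."""
    return relative_error(value, reference) <= tol


# Hypothetical loss values, for illustration only.
fp4_loss, full_precision_loss = 2.018, 2.000
print(within_tolerance(fp4_loss, full_precision_loss))
```

A check of this kind could be run at regular intervals during training to flag the point at which an FP4 run starts to drift from its full-precision counterpart.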

Results

Our results provide a comprehensive empirical study of FP4 training on NPUs, highlighting several key findings:

  • The HiFloat4 format offers notable advantages over MXFP4 in terms of computational efficiency.
  • Both dense models and MoE architectures showed reduced memory usage while staying close to full-precision performance.
  • Stabilization techniques were effective in mitigating numerical issues, allowing FP4 to be a viable option for large-scale training.

Conclusion

In conclusion, our investigation of the HiFloat4 format for training language models on Ascend NPUs shows significant potential for reducing the computational burden of large foundation models. With low-precision formats like FP4, researchers can improve both training efficiency and memory utilization while keeping model performance close to that of higher-precision counterparts.

These findings contribute to the ongoing efforts in the machine learning community to develop more efficient training paradigms, which are essential as the scale of models continues to grow.

Lazarus Omolua (https://richlyai.com/blog)