Scalable Pretraining of Large MoE Language Models on Aurora

Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer

Pretraining Large Language Models (LLMs) from scratch is one of the most compute-intensive workloads in artificial intelligence. A recent study, arXiv:2604.00785v1, reports on pretraining at scale on the Aurora supercomputer, an exascale machine equipped with 127,488 Intel Ponte Vecchio (PVC) GPU tiles.

Introduction

Pretraining an LLM from scratch demands an immense amount of compute. The study uses Aurora to examine how well training scales and demonstrates training runs that span thousands of GPU tiles.

Key Developments

Central to the research is Optimus, an in-house training library that supports the standard techniques for large-model training. Using it, the researchers pretrained the following models:

Mula-1B: A 1-billion-parameter dense model pretrained on 3,072 GPU tiles using the full 4 trillion tokens of the OLMoE-mix-0924 dataset.
Mula-7B-A1B: A 7-billion-parameter Mixture of Experts (MoE) model, also pretrained from scratch on the same dataset.
Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B: Three larger MoE models, each pretrained to 100 billion tokens to demonstrate that the training process scales.

Computational Efficiency

The researchers scaled the training of their largest model, Mula-220B-A10B, from 384 to 12,288 GPU tiles, a 32-fold increase, and report a scaling efficiency of around 90% at the largest tile count. High scaling efficiency matters because it means that added hardware translates almost proportionally into shorter training time.
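To make the 90% figure concrete, here is a small sketch of the arithmetic. It assumes the common throughput-based definition of scaling efficiency (achieved speedup divided by ideal linear speedup); the study may define it differently, and the function names are illustrative only.

```python
def scaling_efficiency(base_tiles, base_throughput, tiles, throughput):
    """Achieved speedup divided by ideal (linear) speedup.

    Assumed definition: throughput is any rate that should grow with
    hardware, for example tokens processed per second.
    """
    ideal_speedup = tiles / base_tiles
    achieved_speedup = throughput / base_throughput
    return achieved_speedup / ideal_speedup


def implied_speedup(base_tiles, tiles, efficiency):
    """Speedup over the baseline implied by a given scaling efficiency."""
    return efficiency * tiles / base_tiles


# Tile counts and the ~90% figure come from the article; the rest is arithmetic.
BASE_TILES = 384
MAX_TILES = 12_288

speedup = implied_speedup(BASE_TILES, MAX_TILES, 0.90)
print(f"ideal speedup: {MAX_TILES / BASE_TILES:.0f}x")       # 32x
print(f"implied speedup at 90% efficiency: {speedup:.1f}x")  # 28.8x
```

In other words, under this definition a 32-fold increase in tiles at about 90% efficiency corresponds to roughly a 28.8-fold increase in throughput over the 384-tile baseline.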

Performance Improvements

Custom GPU kernels for expert computation significantly improved the runtime performance of the MoE models. In addition, a novel EP-Aware sharded optimizer contributed to training speedups of up to 1.71x.
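For readers new to MoE models, the sketch below illustrates what "expert computation" generally refers to: each token is routed to an expert, tokens are grouped per expert, each expert processes its group as one batch, and the results are scattered back to their original positions. This is a generic pure-Python illustration assuming top-1 routing with scalar stand-ins for hidden vectors; it is not the study's kernels or the Optimus API, and `moe_forward` is a hypothetical name.

```python
from collections import defaultdict


def moe_forward(tokens, assignments, experts):
    """Apply each token's assigned expert, one batch per expert.

    tokens:      list of floats (stand-ins for hidden vectors)
    assignments: expert index chosen by the router for each token
                 (top-1 routing is an assumption of this sketch)
    experts:     list of callables mapping a batch (list) to a list
    """
    # Group token positions by the expert they were routed to.
    groups = defaultdict(list)
    for position, expert_id in enumerate(assignments):
        groups[expert_id].append(position)

    # Run each expert once on its whole group, then scatter results back.
    output = [None] * len(tokens)
    for expert_id, positions in groups.items():
        batch = [tokens[p] for p in positions]
        results = experts[expert_id](batch)
        for p, r in zip(positions, results):
            output[p] = r
    return output


# Two toy experts: one doubles its inputs, the other adds 10.
experts = [
    lambda batch: [2.0 * x for x in batch],
    lambda batch: [x + 10.0 for x in batch],
]
tokens = [1.0, 2.0, 3.0, 4.0]
assignments = [0, 1, 1, 0]

print(moe_forward(tokens, assignments, experts))  # [2.0, 12.0, 13.0, 8.0]
```

The group sizes vary from step to step with the router's decisions, which is one reason this stage is a natural target for specialized kernels.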

Reliability and Fault Tolerance

The Optimus library also includes reliability and fault-tolerance features aimed at improving training stability and continuity at scale. These features matter for long-running training jobs, which are prone to interruptions.
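The article does not describe how Optimus implements fault tolerance. As a generic illustration of one common building block, the sketch below shows checkpoint-and-resume with atomic writes, using only the Python standard library. All names (`save_checkpoint`, `load_checkpoint`, `train`) and the toy "training step" are hypothetical and are not taken from Optimus.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a real step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=10, checkpoint_every=5, fail_at=7)
    except RuntimeError:
        pass  # the job "crashed"; progress after step 5 is lost
    resumed_from = load_checkpoint(ckpt)["step"]
    final_state = train(ckpt, total_steps=10, checkpoint_every=5)
    print(resumed_from, final_state["step"])  # 5 10
```

The checkpoint interval is a trade-off: frequent checkpoints cost I/O time, while infrequent ones lose more work on each interruption.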

Conclusion

The study shows that the Aurora supercomputer is a capable platform for large-scale LLM pretraining. With the introduction of Optimus and the successful training of models ranging from the dense Mula-1B to the MoE Mula-220B-A10B, the work provides a reference point for future large-scale pretraining efforts. As demand for more capable and efficient LLMs grows, the ability to scale training efficiently across thousands of GPU tiles will become increasingly important.
