Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
Pretraining Large Language Models (LLMs) is one of the most compute-intensive tasks in artificial intelligence. A recent study, arXiv:2604.00785v1, reports on LLM pretraining at scale on the Aurora supercomputer. This exascale machine is equipped with 127,488 Intel Ponte Vecchio (PVC) GPU tiles, which allows LLM training to be scaled across very large numbers of accelerators.
Introduction
Pretraining LLMs from scratch demands an immense amount of computational resources. Aurora provides a platform for this task and lets researchers study how their models and training software scale. The study demonstrates training on Aurora at the scale of thousands of GPU tiles.
Key Developments
Central to the research is Optimus, an in-house training library that supports standard techniques for large model training. Using it, the researchers pretrained the following models:
- Mula-1B: A 1-billion-parameter dense model pretrained on 3072 GPU tiles using the full 4 trillion tokens of the OLMoE-mix-0924 dataset.
- Mula-7B-A1B: A 7-billion-parameter Mixture of Experts (MoE) model, also pretrained from scratch on the same dataset.
- Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B: Three large MoE models pretrained up to 100 billion tokens, demonstrating the scalability of the training process.
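To give a sense of the compute behind these runs, the sketch below applies the widely used rule of thumb that training costs roughly 6 × N × D floating-point operations, where N is the number of parameters active per token and D is the number of training tokens. The parameter counts are nominal values read off the model names, and treating the "A10B" suffix as roughly 10 billion active parameters is an assumption based on common MoE naming, not a figure stated in the study. The function name is hypothetical.

```python
# Back-of-the-envelope training compute, using the ~6 * N * D approximation.
# N = parameters active per token, D = training tokens.

def training_flops(active_params: float, tokens: float) -> float:
    """Approximate total training FLOPs (forward plus backward pass)."""
    return 6.0 * active_params * tokens

# Nominal sizes taken from the model names (assumptions, not reported values).
runs = {
    "Mula-1B": (1e9, 4e12),           # dense, full 4 trillion tokens
    "Mula-220B-A10B": (10e9, 100e9),  # MoE, assumed ~10B active, 100B tokens
}

estimates = {name: training_flops(n, d) for name, (n, d) in runs.items()}
for name, flops in estimates.items():
    print(f"{name}: ~{flops:.1e} FLOPs")
```

These are order-of-magnitude estimates only; they ignore attention cost, router overhead, and hardware utilization.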
Computational Efficiency
The researchers scaled the training of their largest model, Mula-220B-A10B, from 384 to 12,288 GPU tiles, a 32-fold increase. Scaling efficiency remained at around 90% at the maximum GPU tile count. High scaling efficiency matters because it means that adding hardware shortens training time almost proportionally.
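The arithmetic behind that figure can be sketched as follows. The summary does not say exactly how scaling efficiency was measured, so this example assumes the usual definition: achieved throughput gain divided by the ideal, linear gain. The function names are hypothetical.

```python
def scaling_efficiency(base_tiles, base_throughput, tiles, throughput):
    """Achieved speedup divided by ideal (linear) speedup."""
    ideal = tiles / base_tiles
    achieved = throughput / base_throughput
    return achieved / ideal

def implied_speedup(base_tiles, tiles, efficiency):
    """Speedup implied by a given efficiency under the same definition."""
    return efficiency * tiles / base_tiles

ideal = 12288 / 384                           # 32x more GPU tiles
speedup = implied_speedup(384, 12288, 0.90)   # roughly 29x under this definition
print(f"ideal {ideal:.0f}x, implied {speedup:.1f}x")
```

Under this assumed definition, 90% efficiency across a 32-fold increase in tiles corresponds to a throughput gain of a little under 29x over the 384-tile baseline.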
Performance Improvements
The runtime performance of MoE models was improved significantly through custom GPU kernels designed for expert computation. In addition, a novel EP-Aware sharded optimizer contributed to training speedups of up to 1.71x.
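For readers unfamiliar with the term, "expert computation" in an MoE layer generally means sending each token to a small number of experts and combining their outputs. The following is a minimal pure-Python sketch of generic top-k routing for a single token. It is not the Optimus kernels or the study's router: renormalizing the top-k weights is one common variant chosen here as an assumption, and all names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # weights renormalized to sum to 1

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Four toy "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(1.0, [0.0, 0.0, 5.0, 5.0], experts, k=2)
print(y)
```

Only k of the experts run for each token, which is why an MoE model can have many more total parameters than it uses per token.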
Reliability and Fault Tolerance
The Optimus library also includes reliability and fault-tolerance features aimed at improving training stability and continuity at scale. These are essential for long-running training jobs, which are often susceptible to interruptions.
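The summary does not describe which mechanisms Optimus uses. A common building block for training continuity is periodic checkpointing with automatic resume, sketched below as a generic illustration rather than the library's implementation. The function names, the JSON state format, and the simulated failure are all hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then atomically replace the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)

def train(path, total_steps, every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=10, every=5, fail_at=7)  # interrupted run
    except RuntimeError:
        pass
    resumed = train(ckpt, total_steps=10, every=5)       # restarts from step 5
    print(resumed["step"])
```

The interrupted run loses only the work done since the last checkpoint, so the checkpoint interval trades I/O overhead against recomputation after a failure.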
Conclusion
The study shows that the Aurora supercomputer can support large-scale LLM pretraining. With the introduction of Optimus and the successful training of dense and MoE models at several sizes, the work provides a reference point for future large-scale training efforts. As demand for more capable and efficient LLMs grows, scalable and fault-tolerant training infrastructure of this kind will become increasingly important.
