Nemotron 3 Super: A Breakthrough in AI Model Architecture
Nemotron 3 Super, described in the preprint arXiv:2604.12374v1, is a hybrid Mamba-Attention Mixture-of-Experts (MoE) model designed for agentic reasoning. It has 120 billion total parameters, of which 12 billion are active at a time, so only about a tenth of the weights take part in processing any given token.
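To make the gap between total and active parameters concrete, here is a minimal sketch of generic top-k expert routing. It is not the LatentMoE design, whose details are not covered here. The function names, router scores, and toy sizes are all assumptions made for illustration; only the 120 billion and 12 billion figures come from the text above.

```python
from typing import List, Tuple


def top_k_experts(router_scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring experts, in ascending order."""
    # Sort by descending score; ties go to the lower expert index.
    ranked = sorted(range(len(router_scores)), key=lambda i: (-router_scores[i], i))
    return sorted(ranked[:k])


def parameter_counts(shared: int, per_expert: int, n_experts: int, k: int) -> Tuple[int, int]:
    """Return (total, active-per-token) parameter counts for a toy MoE model."""
    total = shared + n_experts * per_expert   # every expert is stored
    active = shared + k * per_expert          # only k experts run per token
    return total, active


# Hypothetical router scores for one token over four experts.
selected = top_k_experts([0.1, 0.7, 0.05, 0.9], k=2)

# Toy sizes only; these are NOT the Nemotron 3 Super configuration.
total, active = parameter_counts(shared=2, per_expert=1, n_experts=8, k=2)

# Ratio implied by the figures quoted in the text: 12B active of 120B total.
active_fraction = 12e9 / 120e9

print(selected, total, active, active_fraction)
```

The point of the sketch is that memory cost follows the total count while per-token compute follows the active count, which is why a sparse model can be large and still cheap to run per token.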
Key Features of Nemotron 3 Super
Nemotron 3 Super introduces three features that are new to the Nemotron 3 family:
- Pre-training in NVFP4: This is the first model in the Nemotron 3 family to be pre-trained in NVFP4, NVIDIA's 4-bit floating-point format. Training at such low precision reduces the memory and compute cost of each step.
- LatentMoE Architecture: LatentMoE is a new Mixture-of-Experts architecture that optimizes for accuracy per floating-point operation (FLOP) and accuracy per parameter.
- MTP Layers for Inference Acceleration: The model includes MTP (multi-token prediction) layers, which accelerate inference through native speculative decoding: several tokens are proposed at once and then verified, instead of being generated strictly one at a time.
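Speculative decoding, mentioned in the last item, can be illustrated with a generic greedy draft-and-verify step. This is a sketch under stated assumptions, not the Nemotron 3 Super implementation: `draft_next` and `target_next` are hypothetical toy functions standing in for a cheap drafter (the role MTP layers would play) and the full model, and a real system would verify all drafted tokens in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one greedy draft-and-verify step and return the newly emitted tokens."""
    # 1. Draft k tokens with the cheap predictor.
    drafted: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    # 2. Verify drafted tokens against the target model, left to right.
    emitted: List[int] = []
    ctx = list(prefix)
    for token in drafted:
        expected = target_next(ctx)
        if expected != token:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(token)
        ctx.append(token)

    # 3. Every draft was accepted, so the target adds one extra token.
    emitted.append(target_next(ctx))
    return emitted


def target_next(ctx: List[int]) -> int:
    """Toy 'full model': count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


partial_accept = speculative_step([0], draft_next, target_next, k=4)
full_accept = speculative_step([5], draft_next, target_next, k=3)
print(partial_accept, full_accept)
```

With greedy acceptance, the emitted tokens are exactly what the target model would have produced on its own. The speedup comes from confirming several tokens per target pass whenever the drafter guesses correctly.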
Training and Performance Metrics
Nemotron 3 Super was pre-trained on 25 trillion tokens and then post-trained with supervised fine-tuning (SFT) and reinforcement learning (RL).
The resulting model supports context lengths of up to 1 million tokens and reaches competitive accuracy on common language-model benchmarks.
Increased Inference Throughput
Nemotron 3 Super also improves inference throughput. The model achieves up to 2.2 times higher inference throughput than GPT-OSS-120B and up to 7.5 times higher than Qwen3.5-122B. For developers and researchers, higher throughput means more tokens served per unit of hardware time.
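As a rough aid for reading these ratios, the sketch below converts a throughput multiple into the fraction of wall-clock time saved on a fixed workload. It assumes the quoted figures are throughput ratios (new divided by baseline) and that all other conditions are identical. The function name is hypothetical, and because the figures are "up to" values, the results are upper bounds rather than typical savings.

```python
def time_saved_fraction(throughput_ratio: float) -> float:
    """Fraction of generation time saved when throughput is multiplied by the ratio."""
    if throughput_ratio <= 0:
        raise ValueError("throughput_ratio must be positive")
    # Time for a fixed number of tokens scales as 1 / throughput.
    return 1.0 - 1.0 / throughput_ratio


vs_gpt_oss = round(time_saved_fraction(2.2), 3)   # 0.545
vs_qwen = round(time_saved_fraction(7.5), 3)      # 0.867
print(vs_gpt_oss, vs_qwen)
```

Under these assumptions, a 2.2 times ratio cuts generation time for the same output by at most about 55%, and a 7.5 times ratio by at most about 87%.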
Open Source Availability
The datasets used to train Nemotron 3 Super, along with the base, post-trained, and quantized checkpoints, are openly available on Hugging Face, so other researchers and developers can inspect, reproduce, and build on the model.
Conclusion
Nemotron 3 Super combines a hybrid Mamba-Attention MoE architecture, NVFP4 pre-training, and native speculative decoding with openly released datasets and checkpoints. Together, its throughput, 1-million-token context, and open availability make it a practical base for further research and applications in agentic reasoning.
