Nemotron 3 Super: Efficient 120B Parameter AI Model

Nemotron 3 Super: A Breakthrough in AI Model Architecture

Nemotron 3 Super, described in the preprint arXiv:2604.12374v1, is a hybrid Mamba-Attention Mixture-of-Experts model designed for agentic reasoning. It has 120 billion total parameters, of which 12 billion are active for any given token, so each forward pass touches only about a tenth of the model's weights. That sparsity is what lets a model of this size process and generate text efficiently.
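To make the "120 billion total, 12 billion active" figure concrete, here is a minimal sketch of generic top-k expert routing, the mechanism most Mixture-of-Experts layers use to activate only a few experts per token. It is not the model's actual router: the function name `top_k_route`, the four-expert setup, and `k=2` are illustrative assumptions, and only the two parameter counts come from the text.

```python
import math

# Parameter counts quoted in the article.
TOTAL_PARAMS = 120e9
ACTIVE_PARAMS = 12e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # share of weights used per token


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Hypothetical helper for illustration; real routers operate on batched tensors.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (shifted by the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy example: four experts, two of them active for this token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(f"active fraction: {active_fraction:.0%}")
print("selected experts and weights:", routes)
```

Only the selected experts run for a given token, which is why compute per token tracks the active parameter count rather than the total.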

Key Features of Nemotron 3 Super

Three design choices set Nemotron 3 Super apart from earlier models in the family:

Pre-training in NVFP4: This is the first model in the Nemotron 3 family to be pre-trained in NVFP4, NVIDIA's 4-bit floating-point format. Training at this precision reduces the memory and compute cost of each step.
LatentMoE Architecture: LatentMoE is a new Mixture-of-Experts architecture that aims to maximize accuracy per floating-point operation (FLOP) and per parameter.
MTP Layers for Inference Acceleration: The model includes MTP (Multi-Token Prediction) layers, which speed up inference through native speculative decoding: several tokens are proposed ahead of time and then verified, rather than being generated strictly one at a time.
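The speculative decoding mentioned in the last item can be sketched in a few lines. This is a simplified greedy draft-and-verify loop, assuming the MTP layers play the role of the cheap draft predictor. The function `speculative_step` and the toy `target_next` and `draft_next` predictors are hypothetical stand-ins, not the model's implementation.

```python
def speculative_step(target_next, draft_next, prefix, num_draft):
    """Run one greedy draft-and-verify step and return the newly accepted tokens.

    target_next / draft_next: callables mapping a token list to the next token.
    """
    # 1. Draft: the cheap predictor proposes num_draft tokens autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify: the target model checks each drafted token in order.
    #    (A real system scores all positions in one batched forward pass.)
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the target contributes one bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Toy predictors: the target counts upward mod 10; the draft errs after a 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial = speculative_step(target_next, draft_next, [1], num_draft=4)
full = speculative_step(target_next, draft_next, [5], num_draft=2)
print(partial, full)
```

In this greedy variant the accepted tokens always match what the target model would have produced by itself, so the speedup comes from verifying several positions per target pass rather than from changing the output.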

Training and Performance Metrics

Nemotron 3 Super was pre-trained on 25 trillion tokens and then post-trained with supervised fine-tuning (SFT) and reinforcement learning (RL).

The resulting model supports context lengths of up to 1 million tokens and achieves competitive accuracy on common language-model benchmarks.

Increased Inference Throughput

Nemotron 3 Super delivers up to 2.2 times higher inference throughput than GPT-OSS-120B and up to 7.5 times higher throughput than Qwen3.5-122B. Both figures are upper bounds, so the gain in a particular deployment may be smaller.
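As a rough illustration of what these ratios mean in practice, the sketch below converts a throughput ratio into the share of time needed for the same token workload. It assumes the ratio holds uniformly across the workload, which the quoted upper-bound figures do not guarantee. The helper `relative_wall_time` is hypothetical.

```python
# Upper-bound throughput ratios quoted in the text (Nemotron 3 Super vs. baseline).
SPEEDUP_OVER = {"GPT-OSS-120B": 2.2, "Qwen3.5-122B": 7.5}


def relative_wall_time(speedup: float) -> float:
    """Fraction of the baseline's time needed to produce the same number of tokens."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


for baseline, speedup in SPEEDUP_OVER.items():
    frac = relative_wall_time(speedup)
    print(f"vs {baseline}: {frac:.1%} of the baseline's time at best")
```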

Open Source Availability

The training datasets for Nemotron 3 Super, along with the base, post-trained, and quantized checkpoints, are openly available on Hugging Face. Researchers and developers can therefore inspect the data, reproduce results, and build on the model directly.

Conclusion

Nemotron 3 Super combines a hybrid Mamba-Attention Mixture-of-Experts architecture with NVFP4 pre-training, LatentMoE, and MTP-based speculative decoding. Together with its 1-million-token context window, its throughput advantage over comparable models, and its open release, this makes it a practical base for further research and applications in agentic reasoning.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
