Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Summary: arXiv:2512.13607v2
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging.
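One simplified way to see why this heterogeneity can slow training is to assume a synchronous RL step that waits for the slowest sample in its batch. The sketch below compares a blended schedule with a domain-wise one under that assumption. The domain names, latencies, and step counts are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
# Hypothetical per-sample latency in seconds (generation plus verification).
# These values are illustrative assumptions, not figures from the paper.
LATENCY = {"short_chat": 2.0, "math": 10.0, "code": 30.0}


def step_time(domains_in_batch):
    """Assumed synchronous step: the batch waits for its slowest sample."""
    return max(LATENCY[d] for d in domains_in_batch)


def blended_time(steps_per_domain):
    """Every step mixes all domains, so every step pays the slowest latency."""
    total_steps = sum(steps_per_domain.values())
    return total_steps * step_time(steps_per_domain.keys())


def domain_wise_time(steps_per_domain):
    """Each step holds a single domain, so only slow domains pay slow steps."""
    return sum(n * step_time([d]) for d, n in steps_per_domain.items())


steps = {"short_chat": 100, "math": 100, "code": 100}
blended = blended_time(steps)
domain_wise = domain_wise_time(steps)
```

Under these toy assumptions the blended schedule spends most of its time waiting on the slowest domain. Real systems can overlap generation and verification, so the gap in practice depends on the infrastructure.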
Introduction to Cascade RL
In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop Nemotron-Cascade, which can operate in both instruct and deep thinking modes without any performance gap relative to a thinking-only counterpart. Unlike conventional approaches that blend heterogeneous prompts from different domains, Cascade RL trains on one domain at a time.
Key Features of Nemotron-Cascade
- Sequential, Domain-Wise RL: Cascade RL runs RL one domain at a time, in sequence, which reduces engineering complexity.
- State-of-the-Art Performance: The model delivers state-of-the-art performance across a wide range of benchmarks.
- Enhanced Reasoning Abilities: Reinforcement Learning from Human Feedback (RLHF) for alignment, when used as a pre-step, boosts the model’s reasoning ability well beyond simple preference optimization.
- Robust Performance Maintenance: Subsequent domain-wise stages of RL with verifiable rewards (RLVR) rarely degrade the benchmark performance achieved in earlier domains and may even improve it.
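The sequential structure described in the list above can be sketched as a loop over stages, with all benchmarks re-evaluated after each stage to check whether a later stage degraded an earlier one. This is a minimal toy sketch, not the paper's training code: the `Stage` type, the stage names and order, and the "training" that simply bumps a number are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in for model weights: one "skill" score per domain (assumption).
Checkpoint = Dict[str, float]
History = List[Tuple[str, Dict[str, float]]]


@dataclass
class Stage:
    name: str                                  # hypothetical stage label
    train: Callable[[Checkpoint], Checkpoint]  # runs RL on one domain only


def make_toy_stage(name: str, gain: float) -> Stage:
    """Build a stage whose 'training' only raises its own domain's score."""
    def train(ckpt: Checkpoint) -> Checkpoint:
        updated = dict(ckpt)
        updated[name] = updated.get(name, 0.0) + gain
        return updated
    return Stage(name=name, train=train)


def run_cascade(ckpt: Checkpoint, stages: List[Stage],
                evaluate: Callable[[Checkpoint], Dict[str, float]]):
    """Train stage by stage and record all benchmark scores after each stage."""
    history: History = []
    for stage in stages:
        ckpt = stage.train(ckpt)
        history.append((stage.name, evaluate(ckpt)))
    return ckpt, history


def find_regressions(history: History, tolerance: float = 0.0):
    """List (stage, benchmark, drop) wherever a score fell versus the prior stage."""
    regressions = []
    for (_, prev), (name, curr) in zip(history, history[1:]):
        for bench, score in curr.items():
            drop = prev.get(bench, score) - score
            if drop > tolerance:
                regressions.append((name, bench, drop))
    return regressions


# Placeholder stage order: an alignment pre-step followed by two domains.
stages = [make_toy_stage("rlhf", 1.0),
          make_toy_stage("math", 2.0),
          make_toy_stage("code", 3.0)]
final_ckpt, history = run_cascade({}, stages, evaluate=lambda c: dict(c))
```

Because each stage owns its own `train` callable, per-domain settings such as response length limits or verifier choice can live inside that stage rather than in one shared configuration, which is one plausible reading of the reduced engineering complexity noted above.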
Performance Metrics
Our 14B model, after RL, outperforms its supervised fine-tuning (SFT) teacher, DeepSeek-R1-0528, on benchmarks including LiveCodeBench v5, v6, and Pro. It also achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI).
Conclusion and Future Work
Nemotron-Cascade shows that Cascade RL is a workable way to build general-purpose reasoning models. Training one domain at a time addresses the difficulties that heterogeneous response lengths and verification latency create for RL infrastructure, training curriculum, and hyperparameter selection.
We will share our training recipes and data methodologies to support further exploration in the field.
