Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning
Summary: arXiv:2603.09697v2
Abstract: Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions.
We argue that this “egalitarian” constraint is suboptimal for Deep Neural Networks, where the curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions.
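To make the "uniform spectral update norm" concrete, here is a minimal NumPy sketch of the idealized Muon step, which replaces the gradient matrix with its orthogonal polar factor. Two things here are assumptions for illustration and not taken from the abstract: the helper name `polar_factor` is hypothetical, and an exact SVD is swapped in for the iterative orthogonalization that practical Muon implementations use. The toy gradient is synthetic.

```python
import numpy as np

def polar_factor(G):
    """Orthogonal polar factor U @ Vt of G, computed with a thin SVD.

    Illustrative stand-in (assumption): an exact SVD replaces the
    approximate iterative orthogonalization used in practice.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
# Synthetic gradient whose columns are scaled very unevenly.
G = rng.standard_normal((6, 4)) @ np.diag([100.0, 10.0, 1.0, 0.1])

update = polar_factor(G)

# The gradient's singular values span several orders of magnitude...
grad_singular_values = np.linalg.svd(G, compute_uv=False)
# ...but every singular value of the update is 1 (up to rounding).
update_singular_values = np.linalg.svd(update, compute_uv=False)
print(grad_singular_values)
print(update_singular_values)
```

Whatever the spread in the gradient's spectrum, the update moves equally far along every singular direction. This is the isotropy the abstract calls "egalitarian".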
Introduction to Mousse
In this work, we propose Mousse (Muon Optimization Utilizing Shampoo’s Structural Estimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. It targets Muon's isotropy assumption directly: instead of applying the same spectral step in every direction, it adapts the step to the estimated curvature.
Key Features of Mousse
Three design choices distinguish Mousse from Muon:
- Anisotropic Trust Region: Unlike Muon, Mousse formulates the update as a solution to a spectral steepest descent problem constrained by an anisotropic trust region.
- Whitened Coordinate System: Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics derived from Shampoo, which enhances stability during optimization.
- Polar Decomposition: The optimal update is derived via the polar decomposition of the whitened gradient, so the step taken in each direction reflects the estimated curvature.
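The three points above can be read together as one update rule. The sketch below is a hedged NumPy illustration, not the paper's implementation. It assumes the trust region has the form ||L^p Δ R^p||_op ≤ lr, where L and R are Shampoo-style left and right Kronecker factors. Under that assumption, minimizing the linearized loss ⟨G, Δ⟩ gives Δ = -lr · L^(-p) · polar(L^(-p) G R^(-p)) · R^(-p). The exponent p = 1/4, the damping `eps`, and the function names `inv_root_psd`, `polar_factor`, and `mousse_like_update` are all hypothetical choices made for this example.

```python
import numpy as np

def inv_root_psd(M, power, eps=1e-8):
    """Return (M + eps*I)^(-power) for a symmetric PSD matrix M."""
    w, Q = np.linalg.eigh(M)
    w = np.maximum(w, 0.0) + eps       # clip rounding noise, then damp
    return (Q * w ** (-power)) @ Q.T   # Q @ diag(w^-power) @ Q.T

def polar_factor(G):
    """Orthogonal polar factor of G via a thin SVD (exact, for clarity)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def mousse_like_update(G, L, R, lr=1.0, power=0.25):
    """Hypothetical sketch: whiten, orthogonalize, map back.

    G: gradient matrix (m x n).
    L, R: assumed Kronecker-factored statistics (m x m and n x n, PSD).
    power: assumed whitening exponent; not specified in the abstract.
    """
    Lp = inv_root_psd(L, power)
    Rp = inv_root_psd(R, power)
    whitened = Lp @ G @ Rp             # gradient in whitened coordinates
    return -lr * Lp @ polar_factor(whitened) @ Rp

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
# Toy statistics: random PSD matrices standing in for accumulated factors.
A, B = rng.standard_normal((5, 5)), rng.standard_normal((3, 3))
L, R = A @ A.T + np.eye(5), B @ B.T + np.eye(3)
step = mousse_like_update(G, L, R, lr=0.1)
```

With identity statistics the sketch reduces to the plain polar-factor step, which matches the description of Muon as the isotropic special case. With non-trivial L and R, the step is uniform only in the whitened coordinates and is rescaled direction by direction once mapped back.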
Empirical Results
To validate Mousse, we evaluated it on language models ranging from 160M to 800M parameters. The results indicate:
- Performance Improvement: Mousse consistently outperformed Muon in training efficiency.
- Reduction in Training Steps: The optimizer achieved an approximate 12% reduction in training steps.
- Negligible Computational Overhead: The implementation of Mousse incurs minimal additional computational costs, making it a practical choice for large-scale models.
Conclusion
Mousse addresses the isotropic constraint of spectral methods such as Muon by adapting the update to the geometry of the optimization landscape. Across language models from 160M to 800M parameters, it reduced training steps by roughly 12% at negligible computational overhead, making it a promising tool for researchers and practitioners looking to improve training efficiency.
