Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts
arXiv:2604.04281v1
Abstract
Width expansion offers a practical route to reuse smaller causal-language-model checkpoints, but selecting a widened warm start is not solved by zero-step preservation alone. In this study, we treat dense width growth as a candidate-selection problem over full training states, each of which comprises copied weights, optimizer moments, and scheduler state.
Introduction
As the demand for efficient language models continues to rise, researchers are investigating methods to enhance existing smaller checkpoints without compromising performance. The ability to expand the width of these models presents a promising avenue for improvement. However, this process is nuanced and requires careful consideration beyond simple preservation techniques.
Methodology
In our study, we conducted experiments using a small-scale TinyStories proxy to compare various warm start strategies. These strategies include:
- Exact-copy warm starts
- Perturbative warm starts
- Asymmetric-reset warm starts
- Structured non-clone warm starts
Each strategy was evaluated under matched continuation budgets to determine how well it supports dense width growth.
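To make the first two strategies concrete, the following is a minimal sketch of exact-copy and perturbative widening on a toy two-layer MLP. It is an illustration in the style of Net2WiderNet, not the procedure used in the study: the helper names (`widen_exact_copy`, `widen_perturbed`), the choice to double the hidden width, and the Gaussian noise on the cloned rows are all assumptions. It covers weights only; optimizer moments and scheduler state would need their own handling.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Toy two-layer MLP: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def widen_exact_copy(W1, W2):
    """Double the hidden width by cloning every hidden unit (hypothetical helper).

    Incoming rows are copied as-is. Outgoing columns are halved and copied, so
    each original/clone pair contributes the same total as the original unit
    and the network output is unchanged at step zero.
    """
    W1_wide = [row[:] for row in W1] + [row[:] for row in W1]
    W2_wide = [[0.5 * w for w in row] + [0.5 * w for w in row] for row in W2]
    return W1_wide, W2_wide

def widen_perturbed(W1, W2, scale, seed=0):
    """Exact-copy widening plus Gaussian noise on the cloned incoming rows.

    The noise breaks the symmetry between originals and clones, at the cost
    of no longer preserving the output exactly (assumed variant).
    """
    rng = random.Random(seed)
    W1_wide, W2_wide = widen_exact_copy(W1, W2)
    n_original = len(W1)
    for row in W1_wide[n_original:]:
        for j in range(len(row)):
            row[j] += rng.gauss(0.0, scale)
    return W1_wide, W2_wide
```

With exact copying, `mlp(W1_wide, W2_wide, x)` matches `mlp(W1, W2, x)` for any input, which is what zero-step preservation measures. The perturbed variant trades some of that preservation for clones that no longer receive identical gradients.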
Evaluation Metrics
We employed several evaluation metrics to gauge performance, including:
- Zero-step preservation
- Short-lag probe metrics
- Downstream continuation utility
These metrics were measured in both deterministic and stochastic regimes, so that each warm start could be compared across training conditions rather than in a single setting.
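As an illustration of the first metric, zero-step preservation can be read as the change in loss caused by widening alone, before any optimizer step is taken. The sketch below assumes that reading; the function names and the use of mean cross-entropy on a fixed probe batch are assumptions, not the study's stated definition.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mean_loss(batch_logits, targets):
    """Mean cross-entropy over a probe batch of per-token logits."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

def zero_step_gap(base_logits, wide_logits, targets):
    """Absolute change in mean loss caused by widening, before any training step.

    A gap of zero means the widened model reproduces the base model's loss on
    this probe batch (assumed definition of zero-step preservation).
    """
    return abs(mean_loss(wide_logits, targets) - mean_loss(base_logits, targets))
```

A candidate with a gap of zero is perfectly preserving by this measure, but, as the results indicate, that alone does not determine how it ranks after continued training.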
Results
The results show that no single warm-start strategy wins everywhere. Exact-copy symmetric warm starts ranked first in every completed 16-step probe. They also ranked first in stochastic 128-step continuations at seed-0 steps 1000 and 2000, and at the reduced seed-1 step 2000.
Conversely, the structured non-clone challenger excelled in deterministic 128-step continuation scenarios. This indicates that while early escape from the inherited cloned subspace can be beneficial for long deterministic continuations, it does not universally dictate success across all conditions.
Conclusion
The findings of this study suggest that preservation is not a universal criterion for ranking warm starts in the context of dense width growth. Instead, the best choice of warm start depends on both the regime (deterministic or stochastic) and the lag budget. This understanding can help guide future efforts to grow language models from smaller checkpoints.
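One practical reading of this conclusion is that candidates should be ranked per condition rather than by one global score. The sketch below groups continuation results by regime and lag budget and picks the lowest-loss candidate within each group. The record layout, the function names, and the use of loss as the ranking key are assumptions for illustration, and no values from the study are included.

```python
from collections import defaultdict

def rank_by_condition(records):
    """Rank candidates separately for each (regime, lag_steps) condition.

    Each record is a tuple (candidate, regime, lag_steps, loss).
    Returns {(regime, lag_steps): [candidate, ...]} with the lowest loss first.
    """
    groups = defaultdict(list)
    for candidate, regime, lag_steps, loss in records:
        groups[(regime, lag_steps)].append((loss, candidate))
    return {
        key: [candidate for _, candidate in sorted(entries)]
        for key, entries in groups.items()
    }

def select_warm_start(records, regime, lag_steps):
    """Return the best candidate for one condition (hypothetical helper)."""
    ranking = rank_by_condition(records)
    key = (regime, lag_steps)
    if key not in ranking:
        raise KeyError(f"no results recorded for condition {key}")
    return ranking[key][0]
```

Keeping the regime and lag budget as part of the lookup key means a candidate that wins in one condition is never assumed to win in another, which is the point the results support.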
Implications for Future Research
As researchers continue to explore the complexities of language model training, our results underscore the importance of selecting warm starts with an awareness of the specific conditions in which they will be applied. Future work may delve deeper into the mechanisms behind these findings and examine additional candidate-selection strategies that can further optimize the performance of causal-language-model checkpoints.
