Time is Not Compute: Scaling Laws for Wall-Clock Constrained Training on Consumer GPUs
arXiv:2603.28823v1 (cross-listed)
Abstract
Scaling laws generally relate model quality to compute budget (measured in FLOPs), but practitioners are often constrained by wall-clock time rather than by compute. This article examines how to size a model optimally under fixed time budgets ranging from 5 minutes to 24 hours on consumer GPUs such as the RTX 4090. The study spans more than 70 runs, with model sizes ranging from 50 million to 1 billion parameters.
Key Findings
The study reveals several critical insights regarding model training under time constraints:
- U-Shaped Curve: For each time budget, a U-shaped curve is observed. This indicates that models that are too small tend to overfit, while those that are excessively large may undertrain.
- Optimal Model Size: The optimal model size follows $N^* \propto t^{\alpha}$ with $\alpha = 0.60 \pm 0.07$, where $t$ is the wall-clock time budget. As a function of its budget, the optimal size therefore grows faster than under the Chinchilla scaling law, which gives $N^* \propto C^{0.50}$ for a compute budget $C$. The fitted exponent exceeds the compute-optimal value of 0.50 in every sensitivity analysis.
- Dual U-Shape Mechanism: The study introduces a dual U-shape mechanism wherein short-budget U-curves are influenced by compute bottlenecks, while long-budget U-curves arise from data bottlenecks leading to overfitting. An intermediate regime is identified where the U-curve temporarily disappears, highlighting the complexity of model training dynamics.
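To make the exponent concrete, the sketch below fits a power law by least squares in log-log space and compares how the optimal size scales under exponents of 0.60 and 0.50. It is only an illustration: the budgets and relative sizes are synthetic, noise-free values generated from the reported exponent, not measurements from the study, and the function names `fit_power_law` and `size_multiplier` are hypothetical.

```python
import math

def fit_power_law(ts, ns):
    """Least-squares fit of log N = log k + alpha * log t. Returns (alpha, k)."""
    xs = [math.log(t) for t in ts]
    ys = [math.log(n) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    k = math.exp(mean_y - alpha * mean_x)
    return alpha, k

def size_multiplier(budget_ratio, alpha):
    """Factor by which the optimal size grows when the budget grows by budget_ratio."""
    return budget_ratio ** alpha

# SYNTHETIC data (assumption, not from the study): time budgets in minutes and
# optimal sizes expressed relative to the 5-minute optimum, generated with alpha = 0.60.
budgets_min = [5, 30, 120, 480, 1440]
relative_sizes = [(t / 5) ** 0.60 for t in budgets_min]

alpha, k = fit_power_law(budgets_min, relative_sizes)
print(f"recovered exponent: {alpha:.3f}")

# 5 minutes -> 24 hours is a 288x larger budget.
print(f"size growth at exponent 0.60: {size_multiplier(288, 0.60):.1f}x")
print(f"size growth at exponent 0.50: {size_multiplier(288, 0.50):.1f}x")
```

On real data, each optimal size would first have to be located as the minimum of the U-shaped loss curve for its time budget, and the fit would carry uncertainty comparable to the reported ± 0.07.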
Implications for Researchers
These findings carry significant implications for researchers who are training models using consumer hardware. The primary takeaway is that wall-clock time, rather than FLOPs, becomes the binding constraint when optimizing model performance. This shift in focus can lead to more effective training strategies that are better suited to the capabilities of consumer-grade GPUs.
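One way to see how a time budget translates into a data budget is the sketch below. It relies on two assumptions that are not taken from the study: the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token, and a single hypothetical achieved throughput that is the same for every model size. The throughput value is a placeholder, not a measured RTX 4090 figure, and `tokens_in_budget` is an illustrative name.

```python
def tokens_in_budget(n_params, seconds, achieved_flops_per_s):
    """Tokens processed in `seconds`, assuming training cost ~= 6 * N * D FLOPs."""
    return achieved_flops_per_s * seconds / (6.0 * n_params)

# HYPOTHETICAL achieved throughput (assumption): 1e14 FLOP/s for all model sizes.
ACHIEVED_FLOPS_PER_S = 1e14
ONE_HOUR_S = 3600

for n_params in (5e7, 1e8, 1e9):
    tokens = tokens_in_budget(n_params, ONE_HOUR_S, ACHIEVED_FLOPS_PER_S)
    print(f"N = {n_params:.0e}: {tokens:.2e} tokens, "
          f"{tokens / n_params:.2f} tokens per parameter")
```

Under these assumptions the tokens seen in a fixed time fall in inverse proportion to model size, so a larger model is trained on far fewer tokens per parameter. In practice achieved throughput may itself vary with model size, which is one reason a time budget cannot simply be converted into a FLOP budget.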
Future Work
In light of these findings, further research is encouraged to explore additional parameters that may affect model training under time constraints. Understanding these dynamics could lead to the development of more robust training protocols and methodologies, ultimately advancing the field of machine learning.
Resources
To support the research community, the authors are releasing all code, logs, and the more than 70 experimental configurations used in the study, so that others can replicate the findings and build on this work.
