Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
arXiv:2509.25300v4
Scaling laws for large language models (LLMs) during pre-training have received significant attention in recent years. Their behavior under reinforcement learning (RL) post-training, however, remains under-researched. This article summarizes a systematic empirical investigation of scaling behaviors in RL-based post-training, focusing specifically on mathematical reasoning.
Research Overview
The study is based on experiments across the entire Qwen2.5 dense model series, with models ranging from 0.5 billion to 72 billion parameters. The aim is to characterize how model scale, data volume, and computational budget interact and how together they influence performance.
Key Findings
- Larger Models Demonstrate Superior Learning Efficiency: Larger models consistently learn more efficiently. This holds for both compute and data: as model size increases, the model extracts more improvement from the same computational budget and from the same amount of training data.
- Power-Law Relationship: Our analysis shows that the relationship between test loss, compute, and data can be accurately modeled by a predictive power law. The relationship holds for both base and instruction-tuned models, which suggests it reflects a general property of RL post-training rather than a feature of one model variant.
- Latent Saturation Trend: Although larger models learn more efficiently, the analytical learning efficiency term k(N) in the power law shows a latent saturation trend as model size N grows. Increasing model size alone may therefore not yield proportional gains in learning efficiency.
- Importance of Data Quality Over Uniqueness: When data is scarce, repeatedly reusing high-quality data is effective. Final performance is driven primarily by the total number of optimization steps rather than by the uniqueness of the samples, so a small, high-quality dataset can be reused across many steps in RL post-training.
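The power-law and saturation findings above can be illustrated with a minimal sketch. It assumes a loss curve of the form L = A * X^(-k(N)), where X is compute or data, and a hypothetical saturating efficiency term k(N) = k_max * N / (N + n_half). The function names, the functional form of k(N), the constants, and the synthetic curves are all illustrative assumptions, not the paper's parameterization or fitted values.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~ A * x**(-k) by least squares in log-log space.

    Returns (k, A). A straight line in log-log space is the signature
    of a power law, so a degree-1 polynomial fit is sufficient.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return -slope, float(np.exp(intercept))


def saturating_k(n_params, k_max, n_half):
    """Hypothetical efficiency term: rises with model size, levels off at k_max."""
    return k_max * n_params / (n_params + n_half)


# Synthetic, noise-free loss curves (assumed values, for illustration only).
compute = np.logspace(0, 3, 20)                    # arbitrary compute units
model_sizes = np.array([0.5e9, 1.5e9, 7e9, 72e9])  # parameter counts
true_k = saturating_k(model_sizes, k_max=0.3, n_half=3e9)

# Recover k for each model size from its own loss curve.
fitted_k = np.array(
    [fit_power_law(compute, 2.0 * compute ** (-k))[0] for k in true_k]
)

for n, k in zip(model_sizes, fitted_k):
    print(f"N = {n:.1e}  fitted k = {k:.4f}")
```

Under these assumptions the fitted exponent increases with model size but stays below k_max, which is one way a saturation trend in k(N) could show up. With real training curves, the same log-log fit would be applied to measured test loss, and noise would make the estimate of k approximate.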
Conclusion
Together, these results provide a principled foundation for understanding how LLMs scale under RL post-training, and they offer practical guidelines for researchers and practitioners who want to improve the reasoning capabilities of these models. Knowing how model size, data quality, and computational budget trade off against each other helps in deciding where to spend a limited training budget.
As the field of AI continues to evolve, ongoing research in this domain will be crucial for unlocking the full potential of large language models, particularly in complex reasoning tasks.
