SFT-then-RL Beats Mixed-Policy Methods in LLM Reasoning

Recent progress in large language models (LLMs) has driven interest in improving their reasoning capabilities. A new study, arXiv:2604.23747v1, challenges the efficacy of mixed-policy optimization methods, which combine supervised and reinforcement learning (RL) signals. It reports that the more traditional approach, supervised fine-tuning (SFT) followed by RL (the SFT-then-RL pipeline), has been underestimated because of bugs in widely used training frameworks.

Key Findings

The study identifies two critical bugs that have influenced recent research outcomes:

CPU-Offloaded Optimizer Bug: Found in DeepSpeed, this bug causes the optimizer to drop intermediate micro-batches during gradient accumulation. It propagates to several downstream frameworks, including TRL, OpenRLHF, and LLaMA-Factory.
Loss Aggregation Bug: In OpenRLHF, this bug leads to incorrect weighting of per-mini-batch losses, further skewing results and diminishing the perceived effectiveness of the SFT-then-RL approach.
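
To make the loss aggregation issue concrete, here is a minimal sketch of one way per-mini-batch losses can be weighted incorrectly. It is an assumption for illustration, not OpenRLHF code: the article does not describe the exact form of the bug, and the function names and numbers below are hypothetical. The sketch compares a token-weighted mean with an unweighted mean of per-mini-batch means.

```python
# Hypothetical illustration of a loss-weighting mismatch; not OpenRLHF code.

def token_weighted_loss(mini_batches):
    """Mean loss per token, pooled across all mini-batches."""
    total_loss = sum(sum(losses) for losses in mini_batches)
    total_tokens = sum(len(losses) for losses in mini_batches)
    return total_loss / total_tokens


def mean_of_means(mini_batches):
    """Average of per-mini-batch means; every mini-batch counts equally,
    no matter how many tokens it contains."""
    means = [sum(losses) / len(losses) for losses in mini_batches]
    return sum(means) / len(means)


# Two mini-batches of per-token losses with different token counts.
mini_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 tokens
    [4.0, 4.0],                      # 2 tokens
]

pooled = token_weighted_loss(mini_batches)
unweighted = mean_of_means(mini_batches)
print(pooled, unweighted)
```

When mini-batches hold different numbers of tokens, the two aggregates disagree, so the short mini-batch pulls the training signal more than its share of the data would justify.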

Together, these bugs led to an inaccurate assessment of SFT performance, with the optimizer bug responsible for most of the discrepancy. Once both were fixed, the SFT-then-RL pipeline outperformed mixed-policy methods by a clear margin.
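
The effect of dropping micro-batches during gradient accumulation can be sketched with a toy model. This is a simplified illustration under stated assumptions, not DeepSpeed's implementation: the article does not give the mechanism, so the sketch assumes every micro-batch except the last is skipped, and the one-parameter model, the `drop_intermediate` flag, and the data are all hypothetical.

```python
# Toy sketch of gradient accumulation with dropped micro-batches.
# Assumption: the "bug" keeps only the final micro-batch. Not DeepSpeed code.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for y ~ w * x over one micro-batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, micro_batches, drop_intermediate=False):
    """Sum of micro-batch gradients divided by the number of micro-batches.

    With drop_intermediate=True, all micro-batches except the last are
    skipped, while the divisor still assumes every one contributed.
    """
    total = 0.0
    last = len(micro_batches) - 1
    for i, (xs, ys) in enumerate(micro_batches):
        if drop_intermediate and i != last:
            continue  # contribution silently lost
        total += grad_mse(w, xs, ys)
    return total / len(micro_batches)


# Three micro-batches drawn from y = 2x.
micro_batches = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([3.0, 4.0], [6.0, 8.0]),
    ([5.0, 6.0], [10.0, 12.0]),
]

w = 0.0
correct = accumulated_grad(w, micro_batches)
buggy = accumulated_grad(w, micro_batches, drop_intermediate=True)
print(correct, buggy)
```

In this sketch the buggy update is computed from a third of the data and is also scaled down, which is the kind of silent degradation that can make an SFT baseline look weaker than it is.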

Performance Metrics

With the bugs fixed, the study reports the following results for the SFT-then-RL pipeline:

On math benchmarks with Qwen2.5-Math-7B, SFT-then-RL surpassed mixed-policy methods by 3.8 points.
With Llama-3.1-8B, the gap widened to 22.2 points in favor of SFT-then-RL.
Even a truncated variant that used only 50 RL steps outperformed mixed-policy methods while consuming fewer floating-point operations (FLOPs).

Implications for Future Research

The findings matter for how LLM training methods are evaluated. Because the SFT-then-RL baseline was underestimated, comparisons that favored mixed-policy methods need to be re-run against a correctly implemented SFT-then-RL pipeline.

The research reaffirms the strength of the traditional SFT-then-RL pipeline and underscores the importance of rigorously testing and validating optimization frameworks. As the community pursues higher-performing models, sound baseline comparisons will be crucial for advancing LLM reasoning.

Conclusion

The study is a reminder that carefully checking the training machinery behind optimization methods is essential for meaningful progress in AI. As researchers fix the identified bugs and refine their methodologies, the SFT-then-RL pipeline is poised to remain a cornerstone of LLM reasoning training.
