SFT-then-RL Beats Mixed-Policy Methods in LLM Reasoning

Recent advances in large language models (LLMs) have drawn attention to how their reasoning capabilities are trained. A new study (arXiv:2604.23747v1) challenges the claimed advantage of mixed-policy optimization methods, which combine supervised and reinforcement learning (RL) signals. It finds that the more traditional approach, the SFT-then-RL pipeline (supervised fine-tuning followed by RL), has been underestimated because of bugs in widely used training frameworks.

Key Findings

The study identifies two critical bugs that have influenced recent research outcomes:

  • CPU-Offloaded Optimizer Bug: This issue, found in DeepSpeed, causes the optimizer to drop intermediate micro-batches during gradient accumulation. This bug affects several downstream frameworks, including TRL, OpenRLHF, and Llama-Factory.
  • Loss Aggregation Bug: In OpenRLHF, this bug leads to incorrect weighting of per-mini-batch losses, further skewing results and diminishing the perceived effectiveness of the SFT-then-RL approach.
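The effect of the first bug can be pictured with a toy example. The sketch below is not DeepSpeed code and does not reproduce its internals; the function names and the `drop_intermediate` flag are hypothetical, and the assumption that only the last micro-batch survives is just one simple way to model "dropped intermediate micro-batches". It shows how far the accumulated gradient can drift from the intended one.

```python
# Toy illustration only: this is NOT DeepSpeed code. It mimics the effect
# described above (intermediate micro-batches lost during accumulation).

def grad_mse(w, batch):
    """Gradient of mean squared error w.r.t. w for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches, drop_intermediate=False):
    """Average the per-micro-batch gradients over one accumulation window.

    With drop_intermediate=True, only the last micro-batch contributes,
    a hypothetical stand-in for the effect of the bug.
    """
    n = len(micro_batches)
    total = 0.0
    for i, mb in enumerate(micro_batches):
        if drop_intermediate and i < n - 1:
            continue  # this micro-batch's contribution is silently lost
        total += grad_mse(w, mb) / n
    return total


# Four micro-batches of one (x, y) pair each, all drawn from y = 2x.
micro_batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]

correct = accumulated_grad(0.0, micro_batches)
buggy = accumulated_grad(0.0, micro_batches, drop_intermediate=True)
print(correct, buggy)
```

In this toy setup the optimizer step would be computed from a fraction of the data it was supposed to see, which is consistent with a weaker SFT baseline.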

Together, these bugs led prior work to understate the performance of the SFT baseline, with the optimizer bug responsible for most of the discrepancy. Once both issues were fixed, the SFT-then-RL pipeline showed a significant performance advantage over mixed-policy methods.
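The article does not spell out the exact form of the loss aggregation issue, so the following is only a sketch of one common way per-mini-batch losses can be weighted differently: averaging per-mini-batch means versus averaging over all tokens. The function names and the sample loss values are hypothetical, and which aggregation is the intended one depends on the training objective.

```python
# Toy illustration only: not OpenRLHF code. Each inner list holds
# per-token losses for one mini-batch; the sizes differ on purpose.

def token_weighted_loss(mini_batches):
    """Mean loss over all tokens, so every token counts equally."""
    total = sum(sum(mb) for mb in mini_batches)
    count = sum(len(mb) for mb in mini_batches)
    return total / count


def mean_of_means_loss(mini_batches):
    """Mean of per-mini-batch means, so every mini-batch counts equally."""
    per_batch = [sum(mb) / len(mb) for mb in mini_batches]
    return sum(per_batch) / len(per_batch)


# One long mini-batch with low loss, one short mini-batch with high loss.
mini_batches = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [4.0, 4.0]]

print(token_weighted_loss(mini_batches))
print(mean_of_means_loss(mini_batches))
```

The two aggregations agree only when every mini-batch has the same number of tokens; otherwise short mini-batches are over-weighted by the mean-of-means variant.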

Performance Metrics

The study reports the following results for the corrected SFT-then-RL pipeline:

  • On math benchmarks with Qwen2.5-Math-7B, SFT-then-RL surpassed mixed-policy methods by +3.8 points.
  • With Llama-3.1-8B, the gap widened: SFT-then-RL outperformed mixed-policy approaches by +22.2 points.
  • Even a truncated variant of SFT-then-RL, using only 50 RL steps, outperformed mixed-policy methods while consuming fewer floating-point operations (FLOPs).

Implications for Future Research

The findings have direct implications for how LLM training methods are compared. Because the SFT-then-RL baseline was weakened by the bugs described above, earlier comparisons that favored mixed-policy methods may need to be re-run against a correctly trained baseline.

The research reaffirms the strength of the traditional SFT-then-RL pipeline and underscores the importance of rigorously testing and validating optimization frameworks. As the community pushes for higher-performing models, the integrity of baseline comparisons will be crucial to progress in LLM reasoning.

Conclusion

The study is a reminder that the training infrastructure behind an optimization method deserves as much scrutiny as the method itself. As researchers fix the identified bugs and refine their methodologies, the SFT-then-RL pipeline is poised to remain a cornerstone of LLM reasoning training.

Lazarus Omolua