SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Optimizing the reasoning capabilities of large language models (LLMs) is an active research area. A new study, arXiv:2604.23747v1, challenges the efficacy of mixed-policy optimization methods, which combine supervised and reinforcement learning (RL) signals. It finds that the more traditional pipeline of supervised fine-tuning (SFT) followed by RL, referred to as SFT-then-RL, has been underestimated because of bugs in the training frameworks used to produce its baseline results.
Key Findings
The study identifies two critical bugs that have influenced recent research outcomes:
- CPU-Offloaded Optimizer Bug: This issue, found in DeepSpeed, causes the optimizer to drop intermediate micro-batches during gradient accumulation. This bug affects several downstream frameworks, including TRL, OpenRLHF, and Llama-Factory.
- Loss Aggregation Bug: In OpenRLHF, this bug leads to incorrect weighting of per-mini-batch losses, further skewing results and diminishing the perceived effectiveness of the SFT-then-RL approach.
Together, these bugs led to SFT performance being underestimated, with the optimizer bug responsible for most of the discrepancy. Once both issues were fixed, the SFT-then-RL pipeline showed a significant performance advantage over mixed-policy methods.
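To make the two failure modes concrete, here is a minimal toy sketch in plain Python. It does not reproduce the DeepSpeed or OpenRLHF code, and the exact mechanics of the real bugs may differ. The function names, the scalar "gradients", and the assumption that the divisor stays at the full window size when micro-batches are dropped are all hypothetical, chosen only to show how each kind of error shifts the result.

```python
# Toy illustration in plain Python. Nothing here calls DeepSpeed or OpenRLHF;
# all function names and numbers are hypothetical.

def accumulated_gradient(micro_grads, kept=None):
    """Average scalar gradients over one accumulation window.

    `kept` lists the micro-batch indices that actually reach the optimizer.
    None means all of them (the intended behaviour). The divisor stays at the
    full window size, which is an assumption about how a drop would show up.
    """
    if kept is None:
        kept = range(len(micro_grads))
    return sum(micro_grads[i] for i in kept) / len(micro_grads)


def token_weighted_loss(mini_batches):
    """Mean loss over every token, so larger mini-batches count for more."""
    total = sum(sum(batch) for batch in mini_batches)
    count = sum(len(batch) for batch in mini_batches)
    return total / count


def mean_of_means_loss(mini_batches):
    """Average the per-mini-batch means, giving each mini-batch equal weight."""
    means = [sum(batch) / len(batch) for batch in mini_batches]
    return sum(means) / len(means)


# Four micro-batches in one accumulation window.
grads = [1.0, 2.0, 3.0, 6.0]
full_grad = accumulated_gradient(grads)               # every micro-batch used
dropped_grad = accumulated_gradient(grads, kept=[3])  # only the last one survives

# Two mini-batches of unequal size (three tokens and one token).
losses = [[2.0, 2.0, 2.0], [6.0]]
per_token = token_weighted_loss(losses)
per_batch = mean_of_means_loss(losses)

print("gradient, all micro-batches:", full_grad)
print("gradient, micro-batches dropped:", dropped_grad)
print("loss, token-weighted:", per_token)
print("loss, mean of mini-batch means:", per_batch)
```

In this toy setting, dropping micro-batches changes the update the optimizer sees, and the two loss aggregation schemes disagree as soon as mini-batches differ in size. Either effect can quietly weaken a baseline without raising an error.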
Performance Metrics
The research highlights the performance improvements achieved by the corrected SFT-then-RL pipeline:
- On math benchmarks with Qwen2.5-Math-7B, the SFT-then-RL method surpassed mixed-policy methods by 3.8 points.
- With Llama-3.1-8B, the gap widened, with the SFT-then-RL method outperforming mixed-policy approaches by 22.2 points.
- Even a truncated variant of the SFT-then-RL method, which used only 50 RL steps, outperformed mixed-policy methods while consuming fewer floating-point operations (FLOPs).
Implications for Future Research
The findings have significant implications for LLM development and optimization strategies. Because the study traces the apparent weakness of SFT-then-RL to framework bugs, researchers can re-evaluate mixed-policy methods against a corrected baseline and align their benchmarks more accurately with the capabilities of established methods like SFT-then-RL.
This research not only reaffirms the strength of the traditional SFT-then-RL pipeline but also emphasizes the importance of rigorous testing and validation of optimization frameworks in the rapidly evolving field of AI and machine learning. As the community continues to strive for higher-performing models, ensuring the integrity of baseline comparisons will be crucial for advancing the state of LLM reasoning.
Conclusion
The study serves as a critical reminder that thorough investigation into the underlying mechanisms of optimization methods is essential for achieving meaningful advancements in artificial intelligence. As researchers address these identified bugs and refine their methodologies, the SFT-then-RL pipeline is poised to remain a cornerstone of effective LLM reasoning strategies.
