SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
arXiv:2604.13515v1
Abstract
Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest-compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
Introduction
Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe, yet the degree to which the GRPO prompts coincide with the SFT corpus has, to the authors' knowledge, not previously been studied in a controlled way. This study treats that SFT-GRPO data overlap as a post-training hyperparameter and measures how it affects model performance on Lean 4 autoformalization.
Methodology
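In our controlled ablation study, we post-trained Qwen3-8B (thinking disabled) for Lean 4 autoformalization under six conditions that differ solely in training recipe. For the three SFT+GRPO configurations, the overlap percentage is the share of GRPO prompts that coincide with the SFT corpus. Models were evaluated on Gaokao-Formal and PutnamBench using both compile pass@k and semantic pass@k, the latter assessed by an LLM judge. The conditions were: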
- Base Model
- SFT-only
- GRPO-only
- SFT + GRPO (0% overlap)
- SFT + GRPO (30% overlap)
- SFT + GRPO (100% overlap)
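The three SFT+GRPO conditions differ only in how the GRPO prompt set is drawn relative to the SFT corpus. The following is a minimal sketch of one way such a prompt set could be assembled at a target overlap fraction. It is an illustration, not the authors' pipeline: the function names `build_grpo_prompts` and `measured_overlap`, the use of prompt IDs, and the uniform sampling strategy are all assumptions.

```python
import random


def build_grpo_prompts(sft_ids, fresh_ids, n_prompts, overlap, seed=0):
    """Sample n_prompts GRPO prompt IDs so that a fraction `overlap`
    of them also appear in the SFT corpus (hypothetical helper)."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")

    sft_set = set(sft_ids)
    sft_pool = sorted(sft_set)
    # Guard: the "fresh" pool must be truly disjoint from the SFT corpus.
    fresh_pool = sorted(set(fresh_ids) - sft_set)

    n_shared = round(overlap * n_prompts)  # prompts reused from SFT
    n_fresh = n_prompts - n_shared         # prompts unseen during SFT
    if n_shared > len(sft_pool) or n_fresh > len(fresh_pool):
        raise ValueError("not enough prompts to reach the requested overlap")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    prompts = rng.sample(sft_pool, n_shared) + rng.sample(fresh_pool, n_fresh)
    rng.shuffle(prompts)
    return prompts


def measured_overlap(grpo_ids, sft_ids):
    """Fraction of GRPO prompts that coincide with the SFT corpus."""
    sft_set = set(sft_ids)
    return sum(p in sft_set for p in grpo_ids) / len(grpo_ids)
```

Holding `n_prompts` fixed across the 0%, 30%, and 100% settings keeps the GRPO compute budget the same, so only the overlap changes between conditions.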
Results
Keeping SFT and GRPO data disjoint consistently outperformed full overlap, at zero additional compute cost. Across Gaokao-Formal and PutnamBench, lower overlap was monotonically associated with higher compilation and semantic accuracy. At 0% overlap, the GRPO stage yielded a 10.4 percentage point gain in semantic accuracy over SFT alone on Gaokao. At 100% overlap, by contrast, both compilation and semantic accuracy remained flat, rendering the GRPO stage effectively redundant.
Discussion
These findings indicate that the degree of data sharing between the SFT and GRPO stages materially affects model behavior. Dual-metric evaluation also exposes a discrepancy that compile-only benchmarking cannot show: for the highest-compiling models, the gap between compile accuracy and semantic accuracy exceeded 30 percentage points. A formalization that compiles is therefore not necessarily a faithful one, and both metrics should be reported.
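To make the compile-semantic gap concrete, the sketch below computes compile pass@k, semantic pass@k, and their difference in percentage points from per-sample flags. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k); the source does not state which estimator the authors used, so that choice is an assumption. The function names, the input layout, and the rule that a sample counts as semantically correct only if it also compiles are likewise illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples with c successes, succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def dual_metric(samples_per_problem, k):
    """samples_per_problem: one list per problem of
    (compiles, judged_correct) boolean pairs (hypothetical layout)."""
    compile_scores, semantic_scores = [], []
    for samples in samples_per_problem:
        n = len(samples)
        n_compile = sum(1 for compiles, _ in samples if compiles)
        # Assumption: semantic success requires the sample to compile.
        n_semantic = sum(1 for compiles, ok in samples if compiles and ok)
        compile_scores.append(pass_at_k(n, n_compile, k))
        semantic_scores.append(pass_at_k(n, n_semantic, k))

    compile_rate = sum(compile_scores) / len(compile_scores)
    semantic_rate = sum(semantic_scores) / len(semantic_scores)
    return {
        "compile": compile_rate,
        "semantic": semantic_rate,
        # Gap in percentage points; invisible if only "compile" is reported.
        "gap_pp": 100.0 * (compile_rate - semantic_rate),
    }
```

Under this definition the semantic score can never exceed the compile score, so the gap is non-negative and directly measures how much of the compile rate comes from formalizations the judge rejects.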
Conclusion
To the authors' knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter. The practical takeaway is that keeping the SFT and GRPO data disjoint improves both compilation and semantic accuracy at no additional compute cost, while full overlap leaves the GRPO stage with little measurable benefit.
