Optimizing SFT-GRPO Data Overlap for Autoformalization

SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

Summary: arXiv:2604.13515v1 Announce Type: cross

Abstract

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0%, 30%, or 100% of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass@k and semantic pass@k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0% overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100% overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points for the highest-compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.

Introduction

Post-training recipes that chain Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) involve a choice that is rarely examined: how much of the GRPO prompt set is drawn from data already used for SFT. This study treats that overlap as a hyperparameter and measures its effect on model performance.

Methodology

In a controlled ablation, Qwen3-8B (thinking disabled) was post-trained for Lean 4 autoformalization under six conditions that differ only in training recipe. In the three SFT + GRPO configurations, the percentage is the share of GRPO prompts that also appear in the SFT corpus. The conditions were:

  • Base Model
  • SFT-only
  • GRPO-only
  • SFT + GRPO (0% overlap)
  • SFT + GRPO (30% overlap)
  • SFT + GRPO (100% overlap)
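
The three SFT + GRPO conditions differ only in how the GRPO prompt set is assembled. The following is a minimal sketch of one way to build such a set, not the authors' pipeline: the function name `build_grpo_prompts`, its parameters, and the use of plain strings as prompts are all assumptions made for illustration. It holds the GRPO set size fixed so that changing the overlap does not change the amount of GRPO data.

```python
import random


def build_grpo_prompts(sft_prompts, fresh_prompts, overlap, n_grpo, seed=0):
    """Sample n_grpo GRPO prompts, a fraction `overlap` of which come from the SFT corpus.

    Hypothetical helper for illustration only; the source does not describe
    how its overlap splits were constructed.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    if set(sft_prompts) & set(fresh_prompts):
        raise ValueError("fresh_prompts must be disjoint from sft_prompts")

    rng = random.Random(seed)
    n_shared = round(overlap * n_grpo)   # prompts reused from SFT
    n_fresh = n_grpo - n_shared          # prompts never seen during SFT

    shared = rng.sample(list(sft_prompts), n_shared)
    fresh = rng.sample(list(fresh_prompts), n_fresh)

    prompts = shared + fresh
    rng.shuffle(prompts)                 # avoid ordering shared prompts first
    return prompts


if __name__ == "__main__":
    # Toy prompt identifiers, not real benchmark data.
    sft = [f"sft-{i}" for i in range(20)]
    fresh = [f"new-{i}" for i in range(20)]
    for overlap in (0.0, 0.3, 1.0):
        grpo = build_grpo_prompts(sft, fresh, overlap, n_grpo=10)
        reused = len(set(grpo) & set(sft))
        print(f"overlap={overlap:.0%}: {reused} of {len(grpo)} prompts reused from SFT")
```

Fixing the seed and the set size keeps the three configurations comparable: only the origin of the prompts changes between them.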

Results

Keeping SFT and GRPO data disjoint consistently outperformed full overlap, at no additional compute cost. On Gaokao-Formal and PutnamBench, lower overlap was monotonically associated with higher compile and semantic pass@k. At 0% overlap, the GRPO stage delivered a 10.4 percentage point gain in semantic accuracy over SFT alone on Gaokao-Formal. At 100% overlap, both compilation and semantic accuracy remained flat, making the GRPO stage effectively redundant.
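
Both metrics are reported as pass@k. The source does not say which estimator was used, so the sketch below assumes the unbiased pass@k estimator that is standard in code-generation evaluation: with n samples per problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated, is a passing one.
    Assumption: this is the common estimator, not necessarily the paper's.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

The same function serves both metrics: for compile pass@k, c counts samples that Lean 4 accepts; for semantic pass@k, c counts samples the judge also deems faithful.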

Discussion

These findings indicate that the degree of data sharing between training stages materially affects what the GRPO stage contributes. Evaluating under both compile pass@k and LLM-judged semantic pass@k also exposes a disparity that compile-only benchmarking hides: for the highest-compiling models, the compile-semantic gap exceeds 30 percentage points, so a substantial share of the formalizations that compile in Lean 4 are not judged semantically correct.
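
The compile-semantic gap is the difference, in percentage points, between the two rates. A minimal sketch follows, assuming one scored formalization per problem and assuming that a semantic pass requires the statement to compile as well. The `Outcome` type and the toy data are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    compiles: bool         # Lean 4 accepted the formal statement
    judged_faithful: bool  # LLM judge says it matches the informal problem


def compile_semantic_gap(outcomes):
    """Return (compile %, semantic %, gap in percentage points)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    compile_rate = 100.0 * sum(o.compiles for o in outcomes) / n
    # Assumption: a statement that fails to compile cannot pass semantically.
    semantic_rate = 100.0 * sum(o.compiles and o.judged_faithful for o in outcomes) / n
    return compile_rate, semantic_rate, compile_rate - semantic_rate


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = [
        Outcome(True, True),
        Outcome(True, False),
        Outcome(True, False),
        Outcome(False, False),
    ]
    compile_rate, semantic_rate, gap = compile_semantic_gap(toy)
    print(f"compile={compile_rate:.1f}%  semantic={semantic_rate:.1f}%  gap={gap:.1f} pp")
```

A model can raise its compile rate without raising its semantic rate, which is why reporting the gap alongside either metric is informative.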

Conclusion

According to the authors, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter. The practical takeaway is that drawing GRPO prompts from data disjoint from the SFT corpus improves both compilation and semantic accuracy at no additional compute cost, whereas full overlap leaves the GRPO stage effectively redundant.


Lazarus Omolua