SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
SOAR (Self-Correction for Optimal Alignment and Refinement) is a recently proposed post-training method for diffusion models. It targets the weaknesses of the two techniques usually applied after pretraining to improve these models: supervised fine-tuning (SFT) and reinforcement learning (RL).
Understanding the Current Landscape
Post-training for diffusion models typically has two stages: SFT on curated datasets, followed by RL against reward models. SFT optimizes the denoiser on ground-truth states sampled from the forward noising process, and this is its main limitation. Once the model's own inference trajectory drifts away from these ideal states, it must rely on out-of-distribution generalization. This mismatch between training and inference is known as exposure bias, a problem that also affects autoregressive models.
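To make the SFT setup concrete, here is a minimal sketch of a velocity-matching loss on ground-truth noised states. It assumes a rectified-flow parameterization, x_t = (1 - t) * x0 + t * eps with velocity target eps - x0; the function names, the `model(x_t, t)` signature, and the uniform timestep sampling are illustrative assumptions, not details taken from the SOAR work.

```python
import numpy as np

def forward_noise(x0, eps, t):
    """Assumed rectified-flow interpolation between data (t=0) and noise (t=1)."""
    return (1.0 - t) * x0 + t * eps

def sft_loss(model, x0, rng):
    """Mean squared velocity error on states drawn from the forward process.

    `model` is a hypothetical callable (x_t, t) -> predicted velocity.
    Every training input x_t is built from the real sample x0, so the model
    never sees states produced by its own sampling.
    """
    eps = rng.standard_normal(x0.shape)
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))
    x_t = forward_noise(x0, eps, t)
    target = eps - x0  # velocity of the straight path from x0 to eps
    pred = model(x_t, t)
    return float(np.mean((pred - target) ** 2))
```

Because x_t always lies on the ideal path from a real sample, nothing in this loss teaches the model how to recover once its own sampling steps have left that path.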
The Challenges of Reinforcement Learning
RL can in principle close the gap between SFT training and inference, because it trains on the model's own samples, but it brings its own challenges. The reward is usually a sparse terminal signal, which makes it hard to assign credit to individual denoising steps. RL is also prone to reward hacking, where the model learns to exploit the reward model rather than genuinely improve its outputs.
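The credit-assignment problem can be illustrated with a toy example. The sketch below is purely hypothetical: it treats a trajectory as a list of per-step error magnitudes and compares a single terminal reward, shared by every step, with a per-step signal. None of these names come from the SOAR work.

```python
import numpy as np

def terminal_signal(step_errors):
    """One scalar reward for the whole trajectory, broadcast to every step."""
    reward = -float(np.sum(step_errors))
    return np.full(len(step_errors), reward)

def dense_signal(step_errors):
    """A separate learning signal for each step."""
    return -np.asarray(step_errors, dtype=float)

early_mistake = [1.0, 0.0, 0.0, 0.0]  # error made at the first step
late_mistake = [0.0, 0.0, 0.0, 1.0]   # error made at the last step
```

Both trajectories receive the same terminal signal, so a terminal reward alone cannot tell which step was at fault, while the dense signal distinguishes them.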
Introducing SOAR
To bridge the gap between SFT and RL, researchers have proposed SOAR, a bias-correction post-training method. SOAR operates by starting with a real sample and performing a single stop-gradient rollout using the current model. It then re-noises the resulting off-trajectory state and supervises the model to guide it back to the original clean target. This approach is notable for being on-policy, reward-free, and providing dense per-timestep supervision, effectively eliminating the credit-assignment issue.
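The procedure can be sketched as follows. This is one possible reading of the description, under stated assumptions: a rectified-flow parameterization (x_t = (1 - t) * x0 + t * eps), a rollout that jumps from x_t to a clean estimate in one step, and a correction target defined as the velocity that carries the re-noised state back to the original x0. The function name, the `model(x, t)` signature, and the `t_min` cutoff are hypothetical, and the actual SOAR formulation may differ.

```python
import numpy as np

def soar_correction_loss(model, x0, rng, t_min=0.05):
    """Illustrative self-correction loss (assumed form, not the paper's exact one)."""
    n = x0.shape[0]

    # 1. Start from a real sample and noise it to a random timestep t.
    t = rng.uniform(t_min, 1.0, size=(n, 1))
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps

    # 2. Single rollout with the current model. In an autodiff framework this
    #    result would be wrapped in a stop-gradient; NumPy tracks no gradients.
    x0_hat = x_t - t * model(x_t, t)

    # 3. Re-noise the model's own (possibly off-trajectory) estimate.
    s = rng.uniform(t_min, 1.0, size=(n, 1))
    eps_new = rng.standard_normal(x0.shape)
    x_s = (1.0 - s) * x0_hat + s * eps_new

    # 4. Supervise toward the original clean target: the velocity v that
    #    satisfies x_s - s * v = x0.
    target = (x_s - x0) / s
    pred = model(x_s, s)
    return float(np.mean((pred - target) ** 2))
```

Under these assumptions, a rollout that recovers x0 exactly gives a target of eps_new - x0, which is the ordinary SFT velocity target, so the correction only departs from SFT where the model's own rollout has drifted. Every sampled timestep gets its own regression target, and no reward model is involved.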
Performance and Improvements
The reported results are promising. In experiments on the SD3.5-Medium model, SOAR improved the GenEval score from 0.70 to 0.78 and the OCR score from 0.64 to 0.67 relative to traditional SFT, absolute gains of 0.08 and 0.03 respectively. SOAR also raised all model-based preference scores, so the improvement is not confined to a single metric.
Comparative Analysis
In controlled experiments targeting specific rewards, SOAR outperformed Flow-GRPO on final metric values for both the aesthetic and the text-image alignment tasks, even though it uses no reward model.
Conclusion and Future Implications
Because SOAR's base loss includes the standard SFT objective, it can serve as an alternative to SFT for the first post-training stage after pretraining. It is also fully compatible with subsequent RL alignment, so it can be placed ahead of an RL stage rather than replacing it. Taken together, these properties make SOAR a promising step toward better-aligned diffusion models.
