FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Summary: arXiv:2604.06916v1 Announce Type: cross
Abstract: Reinforcement-learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. Recent studies show that increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundation diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden.
To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. However, we find that naively quantizing the pipeline risks degrading performance. To resolve this tension between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), an FP4-empowered two-stage reinforcement learning framework.
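To give a rough sense of why naive FP4 use can hurt fidelity, the sketch below simulates rounding values to a 4-bit E2M1 grid with one scale per block of 16 values. It is an illustration under stated assumptions, not the paper's quantizer: the function names are hypothetical, the scales are kept in full precision (real block-scaled FP4 formats store them in a narrow type as well), and the Gaussian test data is synthetic.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quantize_fp4(values, block_size=16):
    """Round values to the E2M1 grid with one scale per block, then dequantize.

    Simplification (assumption): scales stay in full precision here.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the largest grid value (6.0).
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in block:
            # Pick the grid magnitude closest to the scaled magnitude.
            nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((nearest if v >= 0 else -nearest) * scale)
    return out

def relative_error(reference, approx):
    """L2 norm of the difference divided by the L2 norm of the reference."""
    num = sum((r - a) ** 2 for r, a in zip(reference, approx)) ** 0.5
    den = sum(r ** 2 for r in reference) ** 0.5
    return num / den

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic stand-in data
quantized = fake_quantize_fp4(weights)
rel_err = relative_error(weights, quantized)
print(f"relative quantization error: {rel_err:.3f}")
```

With only eight magnitudes per block, every value carries a rounding error, so samples generated this way can differ from what the full-precision model would produce.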
Proposed Framework: Sol-RL
The Sol-RL framework operates in two distinct stages:
- Stage One: High-throughput NVFP4 rollouts are employed to generate a massive candidate pool, from which a highly contrastive subset is extracted.
- Stage Two: These selected samples are then regenerated in BF16 precision, where the policy is optimized exclusively on this refined set.
By decoupling candidate exploration from policy optimization, Sol-RL combines the algorithmic benefits of rollout scaling with the system-level throughput gains of NVFP4: the rollout phase is accelerated, while optimization still uses high-fidelity BF16 samples.
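The control flow of the two stages can be sketched as follows. This is a toy illustration, not the authors' implementation: `rollout`, `reward`, `select_contrastive`, and `two_stage_group` are hypothetical names, the "sampler" just returns a number, FP4 is modeled as a small perturbation, and two choices are assumptions the abstract does not spell out, namely that the contrastive subset is the k lowest- and k highest-reward candidates and that regeneration reuses the same seeds.

```python
import random

def rollout(prompt, seed, precision):
    """Hypothetical stand-in for a diffusion sampler; returns a fake 'image'."""
    clean = random.Random(f"{prompt}-{seed}").gauss(0.0, 1.0)
    # Toy model of quantization: FP4 adds a small seed-dependent perturbation.
    noise = 0.0
    if precision == "fp4":
        noise = random.Random(f"fp4-{seed}").gauss(0.0, 0.1)
    return clean + noise

def reward(sample):
    """Hypothetical reward model; here simply the sample value itself."""
    return sample

def select_contrastive(seeds, scores, k):
    """Keep the k lowest- and k highest-scoring seeds (an assumed rule)."""
    ranked = sorted(zip(scores, seeds))
    return [s for _, s in ranked[:k]] + [s for _, s in ranked[-k:]]

def two_stage_group(prompt, pool_size=64, k=4):
    seeds = list(range(pool_size))
    # Stage one: cheap low-precision rollouts over the whole candidate pool.
    fp4_scores = [reward(rollout(prompt, s, "fp4")) for s in seeds]
    chosen = select_contrastive(seeds, fp4_scores, k)
    # Stage two: regenerate only the chosen seeds in BF16 and re-score them,
    # so the policy update never sees an FP4-generated sample.
    group = []
    for s in chosen:
        sample = rollout(prompt, s, "bf16")
        group.append((s, sample, reward(sample)))
    return group  # this refined set would be handed to the policy optimizer

group = two_stage_group("a red fox in the snow")
for seed, _, r in group:
    print(f"seed={seed:2d}  bf16_reward={r:+.3f}")
```

In this sketch only 2k of the `pool_size` candidates are ever generated at full precision, which is where the cost saving would come from; FP4 is used purely to decide which seeds are worth regenerating.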
Performance and Results
We empirically demonstrate that our framework maintains the training integrity of a BF16-precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across three diffusion models (SANA, FLUX.1, and SD3.5-L) show that our approach delivers superior alignment performance across multiple metrics.
Remarkably, our method accelerates training convergence by up to 4.64×, unlocking massive rollout scaling at a fraction of the cost.
Conclusion
Sol-RL integrates FP4 quantization into a two-stage reinforcement learning framework for diffusion models. It reduces the computational burden of large-scale rollouts while improving the alignment between generated images and human preferences, which could make rollout scaling practical for post-training larger text-to-image models.
