Tiny Inference-Time Scaling with Latent Verifiers
Summary: arXiv:2603.22492v2
Inference-time scaling has emerged as an effective way to improve the outputs of generative models at test time: several candidate outputs are generated, and a verifier scores them and selects the best. Multimodal Large Language Models (MLLMs) are a common choice of verifier. They can improve output quality, but they also add significant inference-time cost.
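The score-and-select step described above can be sketched as a best-of-N loop. This is a minimal illustration, not the paper's implementation: `best_of_n`, `toy_verifier`, and the string candidates are hypothetical stand-ins for generated images and a learned verifier.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    candidates: Sequence[T], verifier: Callable[[T], float]
) -> Tuple[T, List[float]]:
    """Score every candidate with the verifier and return the top-scoring one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [verifier(candidate) for candidate in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_verifier(candidate: str) -> float:
    """Hypothetical verifier: rewards mentions of the word 'red'."""
    return float(candidate.count("red"))


# Toy candidates standing in for N generated outputs for one prompt.
toy_candidates = ["a blue cube", "a red cube", "a red cube on a red table"]
best, scores = best_of_n(toy_candidates, toy_verifier)
print(best, scores)
```

The cost of this loop grows with the number of candidates times the cost of one verifier call, which is why an expensive verifier dominates the budget when N is small.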
Diffusion pipelines operate in an autoencoder latent space to reduce computation. MLLM verifiers, however, require each candidate to be decoded to pixel space and then re-encoded into a visual embedding space, which adds redundant and costly operations.
Introducing Verifier on Hidden States (VHS)
To address these costs, we propose Verifier on Hidden States (VHS), a verifier that operates directly on the intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. Because VHS analyzes generator features without decoding to pixel space, it significantly reduces the per-candidate verification cost while matching or improving on the performance of MLLM-based competitors.
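To make the idea of scoring hidden states concrete, here is a minimal sketch under strong assumptions. The summary does not describe the VHS architecture, so the mean-pooling step, the linear head, the tensor shapes, and every name below (`mean_pool`, `latent_score`, `select_in_latent_space`) are hypothetical. The only point illustrated is that selection can happen without any decode to pixel space.

```python
import random
from typing import List, Sequence

HiddenStates = List[List[float]]  # assumed shape: (num_tokens, hidden_dim)


def mean_pool(hidden: HiddenStates) -> List[float]:
    """Average token features into a single vector of length hidden_dim."""
    num_tokens = len(hidden)
    return [sum(column) / num_tokens for column in zip(*hidden)]


def latent_score(
    hidden: HiddenStates, weights: Sequence[float], bias: float = 0.0
) -> float:
    """Hypothetical linear verifier head applied to pooled hidden states."""
    pooled = mean_pool(hidden)
    return sum(w * x for w, x in zip(weights, pooled)) + bias


def select_in_latent_space(
    batch: Sequence[HiddenStates], weights: Sequence[float], bias: float = 0.0
) -> int:
    """Return the index of the best candidate; nothing is decoded to pixels."""
    scores = [latent_score(hidden, weights, bias) for hidden in batch]
    return max(range(len(batch)), key=scores.__getitem__)


# Random toy features standing in for generator hidden states.
rng = random.Random(0)
NUM_CANDIDATES, NUM_TOKENS, HIDDEN_DIM = 4, 16, 8
batch = [
    [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in range(NUM_TOKENS)]
    for _ in range(NUM_CANDIDATES)
]
weights = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
winner = select_in_latent_space(batch, weights)
print("selected candidate index:", winner)
```

In a pipeline of this shape, only the selected candidate would need to be decoded, whereas a pixel-space verifier has to decode and re-encode every candidate before scoring it.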
Performance Improvements
VHS performs well under tiny inference budgets, where only a small number of candidates are evaluated per prompt. It makes inference-time scaling more efficient, with the following reported reductions:
- Joint generation-and-verification time is reduced by 63.3%.
- Compute FLOPs are decreased by 51%.
- VRAM usage is cut down by 14.5%.
Despite these savings, VHS achieves a +2.7% improvement on GenEval at the same inference-time budget.
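The reported percentages can be restated as remaining fractions and equivalent ratios, which some readers find easier to compare. The sketch below is plain arithmetic on the three figures quoted above. The helper names are illustrative, and the baseline is assumed to be whatever MLLM-based setup the authors measured against, which this summary does not specify.

```python
def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline cost left after a given percentage reduction."""
    if not 0.0 <= reduction_percent < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 - reduction_percent / 100.0


def equivalent_ratio(reduction_percent: float) -> float:
    """Baseline cost divided by reduced cost, e.g. a speedup factor for time."""
    return 1.0 / remaining_fraction(reduction_percent)


# Figures quoted in the list above.
reported = {
    "generation-and-verification time": 63.3,
    "compute FLOPs": 51.0,
    "VRAM usage": 14.5,
}

for name, percent in reported.items():
    print(
        f"{name}: {remaining_fraction(percent):.3f} of baseline, "
        f"ratio {equivalent_ratio(percent):.2f}x"
    )
```

For example, a 63.3% reduction in time leaves 36.7% of the baseline, which corresponds to roughly a 2.7x speedup. This factor is unrelated to the +2.7% GenEval figure.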
Conclusion
VHS avoids pixel-space decoding by scoring candidates directly from generator hidden states, which lowers verification cost while matching or improving output quality. This makes inference-time scaling more practical at small candidate budgets and points toward more efficient verification in latent-space generative pipelines.
