Efficient Inference-Time Scaling with Latent Verifiers

Tiny Inference-Time Scaling with Latent Verifiers

Summary: arXiv:2603.22492v2

Inference-time scaling improves the output quality of generative models at test time by sampling several candidate outputs and using a verifier to score them and select the best one. Multimodal Large Language Models (MLLMs) are a common choice of verifier. They improve results, but they also add substantial inference-time cost.
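The score-and-select loop described above can be sketched as a best-of-N procedure. This is a minimal illustration, not the paper's code: `best_of_n`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy generator and verifier are stand-ins for a real image generator and a real verifier.

```python
import random
from typing import Callable, Tuple, TypeVar

C = TypeVar("C")  # type of a candidate output (an image, a latent, ...)


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], C],
    verify: Callable[[str, C], float],
    n: int,
) -> Tuple[C, float]:
    """Generate n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One candidate per seed; every candidate costs one generation call
    # plus one verification call, so the verifier's cost is paid n times.
    candidates = [generate(prompt, seed) for seed in range(n)]
    scores = [verify(prompt, candidate) for candidate in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Hypothetical stand-ins so the sketch runs without any model.
def toy_generate(prompt: str, seed: int) -> float:
    """Pretend candidate: a reproducible number in [0, 1)."""
    return random.Random(seed).random()


def toy_verify(prompt: str, candidate: float) -> float:
    """Pretend verifier: prefers candidates close to 0.5."""
    return -abs(candidate - 0.5)


if __name__ == "__main__":
    candidate, score = best_of_n("a red cube", toy_generate, toy_verify, n=4)
    print(candidate, score)
```

Because the verifier runs once per candidate, its cost is multiplied by the number of candidates, which is why a cheaper verifier matters most when scaling N.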

Diffusion pipelines operate in an autoencoder latent space to reduce computation. MLLM verifiers, however, require each candidate to be decoded to pixel space and then re-encoded into a visual embedding space, which adds redundant and costly operations.

Introducing Verifier on Hidden States (VHS)

To address this, we propose Verifier on Hidden States (VHS), a verifier that operates directly on the intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. Because VHS analyzes generator features without decoding to pixel space, it lowers the per-candidate verification cost while matching or improving on the performance of MLLM-based verifiers.
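The difference between the two verification paths can be made concrete by counting the stages each one runs per candidate. The sketch below is illustrative only: every function is a hypothetical stub, the "hidden state" is a short list of numbers, and the linear scoring head is an assumption, since this summary does not describe the actual VHS architecture.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical weights of a tiny linear scoring head (an assumption).
WEIGHTS: Vector = [0.5, -0.25, 1.0, 0.75]


def generate(seed: int, log: List[str]) -> Dict[str, Vector]:
    """Stub single-step generator: returns an output latent and a hidden state."""
    log.append("generate")
    hidden = [((seed * 31 + i * 17) % 7) / 7.0 for i in range(4)]
    latent = [h * 2.0 for h in hidden]
    return {"latent": latent, "hidden": hidden}


def decode_to_pixels(latent: Vector, log: List[str]) -> Vector:
    """Stub autoencoder decoder (latent space to pixel space)."""
    log.append("decode")
    return [x / 2.0 for x in latent]


def encode_for_mllm(pixels: Vector, log: List[str]) -> Vector:
    """Stub visual encoder (pixel space to the verifier's embedding space)."""
    log.append("re-encode")
    return list(pixels)


def score(features: Vector, weights: Vector, log: List[str]) -> float:
    """Stub scoring head: a dot product between features and weights."""
    log.append("score")
    return sum(f * w for f, w in zip(features, weights))


def pixel_space_verify(seed: int, weights: Vector) -> Tuple[float, List[str]]:
    """MLLM-style path: generate, decode, re-encode, then score."""
    log: List[str] = []
    out = generate(seed, log)
    pixels = decode_to_pixels(out["latent"], log)
    embedding = encode_for_mllm(pixels, log)
    return score(embedding, weights, log), log


def hidden_state_verify(seed: int, weights: Vector) -> Tuple[float, List[str]]:
    """Latent path: score the generator's hidden state directly."""
    log: List[str] = []
    out = generate(seed, log)
    return score(out["hidden"], weights, log), log


if __name__ == "__main__":
    print(pixel_space_verify(3, WEIGHTS)[1])
    print(hidden_state_verify(3, WEIGHTS)[1])
```

The stubs say nothing about verification quality. They only show that the latent path skips the decode and re-encode stages for every candidate.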

Performance Improvements

Our results show that VHS performs well under tiny inference budgets, where only a small number of candidates are evaluated per prompt. Relative to an MLLM-based verifier, VHS makes inference-time scaling more efficient on three measures:

  • Joint generation-and-verification time is reduced by 63.3%.
  • Compute FLOPs are decreased by 51%.
  • VRAM usage is cut down by 14.5%.

Alongside these savings, VHS achieves a +2.7% improvement on GenEval at the same inference-time budget.
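To read the reported percentages as ratios, a reduction of r% leaves a fraction 1 - r/100 of the original cost, which corresponds to an improvement factor of 1 / (1 - r/100). The sketch below applies this to the three figures above. The helper name `improvement_factor` is hypothetical, and the calculation assumes each percentage is measured against the same baseline.

```python
def improvement_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction in cost into a baseline-to-new cost ratio."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    remaining = 1 - reduction_pct / 100  # fraction of the baseline cost still paid
    return 1 / remaining


# Percentage reductions as reported in the summary above.
REPORTED = {
    "joint generation-and-verification time": 63.3,
    "compute FLOPs": 51.0,
    "VRAM usage": 14.5,
}

if __name__ == "__main__":
    for name, pct in REPORTED.items():
        print(f"{name}: {improvement_factor(pct):.2f}x lower")
```

The 63.3% time reduction works out to roughly a 2.7x lower joint time per candidate, the 51% FLOPs reduction to about 2x, and the 14.5% VRAM reduction to about 1.17x. If cost grows linearly with the number of candidates (an assumption, not a claim from the paper), the time figure suggests that a fixed time budget could cover roughly 2.7 times as many candidates.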

Conclusion

VHS shows that verification for inference-time scaling does not require pixel-space decoding. Scoring candidates directly from the generator's hidden states lowers the cost of verification and, in the reported results, also improves GenEval performance at the same inference-time budget. The results summarized here concern image generation with DiT single-step generators; the approach may prove useful for other efficient generative modeling settings, but that remains to be shown.


Lazarus Omolua