DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
arXiv:2604.15750v1
Abstract
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks.
Introduction
However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and they use conservative confidence-based parallel decoding to avoid conflicts. Both choices restrict the quality-speed trade-off that can be achieved in DLM inference. In this paper, we introduce DepCap, a training-free framework for efficient block-wise DLM inference.
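To make the baseline concrete, the following is a minimal sketch of the kind of decoding described above: a fixed block schedule combined with confidence-thresholded parallel decoding. It is not taken from the paper. The `Predictor` interface, the `toy_predict` stand-in, and the `threshold` parameter are all assumptions introduced for illustration.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet

# Hypothetical model interface: given the partially decoded sequence, return
# one (token, confidence) pair per position.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def decode_fixed_blocks(predict: Predictor, length: int, block_size: int,
                        threshold: float) -> Tuple[List[Optional[str]], int]:
    """Decode left to right in fixed-size blocks; return (sequence, steps)."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    for start in range(0, length, block_size):
        block = range(start, min(start + block_size, length))
        while any(seq[i] is MASK for i in block):
            preds = predict(seq)
            masked = [i for i in block if seq[i] is MASK]
            # Commit, in parallel, every masked position that clears the threshold.
            chosen = [i for i in masked if preds[i][1] >= threshold]
            if not chosen:
                # Guarantee progress: commit the single most confident position.
                chosen = [max(masked, key=lambda i: preds[i][1])]
            for i in chosen:
                seq[i] = preds[i][0]
            steps += 1
    return seq, steps


def toy_predict(seq):
    """Toy stand-in for a DLM: a position is confident once its left neighbour is decoded."""
    return [(f"t{i}", 0.95 if i == 0 or seq[i - 1] is not MASK else 0.5)
            for i in range(len(seq))]


strict_seq, strict_steps = decode_fixed_blocks(toy_predict, 8, 4, threshold=0.9)
loose_seq, loose_steps = decode_fixed_blocks(toy_predict, 8, 4, threshold=0.4)
print(strict_steps, loose_steps)  # 8 2
```

In this toy setting the strict threshold commits one token per step, while the loose threshold commits a whole block per step. With a real model, lowering the threshold risks committing mutually inconsistent tokens, which is why such schemes tend to be conservative.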
Key Innovations of DepCap
- Adaptive Block Extension: DepCap uses the influence of the last decoded block to adaptively determine how far the next block should extend, rather than relying on a fixed block schedule.
- Conflict-Free Token Identification: The framework identifies a conflict-free subset of tokens for safe parallel decoding within each block, accelerating inference without significant quality degradation.
- Plug-and-Play Compatibility: DepCap is designed to be easily integrated into various DLM architectures and is compatible with existing key-value (KV) cache strategies.
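As an illustration of what selecting a conflict-free subset could look like, here is a hedged sketch that greedily keeps high-confidence positions and skips any position that depends too strongly on one already kept. The summary does not specify DepCap's actual conflict criterion, so the pairwise `dependency` scores, the `tolerance` parameter, and the greedy rule are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def conflict_free_subset(confidence: Dict[int, float],
                         dependency: Dict[Tuple[int, int], float],
                         tolerance: float) -> List[int]:
    """Greedily pick positions whose pairwise dependency stays within tolerance.

    confidence: hypothetical per-position confidence scores.
    dependency: hypothetical symmetric scores keyed by (smaller, larger) position;
                missing pairs are treated as independent (score 0.0).
    """
    def dep(a: int, b: int) -> float:
        return dependency.get((min(a, b), max(a, b)), 0.0)

    selected: List[int] = []
    # Visit positions from most to least confident; break ties by position.
    for pos in sorted(confidence, key=lambda p: (-confidence[p], p)):
        if all(dep(pos, kept) <= tolerance for kept in selected):
            selected.append(pos)
    return sorted(selected)


confidence = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.6}
dependency = {(0, 1): 0.5, (1, 2): 0.05, (2, 3): 0.4}
print(conflict_free_subset(confidence, dependency, tolerance=0.1))  # [0, 2]
```

Positions 1 and 3 are deferred to a later step because each depends strongly on a position that was already selected. Raising `tolerance` to 1.0 would select all four positions at once.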
Information-Theoretic Analysis
Our analysis suggests that the cumulative influence of the last decoded block on a candidate block is approximately additive across tokens. This finding supports the proposed block-partitioning criterion on which the adaptive mechanism relies.
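If the influence is approximately additive, the influence on a candidate block of any length can be estimated as a running sum of per-token scores, which makes a boundary search cheap. The sketch below shows one possible rule under that assumption: take the shortest candidate block whose cumulative influence reaches a target share of the total over a lookahead window. The `coverage` and `min_len` parameters and the rule itself are hypothetical and should not be read as the paper's criterion.

```python
from itertools import accumulate
from typing import Sequence


def next_block_length(influence: Sequence[float], coverage: float,
                      min_len: int = 1) -> int:
    """Return a block length for the next block from per-token influence scores.

    influence: hypothetical per-token influence of the last decoded block on
               each position of the lookahead window, in left-to-right order.
    coverage:  hypothetical target share (0 to 1) of the window's total influence.
    """
    if not influence:
        return 0
    total = sum(influence)
    if total <= 0:
        # No usable signal: fall back to the minimum block length.
        return min(min_len, len(influence))
    target = coverage * total
    # Additivity assumption: block influence is the prefix sum of token scores.
    for length, cumulative in enumerate(accumulate(influence), start=1):
        if length >= min_len and cumulative >= target:
            return length
    return len(influence)


scores = [4.0, 3.0, 2.0, 1.0]
print(next_block_length(scores, coverage=0.5))   # 2
print(next_block_length(scores, coverage=1.0))   # 4
```

Because the estimate is a prefix sum, every candidate boundary in the window is scored in a single pass rather than by re-evaluating each candidate block separately.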
Experimental Results
Experiments show that DepCap achieves favorable speed-quality trade-offs across multiple DLM backbones and a range of reasoning and coding benchmarks. Notably, the framework delivers up to a 5.63x inference speedup without significant performance degradation.
Conclusion
In conclusion, DepCap addresses key limitations of existing block-wise DLM decoding methods. By using the influence of the last decoded block to determine block boundaries and by enabling conflict-free parallel decoding, it improves the balance between speed and quality in DLM inference. This work may help make DLMs more efficient in practical applications.
