FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
arXiv:2505.20353v3
Abstract
Diffusion Transformers (DiT) are powerful generative models, but they remain computationally intensive: inference is iterative, and every denoising step runs a deep transformer stack, which drives up resource consumption.
To address these inefficiencies, we introduce FastCache, a hidden-state-level caching and compression framework. FastCache accelerates DiT inference by exploiting redundancy in the model's internal representations.
Key Features of FastCache
FastCache combines two mechanisms:
- Spatial-aware Token Selection: This mechanism adaptively filters redundant tokens based on hidden-state saliency, so that tokens judged redundant are not processed in full.
- Transformer-level Cache: This mechanism reuses latent activations across timesteps, skipping recomputation when the change between timesteps falls below a predefined threshold.
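The two mechanisms above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of squared per-token change as the saliency score, the relative-change criterion, and the parameters `saliency_threshold`, `change_threshold`, `weight`, and `bias` are all hypothetical. The linear map standing in for the skipped block follows the "learnable linear approximation" named in the title, but its exact form here is assumed.

```python
import numpy as np


def select_salient_tokens(h_t, h_prev, saliency_threshold):
    """Return a boolean mask marking tokens whose hidden state changed noticeably.

    h_t, h_prev: arrays of shape (num_tokens, hidden_dim).
    Assumption: saliency is the squared L2 change of each token between timesteps.
    """
    saliency = np.sum((h_t - h_prev) ** 2, axis=-1)
    return saliency > saliency_threshold


def cached_block(block, weight, bias, h_t, h_prev, change_threshold):
    """Run `block` unless its input barely changed since the previous timestep.

    Returns (output, reused). `reused` is True when the cheap linear map
    (h_t @ weight + bias) stood in for the full transformer block.
    Assumption: "change" is the relative Frobenius-norm difference.
    """
    rel_change = np.linalg.norm(h_t - h_prev) / (np.linalg.norm(h_prev) + 1e-12)
    if rel_change <= change_threshold:
        return h_t @ weight + bias, True
    return block(h_t), False
```

In this sketch, `block` is any callable that maps hidden states to hidden states; a real pipeline would pass the transformer layer and learned `weight` and `bias` values.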
Performance and Evaluation
Together, these modules reduce computation while preserving the fidelity of generated outputs. Theoretical analysis shows that, under a hypothesis-testing-based decision rule, FastCache keeps the approximation error bounded, so the speedup comes with a controlled deviation from the full computation.
Experiments across multiple DiT variants show substantial reductions in both latency and memory usage. Among the caching methods compared, FastCache achieves the best generation quality, as measured by Fréchet Inception Distance (FID) and temporal Fréchet Inception Distance (t-FID).
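To make the idea of a hypothesis-testing-based decision rule concrete, the following sketch tests whether the change between two hidden states is statistically indistinguishable from noise. It is one possible form of such a rule, not the rule from the paper: the null model (element-wise differences are i.i.d. Gaussian with known standard deviation `noise_std`), the function names, and the significance level `alpha` are assumptions made for illustration. The chi-square quantile uses the Wilson-Hilferty approximation so that only the standard library and NumPy are needed.

```python
from statistics import NormalDist

import numpy as np


def chi2_quantile(df, p):
    """Approximate the p-quantile of a chi-square distribution (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * np.sqrt(c)) ** 3


def should_reuse_cache(h_t, h_prev, noise_std, alpha=0.05):
    """Decide whether the cached activation may be reused.

    Null hypothesis (assumed): h_t - h_prev is i.i.d. N(0, noise_std**2),
    so the scaled sum of squares follows a chi-square law with one degree
    of freedom per element. Reuse the cache when the test does not reject.
    """
    statistic = np.sum((h_t - h_prev) ** 2) / noise_std**2
    threshold = chi2_quantile(h_t.size, 1.0 - alpha)
    return bool(statistic <= threshold)
```

Under this null model, `alpha` is the probability of recomputing when the state has not really changed; a smaller `alpha` raises the threshold and reuses the cache more often.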
Token Merging Module
To increase the speedup further, we also introduce a token merging module. This module merges redundant tokens based on k-nearest neighbor (k-NN) density.
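A minimal sketch of density-based merging follows, assuming a particular definition that the text does not confirm: a token's density is `exp(-mean squared distance to its k nearest neighbours)`, the `n_merge` densest tokens are treated as redundant, and each is folded into its nearest surviving token by a running mean. The function names and parameters are hypothetical.

```python
import numpy as np


def knn_density(tokens, k):
    """Density score per token: high when its k nearest neighbours are close."""
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist_sq = np.sum(diff**2, axis=-1)
    np.fill_diagonal(dist_sq, np.inf)  # a token is not its own neighbour
    nearest = np.sort(dist_sq, axis=1)[:, :k]
    return np.exp(-nearest.mean(axis=1))


def merge_redundant_tokens(tokens, k, n_merge):
    """Fold the n_merge densest tokens into their nearest surviving token.

    Returns (merged_tokens, kept_idx), where merged_tokens has shape
    (num_tokens - n_merge, hidden_dim) and kept_idx lists the surviving
    original positions in ascending order.
    """
    density = knn_density(tokens, k)
    order = np.argsort(-density, kind="stable")  # densest first, ties by index
    merged_idx = order[:n_merge]
    kept_idx = np.sort(order[n_merge:])
    kept = tokens[kept_idx].astype(float)  # astype returns a copy
    counts = np.ones(len(kept_idx))
    for i in merged_idx:
        # Nearest surviving token, measured against its current running mean.
        j = int(np.argmin(np.sum((kept - tokens[i]) ** 2, axis=1)))
        kept[j] = (kept[j] * counts[j] + tokens[i]) / (counts[j] + 1.0)
        counts[j] += 1.0
    return kept, kept_idx
```

The pairwise distance matrix costs memory quadratic in the number of tokens, which is acceptable for a sketch but would need a batched or approximate neighbour search at realistic sequence lengths.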
Conclusion
FastCache reduces the inference cost of Diffusion Transformers by combining hidden-state caching with token selection and merging, while keeping the approximation error bounded. The code is available at https://github.com/NoakLiu/FastCache-xDiT.
