Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Summary: arXiv:2505.19261v2
Abstract
Current text-to-image diffusion generation typically employs complete-text conditioning. Because complete-text captions have intricate syntax, diffusion transformers (DiTs) inherently struggle to comprehend them. Feeding the complete text in a single pass either overlooks critical semantic details or causes semantic confusion, because diverse semantic primitive types are modeled simultaneously. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST.
Introduction to DiT-ST
This framework converts a complete-text caption into a split-text caption, which is a collection of simplified sentences designed to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner.
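As a rough illustration of what a split-text caption might look like as a data structure, the sketch below groups simplified sentences by primitive type and returns them in a fixed hierarchical order. The type names (`object`, `relation`, `attribute`), their order, the `SplitTextCaption` class, and the example sentences are all assumptions made for illustration; the source does not specify them, and the hand-written sentences stand in for the output of a parser.

```python
from dataclasses import dataclass, field

# Hypothetical primitive types and hierarchical order. The source only says
# "various semantic primitives and their interconnections".
PRIMITIVE_ORDER = ("object", "relation", "attribute")


@dataclass
class SplitTextCaption:
    """Simplified sentences grouped by the primitive type they express."""

    sentences: dict = field(
        default_factory=lambda: {p: [] for p in PRIMITIVE_ORDER}
    )

    def add(self, primitive_type, sentence):
        if primitive_type not in self.sentences:
            raise ValueError(f"unknown primitive type: {primitive_type}")
        self.sentences[primitive_type].append(sentence)

    def ordered(self):
        """Yield (type, sentence) pairs in hierarchical order."""
        for primitive_type in PRIMITIVE_ORDER:
            for sentence in self.sentences[primitive_type]:
                yield primitive_type, sentence


# Hand-written stand-in for a parsed version of the complete-text caption
# "A small red bird sits on a mossy branch."
caption = SplitTextCaption()
caption.add("attribute", "The bird is small and red.")
caption.add("object", "There is a bird.")
caption.add("object", "There is a branch.")
caption.add("relation", "The bird sits on the branch.")
caption.add("attribute", "The branch is mossy.")

for primitive_type, sentence in caption.ordered():
    print(f"{primitive_type:>9}: {sentence}")
```

Sentences can be added in any order; `ordered()` always returns them grouped by type, which is the property a staged injection scheme would rely on.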
Key Features of DiT-ST
- Large Language Model Integration: DiT-ST uses large language models (LLMs) to parse captions, extract diverse semantic primitives, sort them hierarchically, and assemble them into a split-text input.
- Partitioned Denoising Process: The diffusion denoising process is partitioned into stages according to how sensitive each stage is to the different semantic primitive types.
- Incremental Token Injection: The framework determines appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens using cross-attention.
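The partitioning and incremental injection described in the list above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the primitive type names, the injection thresholds in `INJECTION_START`, and the helper names `active_types`, `cross_attention`, and `condition` are hypothetical, and the attention omits learned projections and multiple heads. It only shows the core idea that the set of text tokens visible to cross-attention grows as denoising proceeds.

```python
import math

# Hypothetical injection points, as fractions of the denoising trajectory
# (0.0 = first denoising step). The source does not give these values.
INJECTION_START = {"object": 0.0, "relation": 0.3, "attribute": 0.6}


def active_types(step, num_steps, injection_start=INJECTION_START):
    """Return the primitive types whose tokens are visible at this step."""
    progress = step / num_steps
    return [name for name, start in injection_start.items() if progress >= start]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def condition(image_tokens, text_tokens_by_type, step, num_steps):
    """Attend image tokens to the text tokens injected so far."""
    context = []
    for name in active_types(step, num_steps):
        context.extend(text_tokens_by_type[name])
    return cross_attention(image_tokens, context, context)


# Toy 2-dimensional embeddings, one text token per primitive type.
text_tokens = {
    "object": [[1.0, 2.0]],
    "relation": [[3.0, 4.0]],
    "attribute": [[5.0, 6.0]],
}
image_tokens = [[1.0, 0.0]]

for step in (0, 4, 9):
    print(step, active_types(step, 10), condition(image_tokens, text_tokens, step, 10))
```

With a single visible token the attention output is that token itself; as more primitive types are injected, the output becomes a weighted mixture over a larger context.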
Benefits of Split-Text Conditioning
With split-text conditioning, DiT-ST enhances the representation learning of specific semantic primitive types at the denoising stages most sensitive to them. The approach is intended to make generated images more faithful to the caption and to keep critical semantic details from being overlooked.
Experimental Validation
Extensive experiments validate the effectiveness of the proposed DiT-ST framework. The results indicate that it mitigates the complete-text comprehension defect of DiTs compared with complete-text conditioning.
Conclusion
DiT-ST addresses the limitations of complete-text conditioning in text-to-image generation. By splitting captions into simplified sentences and injecting them stage by stage, it gives diffusion transformers a more fine-grained form of text conditioning and improves how they handle syntactically complex captions.
