Boost Text-to-Image Diffusion with Split-Text Conditioning

Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Summary: arXiv:2505.19261v2

Abstract

Current text-to-image diffusion generation typically employs complete-text conditioning. Because complete-text captions have intricate syntax, diffusion transformers (DiTs) inherently struggle to comprehend them. Feeding the complete text in a single pass either overlooks critical semantic details or causes semantic confusion, because diverse types of semantic primitives are modeled simultaneously. To mitigate this defect of DiTs, the authors propose a split-text conditioning framework named DiT-ST.

Introduction to DiT-ST

This framework converts a complete-text caption into a split-text caption, which is a collection of simplified sentences designed to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner.
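As a rough illustration of the data structure this implies, the sketch below represents a split-text caption as short sentences grouped by primitive type and returned in a fixed hierarchical order. The type names (object, relation, attribute), their order, the SplitTextCaption class, and the example sentences are all assumptions made for illustration; this summary does not specify the paper's taxonomy or caption format.

```python
from dataclasses import dataclass, field

# Hypothetical primitive types in an assumed hierarchical order.
PRIMITIVE_ORDER = ("object", "relation", "attribute")


@dataclass
class SplitTextCaption:
    """A complete caption plus simplified sentences grouped by primitive type."""

    complete_text: str
    sentences: dict[str, list[str]] = field(default_factory=dict)

    def add(self, primitive_type: str, sentence: str) -> None:
        """Attach one simplified sentence to a primitive-type group."""
        if primitive_type not in PRIMITIVE_ORDER:
            raise ValueError(f"unknown primitive type: {primitive_type}")
        self.sentences.setdefault(primitive_type, []).append(sentence)

    def hierarchy(self) -> list[tuple[str, list[str]]]:
        """Return the non-empty groups in the assumed hierarchical order."""
        return [(t, self.sentences[t]) for t in PRIMITIVE_ORDER if t in self.sentences]


# Hand-written example; the insertion order does not affect the hierarchy.
caption = SplitTextCaption("A small red bird sits on a mossy branch.")
caption.add("relation", "The bird sits on the branch.")
caption.add("object", "There is a bird.")
caption.add("object", "There is a branch.")
caption.add("attribute", "The bird is small and red.")
caption.add("attribute", "The branch is mossy.")

for primitive_type, group in caption.hierarchy():
    print(primitive_type, group)
```

Keeping the groups separate, rather than concatenating them back into one string, is what allows each group to be handed to a different denoising stage later on.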

Key Features of DiT-ST

Large Language Model Integration: DiT-ST uses large language models to parse captions, extract the diverse semantic primitives, and sort and assemble them hierarchically into a split-text input.
Partitioned Denoising Process: The diffusion denoising process is partitioned according to how sensitive each stage is to the different semantic primitive types.
Incremental Token Injection: The framework determines appropriate timesteps at which to incrementally inject tokens of each semantic primitive type into the input tokens via cross-attention.
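
The partitioned, incremental injection described above can be sketched as a schedule that maps each denoising step to the token groups available for cross-attention. The stage boundaries (0.0, 0.3, 0.6 of denoising progress), the order in which primitive types are introduced, and every name in the code are hypothetical. The paper determines the actual partition and injection timesteps, which this summary does not give.

```python
# Hypothetical schedule: (fraction of denoising progress at which a group is
# first injected, primitive type). Values are illustrative, not from the paper.
INJECTION_SCHEDULE = [(0.0, "object"), (0.3, "relation"), (0.6, "attribute")]


def active_types(step: int, num_steps: int) -> list[str]:
    """Primitive types whose tokens have been injected by denoising step `step`.

    `step` counts from 0 (noisiest) to num_steps - 1 (cleanest).
    """
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    progress = step / num_steps
    return [ptype for start, ptype in INJECTION_SCHEDULE if progress >= start]


def conditioning_tokens(step: int, num_steps: int, tokens_by_type: dict) -> list:
    """Concatenate the token lists of all groups active at this step."""
    tokens = []
    for ptype in active_types(step, num_steps):
        tokens.extend(tokens_by_type.get(ptype, []))
    return tokens


# Strings stand in for the text-encoder embeddings a real model would use.
tokens_by_type = {
    "object": ["bird", "branch"],
    "relation": ["sits_on"],
    "attribute": ["small", "red", "mossy"],
}

for step in (0, 4, 8):
    print(step, conditioning_tokens(step, 10, tokens_by_type))
```

In a real DiT the returned list would be a tensor of text embeddings serving as keys and values in the cross-attention layers. Because the set of active groups only grows as the step index increases, the conditioning becomes richer as denoising proceeds.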

Benefits of Split-Text Conditioning

By using split-text conditioning, DiT-ST enhances the representation learning of specific semantic primitive types across different stages of the diffusion process. The aim is for generated images to follow the caption more faithfully and for critical semantic details to be preserved rather than overlooked.

Experimental Validation

According to the abstract, extensive experiments validate the effectiveness of DiT-ST in mitigating the complete-text comprehension defect of DiTs.

Conclusion

DiT-ST addresses the limitations of complete-text conditioning by splitting the caption into simplified sentences and injecting them at different denoising stages. This gives diffusion transformers a more structured form of text conditioning for image synthesis.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
