Boost Text-to-Image Diffusion with Split-Text Conditioning



Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Summary: arXiv:2505.19261v2

Abstract

Current text-to-image diffusion models typically condition generation on the complete caption. Because captions can have intricate syntax, diffusion transformers (DiTs) inherently struggle to comprehend complete-text captions. Feeding the whole caption in a single pass either overlooks critical semantic details or causes semantic confusion, because diverse types of semantic primitives are modeled simultaneously. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST.

Introduction to DiT-ST

This framework converts a complete-text caption into a split-text caption, which is a collection of simplified sentences designed to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner.
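To make the idea of a split-text caption concrete, here is a minimal sketch of one possible container for it. The primitive type names ("object", "relation", "attribute"), their ordering, and the class itself are assumptions made for illustration; the summary above does not name the primitive types or describe the paper's actual data structure, and the example sentences are written by hand rather than produced by a language model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical primitive types and hierarchy; not taken from the source.
PRIMITIVE_ORDER = ("object", "relation", "attribute")


@dataclass
class SplitTextCaption:
    """Simplified sentences grouped by (assumed) semantic primitive type."""

    sentences: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, primitive_type: str, sentence: str) -> None:
        """Store one simplified sentence under its primitive type."""
        if primitive_type not in PRIMITIVE_ORDER:
            raise ValueError(f"unknown primitive type: {primitive_type!r}")
        self.sentences.setdefault(primitive_type, []).append(sentence)

    def ordered(self) -> List[Tuple[str, str]]:
        """Return (type, sentence) pairs sorted by the assumed hierarchy."""
        return [
            (ptype, sentence)
            for ptype in PRIMITIVE_ORDER
            for sentence in self.sentences.get(ptype, [])
        ]


# Hand-written split of: "A black cat sits on a red sofa."
example = SplitTextCaption()
example.add("attribute", "The cat is black.")
example.add("object", "There is a cat.")
example.add("object", "There is a sofa.")
example.add("relation", "The cat sits on the sofa.")
example.add("attribute", "The sofa is red.")

for ptype, sentence in example.ordered():
    print(f"{ptype:>9}: {sentence}")
```

Grouping by type rather than keeping a flat list makes it straightforward to hand different groups to different denoising stages later on.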

Key Features of DiT-ST

  • Large Language Model Integration: DiT-ST uses Large Language Models to parse captions, extract diverse semantic primitives, and sort them hierarchically to construct the split-text input.
  • Partitioned Denoising Process: The diffusion denoising process is partitioned into stages according to how sensitive each stage is to the different semantic primitive types.
  • Incremental Token Injection: The framework determines appropriate timesteps at which to incrementally inject tokens of each semantic primitive type into the input tokens via cross-attention.
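The partitioning and incremental injection described above can be sketched as a simple schedule. This is only an illustration under stated assumptions: the primitive type names, the threshold values in `INJECTION_START`, and the helper functions are hypothetical and are not the stages or timesteps determined in the paper. In a real model the selected tokens would be encoded and used as the keys and values of a cross-attention layer, which is not shown here.

```python
from typing import Dict, List, Sequence

# Hypothetical plan: the fraction of denoising progress at which each
# primitive type starts being injected. Names and values are assumptions.
INJECTION_START: Dict[str, float] = {
    "object": 0.0,
    "relation": 0.3,
    "attribute": 0.6,
}


def denoising_progress(step: int, num_steps: int) -> float:
    """Map a sampler step index (0 = noisiest) to progress in [0, 1]."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    return step / (num_steps - 1)


def active_types(step: int, num_steps: int) -> List[str]:
    """Primitive types whose tokens are injected at this step."""
    progress = denoising_progress(step, num_steps)
    return [name for name, start in INJECTION_START.items() if progress >= start]


def conditioning_tokens(
    step: int, num_steps: int, tokens_by_type: Dict[str, Sequence[str]]
) -> List[str]:
    """Concatenate the tokens of every active type, in plan order."""
    selected: List[str] = []
    for name in active_types(step, num_steps):
        selected.extend(tokens_by_type.get(name, ()))
    return selected


tokens = {
    "object": ["cat", "sofa"],
    "relation": ["sits", "on"],
    "attribute": ["black", "red"],
}
for step in (0, 5, 10):
    print(step, conditioning_tokens(step, 11, tokens))
```

Because each type stays active once its threshold is passed, the conditioning set only grows as denoising proceeds, which is one way to read "incremental" injection.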

Benefits of Split-Text Conditioning

With split-text conditioning, DiT-ST enhances the representation learning of specific semantic primitive types at the denoising stages most sensitive to them. The aim is for generated images to follow the caption more faithfully, with critical semantic details preserved rather than overlooked or confused.

Experimental Validation

The authors report extensive experiments validating the effectiveness of the DiT-ST framework in mitigating the complete-text comprehension defect, compared with complete-text conditioning.

Conclusion

DiT-ST targets a specific limitation of complete-text conditioning in text-to-image generation. By splitting the caption into simplified sentences and injecting them stage by stage, it gives diffusion transformers a more structured form of text conditioning for image synthesis.



Lazarus Omolua
https://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.
