Improving Compositional Image Synthesis with Scaling

Compositional Image Synthesis with Inference-Time Scaling

Source: arXiv:2510.24133v2

Abstract: Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality.

In recent years, text-to-image generation has made significant strides in producing visually appealing and realistic images. However, one critical area of concern remains: compositionality. Current models often falter in accurately representing the number of objects, their attributes, and spatial relationships within a scene. This limitation hinders their usability in applications requiring precise visual representations.

Introducing a Novel Framework

To tackle these challenges, our research introduces a novel framework that enhances the compositional capabilities of text-to-image models by leveraging two main components:

Object-Centric Approach: By focusing on individual objects, our framework ensures that each element in the generated image is treated with precision, allowing for better adherence to the input prompts.
Self-Refinement Mechanism: This mechanism iteratively refines the output, enhancing layout accuracy while maintaining visual quality.

Leveraging Large Language Models

At the core of our framework are large language models (LLMs), which synthesize explicit layouts from input prompts. The process begins with the LLM generating a structured layout that outlines the desired composition of the image.
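As an illustration, a structured layout of this kind could be represented as a list of labeled bounding boxes. The sketch below is only one plausible encoding: the `LayoutObject` class, the JSON schema, the normalized coordinates, and the hand-written example response are all assumptions made for this post, not the format used by the framework.

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LayoutObject:
    """One object in a layout (hypothetical schema, not the paper's)."""
    label: str                                            # object category, e.g. "apple"
    box: Tuple[float, float, float, float]                # (x0, y0, x1, y1), normalized to [0, 1]
    attributes: List[str] = field(default_factory=list)   # e.g. ["red"]


def parse_layout(llm_output: str) -> List[LayoutObject]:
    """Parse a JSON layout string and reject malformed boxes."""
    objects = []
    for item in json.loads(llm_output):
        x0, y0, x1, y1 = item["box"]
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {item['label']}: {item['box']}")
        objects.append(
            LayoutObject(item["label"], (x0, y0, x1, y1), item.get("attributes", []))
        )
    return objects


# Hand-written stand-in for an LLM response to the prompt
# "two red apples to the left of a green bottle" (assumed format).
EXAMPLE_LLM_OUTPUT = """
[
  {"label": "apple",  "attributes": ["red"],   "box": [0.05, 0.55, 0.25, 0.80]},
  {"label": "apple",  "attributes": ["red"],   "box": [0.28, 0.55, 0.48, 0.80]},
  {"label": "bottle", "attributes": ["green"], "box": [0.65, 0.30, 0.85, 0.85]}
]
"""

layout = parse_layout(EXAMPLE_LLM_OUTPUT)
for obj in layout:
    print(obj.label, obj.attributes, obj.box)
```

Validating the boxes before generation is a design choice in this sketch: a malformed layout is cheaper to reject here than after images have been rendered from it.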

Once the layout is established, it is integrated into the image generation process. Here, an object-centric vision-language model (VLM) is employed to evaluate multiple candidate images. The VLM iteratively reranks these candidates, selecting the image that best aligns with the prompt. This dual approach of layout grounding and self-refinement significantly enhances the fidelity of the generated images to the original text prompts.
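The candidate-and-rerank loop described above can be sketched roughly as follows. This is a simplified illustration, not the framework's implementation: `rerank_candidates`, its parameters, and the early-stopping threshold are hypothetical, and `toy_generate` and `toy_score` are placeholders standing in for a layout-conditioned image generator and a VLM-based scorer.

```python
import random
from typing import Callable, Optional, Tuple


def rerank_candidates(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
    num_rounds: int = 3,
    target_score: float = 1.0,
    seed: int = 0,
) -> Tuple[Optional[str], float]:
    """Generate candidates over several rounds and keep the best-scoring one."""
    rng = random.Random(seed)
    best_image, best_score = None, float("-inf")
    for _ in range(num_rounds):
        # Each round samples a fresh batch of candidates with new seeds.
        seeds = [rng.randrange(2**31) for _ in range(num_candidates)]
        for s in seeds:
            image = generate(prompt, s)
            value = score(prompt, image)
            if value > best_score:
                best_image, best_score = image, value
        # Stop early once a candidate is judged good enough (assumed criterion).
        if best_score >= target_score:
            break
    return best_image, best_score


def toy_generate(prompt: str, seed: int) -> str:
    """Placeholder generator: returns an identifier instead of an image."""
    return f"image-{seed}"


def toy_score(prompt: str, image: str) -> float:
    """Placeholder scorer in [0, 1); a real system would query a VLM here."""
    return (int(image.split("-")[1]) % 100) / 100.0


best_image, best_score = rerank_candidates(
    "two red apples to the left of a green bottle", toy_generate, toy_score
)
print(best_image, best_score)
```

Because compute is spent on extra candidates and scoring at inference time rather than on training, a loop of this shape trades generation cost for prompt alignment: more rounds or more candidates per round can only keep or raise the best score found.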

Results and Comparisons

We conducted extensive evaluations to compare our proposed framework with existing state-of-the-art text-to-image models. The results demonstrate that our approach achieves:

Improved scene alignment with input prompts.
Higher accuracy in object counts and attributes.
More accurate spatial relationships between objects.

These findings underscore the potential of our framework to bridge the gap in compositionality, providing a more reliable solution for applications that require precise visual interpretations of textual descriptions.
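To make the criteria listed above concrete, the sketch below shows one simple way object counts and a left-of relation could be checked against detected boxes. The source does not describe its evaluation protocol, so the functions `counts_match` and `is_left_of`, the center-based definition of "left of", and the sample detections are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized


def counts_match(expected: Dict[str, int], detected_labels: List[str]) -> bool:
    """True if every expected label is detected exactly the expected number of times."""
    found = Counter(detected_labels)
    return all(found[label] == n for label, n in expected.items())


def is_left_of(a: Box, b: Box) -> bool:
    """True if the horizontal center of box a lies left of the center of box b."""
    return (a[0] + a[2]) / 2 < (b[0] + b[2]) / 2


# Made-up detections for the prompt "two red apples to the left of a green bottle".
detections = [
    ("apple", (0.05, 0.55, 0.25, 0.80)),
    ("apple", (0.28, 0.55, 0.48, 0.80)),
    ("bottle", (0.65, 0.30, 0.85, 0.85)),
]
labels = [label for label, _ in detections]
apples = [box for label, box in detections if label == "apple"]
bottle = next(box for label, box in detections if label == "bottle")

print(counts_match({"apple": 2, "bottle": 1}, labels))
print(all(is_left_of(a, bottle) for a in apples))
```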

Conclusion

In conclusion, our training-free framework for compositional image synthesis represents a significant advancement in the field of text-to-image generation. By integrating an object-centric approach with self-refinement and leveraging the capabilities of large language models, we have developed a system that not only enhances layout fidelity but also preserves aesthetic quality. The code for our framework is available on GitHub (ReFocus) for further exploration and application in future projects.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
