Evolutionary Token-Level Prompt Optimization for Diffusion Models
arXiv:2604.09861v1
Abstract
Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting.
Introduction
Text-to-image diffusion models have become a central tool in generative art and machine learning. A persistent challenge for practitioners is the sensitivity of these models to prompt wording: slight variations in the input prompt can drastically change the quality of the generated image, which forces a labor-intensive process of trial and error.
Research Motivation
Automated prompt optimization is meant to remove this trial and error from the image generation process. Conventional approaches rewrite the prompt text, which is inefficient and time-consuming when done by hand. This research instead proposes a Genetic Algorithm (GA) for prompt optimization, with the aim of improving the output of CLIP-based diffusion models.
Methodology
The approach evolves token vectors directly, rather than relying solely on text rewriting. The GA maximizes a fitness function that combines two criteria:
- Aesthetic Quality: Measured by the LAION Aesthetic Predictor V2, this criterion evaluates the visual appeal of the generated images.
- Prompt-Image Alignment: Assessed via CLIPScore, this metric determines how well the generated image aligns with the original prompt.
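The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The source does not specify how the two criteria are combined, how tokens are represented, or which GA operators are used, so the weighted sum in `combined_fitness`, the integer token ids, single-point crossover, per-token mutation, elitism, and every parameter value here are assumptions. `toy_fitness` is a hypothetical stand-in; a real fitness function would generate an image from the candidate tokens and score it with the LAION Aesthetic Predictor V2 and CLIPScore.

```python
import random
from typing import Callable, List, Sequence

Tokens = List[int]


def combined_fitness(aesthetic: float, alignment: float, w_aesthetic: float = 0.5) -> float:
    """Weighted sum of the two criteria (the weighting scheme is an assumption)."""
    return w_aesthetic * aesthetic + (1.0 - w_aesthetic) * alignment


def crossover(a: Sequence[int], b: Sequence[int], rng: random.Random) -> Tokens:
    """Single-point crossover; assumes both parents share the same length >= 2."""
    cut = rng.randrange(1, len(a))
    return list(a[:cut]) + list(b[cut:])


def mutate(tokens: Sequence[int], vocab_size: int, rate: float, rng: random.Random) -> Tokens:
    """Replace each token with a random vocabulary id with probability `rate`."""
    return [rng.randrange(vocab_size) if rng.random() < rate else t for t in tokens]


def evolve(
    seed: Sequence[int],
    fitness: Callable[[Sequence[int]], float],
    vocab_size: int,
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.1,
    elite: int = 2,
    rng_seed: int = 0,
) -> Tokens:
    """Evolve token sequences toward higher fitness, starting from `seed`."""
    rng = random.Random(rng_seed)
    # Initial population: the unmodified seed plus mutated copies of it.
    population = [list(seed)] + [
        mutate(seed, vocab_size, mutation_rate, rng) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Elitism: carry the best candidates over unchanged.
        next_gen = [list(t) for t in ranked[:elite]]
        # Truncation selection: breed only from the top half.
        parents = ranked[: max(2, pop_size // 2)]
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            next_gen.append(mutate(crossover(a, b, rng), vocab_size, mutation_rate, rng))
        population = next_gen
    return max(population, key=fitness)


# Made-up token ids standing in for a tokenized prompt.
SEED_TOKENS = [3, 17, 42, 8, 25, 11]


def toy_fitness(tokens: Sequence[int]) -> float:
    """Hypothetical placeholder scores; no image is generated here."""
    aesthetic = sum(t % 2 == 0 for t in tokens) / len(tokens)  # fake "aesthetic" proxy
    alignment = sum(t == s for t, s in zip(tokens, SEED_TOKENS)) / len(tokens)  # fake "alignment" proxy
    return combined_fitness(aesthetic, alignment)


best = evolve(SEED_TOKENS, toy_fitness, vocab_size=50)
print(best, toy_fitness(best))
```

Because the seed is part of the initial population and the top candidates are carried over unchanged, the returned sequence never scores below the original prompt under the chosen fitness. With a real image-based fitness, each evaluation requires a full generation pass, so scores would normally be cached rather than recomputed during sorting.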
Experimental Results
Experiments on 36 prompts from the Parti Prompts (P2) dataset indicate that the proposed GA-driven optimization significantly outperforms baseline techniques, including Promptist and random search, with fitness improvements of up to 23.93%.
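The summary does not state how the percentage is computed. One common reading, assumed here, is the relative improvement of the optimized fitness over a baseline fitness. The helper name `relative_gain` and the numbers below are illustrative only and are not results from the paper.

```python
def relative_gain(baseline: float, optimized: float) -> float:
    """Percentage improvement of `optimized` over `baseline` (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline fitness must be non-zero")
    return (optimized - baseline) / abs(baseline) * 100.0


# Illustrative values: a fitness of 4.0 rising to 5.0 is a 25% gain.
print(relative_gain(4.0, 5.0))
```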
Discussion
The findings suggest that the genetic algorithm approach not only enhances the quality of generated images but also provides a systematic way to explore the vast conditioning space within text-to-image models. The adaptability of this method to various image generation models with tokenized text encoders opens avenues for future research and application.
Limitations and Future Prospects
The proposed method shows promising results but has limitations. Its fitness function depends on specific aesthetic predictors, whose judgments may not generalize across all use cases. Future work could integrate more diverse image-quality metrics and extend the framework to other generative models.
Conclusion
Overall, evolutionary token-level prompt optimization automates a step of text-to-image generation that otherwise depends on manual trial and error. This lays the groundwork for more efficient and effective use of diffusion models by artists and developers alike.
