StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
arXiv:2510.20093v2
Abstract
Recent advances in diffusion models have significantly improved the quality of generated images; however, synthesizing pixel-based human-drawn sketches, a representative example of abstract expression, remains challenging. To address this, we propose StableSketcher, a novel framework that enables diffusion models to generate hand-drawn sketches with high prompt fidelity.
Key Features of StableSketcher
Within the StableSketcher framework, several critical components work in tandem to achieve improved sketch generation:
- Variational Autoencoder Fine-Tuning: We fine-tune the variational autoencoder to optimize latent decoding, enhancing its ability to capture the unique characteristics of sketches.
- Reinforcement Learning Integration: A new reward function for reinforcement learning, based on visual question answering, is integrated to improve text-image alignment and semantic consistency.
- Enhanced Stylization: In extensive experiments, StableSketcher generates sketches with improved stylistic fidelity and better prompt alignment than the Stable Diffusion baseline.
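To make the visual question answering reward in the list above more concrete, here is a minimal sketch of one way such a reward could be computed: the fraction of question-answer pairs that a VQA model answers correctly for a generated sketch. This is an illustration under stated assumptions, not the reward defined in the paper. The names `vqa_reward`, `normalize`, and `stub_vqa` are hypothetical, the exact-match scoring rule is an assumption, and a real setup would call an actual VQA model in place of the stub.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a VQA model maps (image, question) to a short answer.
VQAModel = Callable[[object, str], str]


def normalize(answer: str) -> str:
    """Lowercase and strip whitespace and trailing punctuation for lenient matching."""
    return answer.strip().lower().rstrip(".!?")


def vqa_reward(
    image: object,
    qa_pairs: Sequence[Tuple[str, str]],
    vqa_model: VQAModel,
) -> float:
    """Return the fraction of questions the VQA model answers as expected.

    Assumption: exact match after normalization. The scoring rule used by
    StableSketcher is not specified in this summary.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        normalize(vqa_model(image, question)) == normalize(expected)
        for question, expected in qa_pairs
    )
    return correct / len(qa_pairs)


def stub_vqa(image: object, question: str) -> str:
    """Stand-in for a real VQA model; returns canned answers for the demo."""
    canned = {"Is there a cat?": "Yes.", "How many wheels?": "two"}
    return canned.get(question, "unknown")


qa = [("Is there a cat?", "yes"), ("How many wheels?", "four")]
reward = vqa_reward(None, qa, stub_vqa)  # one of two answers matches
```

In a reinforcement learning loop, a scalar like `reward` would be attached to each generated sketch and used to update the diffusion model; the update rule itself is outside the scope of this sketch.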
Introduction of SketchDUO
To further support the development and evaluation of sketch generation, we introduce SketchDUO, which, to the best of our knowledge, is the first dataset comprising instance-level sketches paired with captions and question-answer pairs. SketchDUO addresses a limitation of existing datasets, which primarily rely on image-label pairs, and so offers a stronger basis for training and evaluating sketch generation models.
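As an illustration of what "sketches paired with captions and question-answer pairs" could look like in code, the following is a hypothetical record layout. The class names, field names, and file path are assumptions made for this example; the actual SketchDUO schema is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class QAPair:
    """One question about a sketch and its expected answer."""
    question: str
    answer: str


@dataclass
class SketchRecord:
    """Hypothetical instance-level entry: one sketch, one caption, many QA pairs."""
    sketch_path: str  # assumed: path to a pixel-based sketch image
    caption: str      # free-text description, richer than a class label
    qa_pairs: List[QAPair] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A record is usable here only if it has a caption and at least one QA pair."""
        return bool(self.caption.strip()) and len(self.qa_pairs) > 0


example = SketchRecord(
    sketch_path="sketches/0001.png",  # placeholder path, not a real dataset file
    caption="a cat sitting on a chair",
    qa_pairs=[QAPair("Is there a cat?", "yes")],
)
```

Compared with an image-label pair, a record of this shape carries both a caption to condition generation on and questions that can be used to check the generated result.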
Experimental Results
Extensive experiments show that StableSketcher generates sketches with improved stylistic fidelity and better alignment with the given prompts than the Stable Diffusion baseline. The visual question answering feedback acts as the reward signal that steers generated sketches toward the intended semantic meaning of the prompts while preserving their sketch style.
Conclusion
In conclusion, StableSketcher advances sketch generation with diffusion models. Combining variational autoencoder fine-tuning with reinforcement learning driven by visual question answering feedback improves both the quality and the prompt fidelity of generated sketches. The code and dataset will be made publicly available upon acceptance.
Project Page
For more information, please visit our project page: StableSketcher Project Page.
