Draw-In-Mind: Enhancing Image Editing with Unified AI Models

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Source: arXiv:2509.01986v4

Abstract: In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module.

To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets:

  • DIM-T2I: containing 14M long-context image-text pairs to enhance complex instruction comprehension.
  • DIM-Edit: consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits.
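
To make the idea of a "design blueprint" concrete, here is a minimal Python sketch of what one DIM-Edit-style record could look like and how its chain-of-thought text might be joined with the user instruction. The field names, the `EditSample` class, the `build_condition_text` helper, and the text template are all hypothetical; the source does not describe the actual schema or prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSample:
    """One hypothetical DIM-Edit-style record (field names are assumptions)."""
    source_image: str  # path to the image to be edited
    target_image: str  # path to the edited result
    instruction: str   # short user edit instruction
    blueprint: str     # chain-of-thought "imagination" describing the edit


def build_condition_text(sample: EditSample) -> str:
    """Join the instruction and the design blueprint into one text condition.

    The template is illustrative only; the source does not specify one.
    """
    return (
        f"Instruction: {sample.instruction.strip()}\n"
        f"Blueprint: {sample.blueprint.strip()}"
    )


sample = EditSample(
    source_image="cat.png",
    target_image="cat_hat.png",
    instruction="Add a red hat to the cat.",
    blueprint="The cat sits in the center; place a small red hat on its head "
              "and keep the background unchanged.",
)
print(build_condition_text(sample))
```

The point of the sketch is only that the blueprint spells out layout and target region in text, so that work does not have to be inferred by the generation module.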

In our approach, we connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP. This configuration allows us to train the model on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves state-of-the-art (SOTA) or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
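
As a rough illustration of this configuration, the following Python sketch tallies a parameter budget for the three parts. It uses the nominal sizes from the model names (3B and 1.6B), which is consistent with the "4.6B" in DIM-4.6B. The MLP widths (2048, 2048, 2304) are placeholder values, and treating the connector as trainable is an assumption; neither is stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    params: int      # parameter count
    trainable: bool  # whether training updates this part


def mlp_params(d_in: int, d_hidden: int, d_out: int) -> int:
    """Weights and biases of a two-layer MLP mapping d_in -> d_hidden -> d_out."""
    return (d_in * d_hidden + d_hidden) + (d_hidden * d_out + d_out)


# Nominal sizes come from the model names; the MLP widths are placeholders.
parts = [
    Part("understanding (Qwen2.5-VL-3B)", 3_000_000_000, trainable=False),
    Part("connector (two-layer MLP)", mlp_params(2048, 2048, 2304), trainable=True),
    Part("generation (SANA1.5-1.6B)", 1_600_000_000, trainable=True),
]

total = sum(p.params for p in parts)
trainable = sum(p.params for p in parts if p.trainable)

print(f"total: {total / 1e9:.1f}B")
print(f"trainable share: {trainable / total:.0%}")
```

Under these assumptions the connector adds only a few million parameters, so it barely changes the total, and roughly a third of the combined model receives gradient updates.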

These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Rebalancing the two roles, so that the understanding module designs and the generation module paints, improves results on tasks that require precise image manipulation.

Key highlights from the research include:

  • A new dataset, DIM, with 14M long-context image-text pairs (DIM-T2I) and 233K chain-of-thought edit blueprints (DIM-Edit).
  • Improved image editing from moving the design role into the understanding module, leaving the generation module to render.
  • SOTA or competitive results on ImgEdit and GEdit-Bench at 4.6B parameters, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.

The DIM dataset and models are publicly available for further research and development at https://github.com/showlab/DIM.


Lazarus Omolua (https://richlyai.com/blog)