IMAGAgent: Orchestrating Multi-Turn Image Editing via Constraint-Aware Planning and Reflection
Summary: arXiv:2603.29602v1 (cross-listed)
Abstract: Existing multi-turn image editing paradigms are often confined to isolated single-step execution. Lacking context-awareness and closed-loop feedback mechanisms, they are prone to error accumulation and semantic drift during multi-turn interactions, ultimately resulting in severe structural distortion of the generated images. To address this, we propose IMAGAgent, a multi-turn image editing agent framework based on a “plan-execute-reflect” closed-loop mechanism that achieves deep synergy among instruction parsing, tool scheduling, and adaptive correction within a unified pipeline.
Introduction
In recent years, demand for advanced image editing has surged, driven by the growth of social media and digital content creation. Existing editing tools, however, typically handle each instruction in isolation, without context from earlier turns or feedback on their own results, so errors compound over a multi-turn session. IMAGAgent addresses these challenges by combining instruction parsing, tool scheduling, and adaptive correction in a single pipeline.
Key Features of IMAGAgent
- Constraint-Aware Planning: The foundation of IMAGAgent is its planning module, which utilizes a vision-language model (VLM) to decompose complex instructions into manageable sub-tasks. This process is governed by three key principles: target singularity, semantic atomicity, and visual perceptibility.
- Tool-Chain Orchestration: IMAGAgent dynamically constructs execution paths based on the current image and historical context. This capability allows for adaptive scheduling and seamless collaboration among various operation models, including image retrieval, segmentation, detection, and editing.
- Multi-Expert Collaborative Reflection: A central large language model (LLM) plays a critical role in synthesizing critiques from the VLM, providing holistic feedback to the editing process. This feedback loop not only facilitates fine-grained self-correction but also enhances future decision-making by recording outcomes.
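The three features above form one closed loop, which the following minimal Python sketch illustrates. It is not IMAGAgent's actual implementation: every name in it (SubTask, plan, execute, synthesize, TOOL_CHAINS, the "action target; action target" instruction format, and the string stand-in for an image) is a hypothetical placeholder, and the planner, tools, and critics are simple stubs where a real system would call a VLM, operation models, and an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SubTask:
    """One atomic edit: a single action applied to a single target."""
    action: str
    target: str


@dataclass
class EditState:
    image: str  # stand-in for pixel data (assumption for this sketch)
    # Each record is (sub-task, attempt index, accepted?).
    history: List[Tuple[SubTask, int, bool]] = field(default_factory=list)


# Hypothetical mapping from an action to the chain of tools it needs.
TOOL_CHAINS = {
    "remove": ["detect", "segment", "edit"],
    "recolor": ["segment", "edit"],
}

Critic = Callable[[str, SubTask, int], bool]


def plan(instruction: str) -> List[SubTask]:
    """Stub planner: splits 'action target; action target' into sub-tasks.
    A real planner would ask a VLM to perform the decomposition."""
    tasks = []
    for clause in instruction.split(";"):
        if not clause.strip():
            continue
        action, target = clause.split(maxsplit=1)
        tasks.append(SubTask(action=action, target=target.strip()))
    return tasks


def execute(image: str, task: SubTask) -> str:
    """Stub executor: records each tool call instead of editing pixels."""
    for tool in TOOL_CHAINS.get(task.action, ["edit"]):
        image = f"{image}>{tool}({task.target})"
    return image


def synthesize(verdicts: List[bool]) -> bool:
    """Stub for the central model: accept only if every critic approves."""
    return all(verdicts)


def run(instruction: str, image: str, critics: List[Critic],
        max_retries: int = 2) -> EditState:
    state = EditState(image=image)
    for task in plan(instruction):
        for attempt in range(max_retries + 1):
            candidate = execute(state.image, task)
            ok = synthesize([c(candidate, task, attempt) for c in critics])
            state.history.append((task, attempt, ok))
            if ok:
                state.image = candidate  # commit only accepted edits
                break
        # A sub-task that is never accepted leaves state.image unchanged.
    return state


def target_critic(candidate: str, task: SubTask, attempt: int) -> bool:
    """Checks that the edit touched the intended target."""
    return task.target in candidate


def strict_critic(candidate: str, task: SubTask, attempt: int) -> bool:
    """Rejects the first attempt at any 'sky' edit, to exercise the retry."""
    return not (task.target == "sky" and attempt == 0)


if __name__ == "__main__":
    result = run("remove car; recolor sky", "img",
                 [target_critic, strict_critic])
    print(result.image)
    for task, attempt, ok in result.history:
        print(task.action, task.target, attempt, ok)
```

In this sketch a rejected candidate is discarded rather than committed, so a failed edit does not carry over into later turns, and the history list plays the role of the recorded outcomes that later decisions could consult.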
Experimental Validation
To evaluate the effectiveness of IMAGAgent, extensive experiments were conducted using the newly constructed MTEditBench and the MagicBrush dataset. The results demonstrated that IMAGAgent significantly outperforms existing methods in several key metrics:
- Instruction Consistency: IMAGAgent maintains high fidelity to user instructions across multiple editing iterations.
- Editing Precision: The framework achieves remarkable accuracy in executing complex editing tasks.
- Overall Quality: The final images produced exhibit superior quality with fewer distortions and artifacts.
Conclusion
IMAGAgent targets the error accumulation and semantic drift that arise in multi-turn image editing by integrating planning, execution, and reflection within a single closed-loop framework. The code for IMAGAgent is publicly available, allowing researchers and developers to build upon it. For more information, visit the GitHub repository.
