UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
Evaluating visual editing models has long been difficult, largely because existing benchmarks are fragmented. As methods and modalities multiply, the need for a unified benchmark that allows fair cross-paradigm comparison has become increasingly apparent. We introduce UniEditBench, a benchmark that unifies the evaluation of image and video editing tasks.
Current benchmarks are often tailored to a specific paradigm, which makes meaningful comparison across visual editing models difficult. Video editing is gaining traction, yet it still lacks reliable evaluation benchmarks, and this has hindered progress. Many common automatic metrics also align poorly with human preferences, so they are an unreliable measure of edit quality.
Introducing UniEditBench
UniEditBench addresses these challenges with a single evaluation protocol that supports both reconstruction-based and instruction-driven visual editing methods, so that results from different editing paradigms can be compared directly.
Key Features of UniEditBench
- Comprehensive Taxonomy: The benchmark defines a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, and Reorder) and eight video operations. The coverage includes challenging compositional tasks such as counting and spatial reordering.
- Scalable Evaluation: UniEditBench uses a high-capacity multimodal large language model (MLLM), Qwen3-VL-235B-A22B Instruct, as its teacher judge. This judge is distilled into lightweight 4B and 8B evaluators that produce multi-dimensional scores.
- Multi-Dimensional Scoring: The evaluators score each edit on structural fidelity, text alignment, background consistency, and naturalness, plus temporal-spatial consistency for video editing tasks. This multi-dimensional view is intended to make evaluations more robust and reliable.
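To make the taxonomy and scoring dimensions listed above concrete, here is a minimal Python sketch of how one judged edit could be recorded and aggregated. It is an illustration, not the benchmark's actual data format: the names `EditScore`, `overall`, and `mean_by_operation`, the numeric score scale, and the unweighted-mean aggregation are all assumptions. The eight video operations are not named here, so the sketch validates operation names for image edits only.

```python
from dataclasses import dataclass
from statistics import mean

# The nine image operations named in the taxonomy.
IMAGE_OPERATIONS = (
    "Add", "Remove", "Replace", "Change", "Stroke-based",
    "Extract", "Adjust", "Count", "Reorder",
)

# Scoring dimensions named in the text; the video-only one is kept separate.
SHARED_DIMENSIONS = (
    "structural_fidelity", "text_alignment",
    "background_consistency", "naturalness",
)
VIDEO_ONLY_DIMENSIONS = ("temporal_spatial_consistency",)


@dataclass
class EditScore:
    """One judged edit (hypothetical record layout; score scale is assumed)."""
    operation: str
    is_video: bool
    scores: dict  # dimension name -> numeric score

    def __post_init__(self):
        # Only image operations are enumerated in the text, so only they are checked.
        if not self.is_video and self.operation not in IMAGE_OPERATIONS:
            raise ValueError(f"unknown image operation: {self.operation}")

    def expected_dimensions(self):
        dims = SHARED_DIMENSIONS
        if self.is_video:
            dims = dims + VIDEO_ONLY_DIMENSIONS
        return dims

    def overall(self):
        """Unweighted mean over the expected dimensions (assumed aggregation)."""
        dims = self.expected_dimensions()
        missing = [d for d in dims if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        return mean(self.scores[d] for d in dims)


def mean_by_operation(records):
    """Average the overall score of all records sharing an operation."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.operation, []).append(record.overall())
    return {op: mean(vals) for op, vals in grouped.items()}
```

Grouping by operation is one plausible way to report results, since it exposes weaknesses on compositional tasks such as Count and Reorder that a single global average would hide.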
Benefits of Distillation
A key benefit of UniEditBench is that it reduces the computational and financial cost of using large MLLMs as evaluators. In experiments, the distilled evaluators maintain strong agreement with human judgments while costing substantially less to deploy than the original teacher model. This makes benchmarking more accessible to researchers and developers in the field.
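Agreement between an evaluator and human raters is commonly quantified with a rank correlation. The sketch below computes Spearman's rank correlation in plain Python; the choice of Spearman is an assumption made for illustration, since the text does not say which agreement measure was used, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(evaluator_scores, human_scores):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(evaluator_scores) != len(human_scores) or len(evaluator_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(evaluator_scores), average_ranks(human_scores))
```

A value near 1 means the evaluator orders edits the same way human raters do. Running the same comparison for the teacher and for each distilled evaluator would show how much agreement, if any, is lost in distillation.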
Conclusion
As visual editing methods continue to advance, a practical and reproducible benchmarking protocol becomes crucial. UniEditBench offers one that is both unified and cost-effective, making modern visual editing methods easier to evaluate. The benchmark and its associated reward models are publicly available on GitHub.
