MULTITEXTEDIT: Benchmarking Multilingual Text-in-Image Editing

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

Text-in-image editing has become a key capability for visual content creation. Existing benchmarks, however, are predominantly English-centric and often conflate visual plausibility with semantic accuracy: an edit can look convincing while the text it renders is wrong. To address this gap, researchers have introduced MULTITEXTEDIT, a controlled benchmark for assessing text-in-image editing systems across multiple languages.

MULTITEXTEDIT comprises 3,600 instances that span 12 typologically diverse languages, five visual domains, and seven editing operations. Each language variant of an instance shares a common visual base and is accompanied by a human-edited reference and region masks. Because only the language changes between variants, differences in output quality can be attributed to language rather than to image content, which is what makes cross-lingual comparison meaningful.
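The instance layout described above can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual schema: every field name, language code, and file path here is a hypothetical stand-in chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EditInstance:
    """One benchmark item. All field names are hypothetical."""
    base_id: str          # identifies the shared visual base
    language: str         # language of this variant, e.g. "ar" or "nl"
    domain: str           # one of the five visual domains
    operation: str        # one of the seven editing operations
    source_image: str     # path to the image to be edited
    reference_image: str  # path to the human-edited reference
    mask: str             # path to the region mask


def group_by_base(instances):
    """Collect the language variants that share one visual base."""
    groups = defaultdict(dict)
    for inst in instances:
        groups[inst.base_id][inst.language] = inst
    return dict(groups)


# Two hypothetical variants of the same base image.
demo = [
    EditInstance("poster_001", "ar", "poster", "replace",
                 "poster_001.png", "poster_001_ar_ref.png", "poster_001_mask.png"),
    EditInstance("poster_001", "nl", "poster", "replace",
                 "poster_001.png", "poster_001_nl_ref.png", "poster_001_mask.png"),
]
demo_groups = group_by_base(demo)
```

Grouping by the shared base is what allows a per-language score to be compared on identical image content.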

Key Features of MULTITEXTEDIT

Diverse Language Coverage: The benchmark includes languages from various linguistic families, ensuring a wide-ranging evaluation of text-in-image editing capabilities.
Controlled Environment: By standardizing visual elements across language instances, MULTITEXTEDIT provides a reliable framework for assessing how well different systems handle text in various scripts.
Language Fidelity Metric (LSF): A new metric designed to capture script-level errors that standard text-matching metrics often miss, such as missing diacritics, reversed right-to-left (RTL) order, and mixed-script rendering.
Two-Stage LVM Protocol: The language fidelity metric is scored using a two-stage protocol that first traces the edited target text and then evaluates it in isolation. The protocol reached a quadratic-weighted kappa of 0.76 against native-speaker annotators.
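To make the three script-level error types concrete, here is a rough sketch of string heuristics that flag each one. This is not the LSF metric itself, which is scored by the model-based protocol described above. The sketch assumes the rendered text has already been extracted from the image as a string, and the function names and error labels are hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks (Unicode categories Mn, Mc, Me) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))


def scripts_used(text):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if ch.isalpha()}


def script_level_errors(target, rendered):
    """Return coarse, hypothetical error labels for a rendered string."""
    target = unicodedata.normalize("NFC", target)
    rendered = unicodedata.normalize("NFC", rendered)
    errors = []
    if rendered == target:
        return errors
    # Same base letters but different marks: diacritics were lost or changed.
    if strip_marks(rendered) == strip_marks(target):
        errors.append("diacritic_mismatch")
    # Characters emitted in exactly the opposite logical order.
    if len(target) > 1 and rendered == target[::-1]:
        errors.append("reversed_order")
    # Letters from a script that the target never uses.
    if scripts_used(rendered) - scripts_used(target):
        errors.append("mixed_script")
    return errors


# "café" rendered without its accent.
missing_accent = script_level_errors("caf\u00e9", "cafe")
# Hebrew "shalom" rendered in reversed logical order.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
reversed_rtl = script_level_errors(hebrew, hebrew[::-1])
# Cyrillic word with Latin "p" and "e" substituted for look-alike letters.
mixed = script_level_errors("\u041f\u0440\u0438\u0432\u0435\u0442",
                            "\u041fp\u0438\u0432e\u0442")
```

An exact-match or edit-distance score would treat all three cases as generic mismatches of similar size, which is the gap a script-aware metric is meant to close.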
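For readers unfamiliar with the agreement statistic, quadratic-weighted kappa compares two raters on an ordinal scale and penalizes disagreements by the squared distance between their ratings. The sketch below is a plain-Python version of the standard formula. The rating scale used in the benchmark is not stated here, so `num_levels` and the sample ratings are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for integer ratings in [0, num_levels)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of ratings")
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal histograms give the agreement expected by chance.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels))
              for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters used one identical level for everything.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings on an assumed 4-level scale.
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
opposite = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. In practice, scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same statistic.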

Findings from MULTITEXTEDIT Evaluation

The evaluation of 12 open-source and proprietary editing systems using the LSF alongside standard semantic and mask-aware pixel metrics revealed significant cross-lingual degradation across all models tested. The findings indicate that:

The largest degradation was observed in Hebrew and Arabic, while the smallest was noted in Dutch and Spanish.
Issues were primarily concentrated in text accuracy and script fidelity, rather than in broader structural dimensions of the output.
Pixel fidelity and semantic integrity frequently diverged: outputs preserved global layout and background fidelity while distorting script-specific forms.
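One way to see how an output can keep its background intact while getting the text wrong is to split a pixel error by the region mask. The sketch below is a generic illustration and not the benchmark's own mask-aware metrics, which are not defined in this post. It assumes small grayscale images stored as nested lists and a binary mask where 1 marks the edited text region.

```python
def _mean(values):
    """Average of a list, or 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def masked_mean_abs_error(output, reference, mask):
    """Mean absolute pixel difference inside and outside the mask.

    All three arguments are equal-sized 2D lists. The mask holds 0 or 1,
    with 1 marking the edited text region (an assumed convention).
    """
    inside, outside = [], []
    for out_row, ref_row, mask_row in zip(output, reference, mask):
        for out_px, ref_px, in_region in zip(out_row, ref_row, mask_row):
            bucket = inside if in_region else outside
            bucket.append(abs(out_px - ref_px))
    return _mean(inside), _mean(outside)


# Toy 2x3 images: the background matches, the text pixel does not.
reference = [[10, 10, 10],
             [10, 200, 10]]
output = [[10, 10, 10],
          [10, 120, 10]]
mask = [[0, 0, 0],
        [0, 1, 0]]
inside_error, outside_error = masked_mean_abs_error(output, reference, mask)
```

Here the error outside the mask is zero, so a whole-image score would look strong even though the text region differs substantially. Even a low error inside the mask would not show that the rendered characters are correct, which is why pixel metrics are paired with semantic and script-level ones.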

Conclusion

MULTITEXTEDIT provides a controlled evaluation framework for text-in-image editing across languages. It shows where current systems fall short, chiefly in text accuracy and script fidelity, with the largest degradation in Hebrew and Arabic, and it gives future work on multilingual text rendering a common basis for comparison. As demand for multilingual capabilities in AI grows, these results offer developers and researchers a concrete starting point for building more inclusive text-in-image editing tools.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
