VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing
Summary: arXiv:2603.29852v1
VectorGym is a benchmark suite for Scalable Vector Graphics (SVG) that covers generation, sketching, and editing, with a focus on real-world design workflows. It addresses limitations of existing benchmarks by offering a set of tasks aligned with professional design practice.
Overview of VectorGym
VectorGym provides four tasks, each built on expert human-authored annotations. The benchmark encompasses:
- Sketch2SVG (VG-Sketch): A novel task that translates sketches into SVG representations.
- SVG Editing (VG-Edit): A new dataset that challenges models to perform complex, multi-step edits involving higher-order primitives.
- Text2SVG (VG-Text): A task focused on generating SVG from textual descriptions.
- SVG Captioning (VG-Cap): A task that involves generating captions for SVG images.
Unlike previous benchmarks, which relied predominantly on synthetic edits, VectorGym provides gold-standard human annotations grounded in real-world use. The aim is for models trained and evaluated on the benchmark to capture semantic nuance and design intent, which makes them more applicable in professional settings.
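To make the VG-Edit task more concrete, here is a minimal sketch of a two-step edit applied to a small SVG with Python's standard library. The SVG source, the instruction ("make the circle blue and round the square's corners"), and the helper name apply_edit are illustrative assumptions, not samples or code from the benchmark.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an "ns0:" prefix

# Hypothetical input: a grey square with a red circle on top.
SOURCE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<rect x="10" y="10" width="80" height="80" fill="#cccccc"/>'
    '<circle cx="50" cy="50" r="20" fill="#ff0000"/>'
    "</svg>"
)


def apply_edit(svg_text: str) -> str:
    """Apply the assumed instruction: recolor the circle, round the rect corners."""
    root = ET.fromstring(svg_text)
    circle = root.find(f"{{{SVG_NS}}}circle")
    rect = root.find(f"{{{SVG_NS}}}rect")
    circle.set("fill", "#0000ff")  # step 1: make the circle blue
    rect.set("rx", "8")            # step 2: round the square's corners
    return ET.tostring(root, encoding="unicode")


edited = apply_edit(SOURCE_SVG)
print(edited)
```

A model solving an editing task has to produce the edited code directly from the source SVG and the instruction. The script above only shows what a correct target could look like for this toy case.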
Methodology and Performance
VectorGym is paired with a multi-task reinforcement learning approach that jointly optimizes across all four tasks. The method uses rendering-based rewards and builds on Group Relative Policy Optimization (GRPO) with a curriculum learning strategy. It is used to train a Qwen3-VL 8B model, which performs strongly compared to existing open-source models.
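The core idea of GRPO, scoring each sampled output relative to the other samples drawn for the same prompt, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: pixel_reward is a hypothetical rendering-based reward (one minus the mean squared pixel error), the tiny 2x2 grids stand in for rasterized SVG outputs, and normalizing by the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def pixel_reward(rendered, target):
    """Hypothetical rendering-based reward: 1 - MSE over pixel values in [0, 1]."""
    flat_r = [p for row in rendered for p in row]
    flat_t = [p for row in target for p in row]
    mse = mean((a - b) ** 2 for a, b in zip(flat_r, flat_t))
    return 1.0 - mse


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Stand-ins for rasterized outputs; real rendering of the SVG code is omitted here.
target = [[0.0, 1.0], [1.0, 0.0]]
candidates = [
    [[0.0, 1.0], [1.0, 0.0]],  # exact match
    [[0.0, 1.0], [0.0, 0.0]],  # one pixel wrong
    [[1.0, 0.0], [0.0, 1.0]],  # fully inverted
]

rewards = [pixel_reward(c, target) for c in candidates]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Samples that render closer to the target than the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.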
Notably, the Qwen3-VL 8B model outperforms much larger models, including Qwen3-VL 235B, and performs comparably to GPT-4o. The result suggests that multi-task reinforcement learning on VectorGym can make a relatively small model competitive at visual code generation.
Innovative Metrics and Evaluations
To evaluate outputs in a way that tracks human judgment, VectorGym introduces a metric called VLM-as-a-Judge. The metric has been validated through human correlation studies, which check that its scores agree with human assessments of SVG generation quality.
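One common way to run such a validation is to compute a rank correlation between the judge's scores and human ratings for the same outputs. The sketch below uses Spearman's rank correlation in pure Python. The choice of Spearman, the function names, and the scores are all assumptions for illustration, because the summary does not specify which correlation measure or data the study used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group spanning i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for five generated SVGs (not data from the paper).
human_ratings = [5, 4, 4, 2, 1]
judge_scores = [4.8, 4.1, 3.9, 2.5, 1.0]

rho = spearman(human_ratings, judge_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A value near 1 means the judge orders outputs almost the same way human raters do. A value near 0 would indicate the metric does not track human preference.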
An evaluation of frontier Vision-Language Models (VLMs) on VectorGym reveals significant performance gaps, which underscores the need for rigorous evaluation frameworks in this area. VectorGym is intended as a resource for researchers and developers working on visual code generation.
Availability
VectorGym is publicly available on Hugging Face, where the research community can use it for further work on SVG generation and editing.
