VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Summary: arXiv:2604.09531v1
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One contributing factor is that natural image datasets provide only limited supervision for low-level visual skills. This raises a concrete question: can targeted synthetic supervision, generated from nothing more than a task keyword such as “Depth Order,” address these weaknesses?
Introducing VisionFoundry
To explore this question, researchers have developed VisionFoundry, a task-aware synthetic data generation pipeline. It takes only a task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts. It then synthesizes images with T2I models and uses a proprietary VLM to check that each image is consistent with the generated text. No reference images or human annotations are required.
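The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The `Sample` record, the `build_samples` function, and the three callables (`propose`, `render`, `verify`) are hypothetical names standing in for the LLM, the T2I model, and the VLM verifier. The stubs at the bottom replace real model calls so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


# Hypothetical record for one verified training example.
@dataclass
class Sample:
    task: str
    question: str
    answer: str
    image: bytes  # raw image data returned by the T2I model


def build_samples(
    task: str,
    propose: Callable[[str], Iterable[Tuple[str, str, str]]],
    render: Callable[[str], bytes],
    verify: Callable[[bytes, str, str], bool],
) -> List[Sample]:
    """Turn a task name into verified image-question-answer samples.

    propose: stands in for the LLM; yields (question, answer, t2i_prompt).
    render:  stands in for the T2I model; maps a prompt to image bytes.
    verify:  stands in for the VLM check; True if the image supports the answer.
    """
    kept = []
    for question, answer, prompt in propose(task):
        image = render(prompt)
        # Drop samples whose image is inconsistent with the answer.
        if verify(image, question, answer):
            kept.append(Sample(task, question, answer, image))
    return kept


# Stubs so the sketch runs without any model access (assumptions, not real APIs).
def stub_propose(task):
    yield ("Which object is closer to the camera?", "the red cube",
           "a red cube in front of a blue sphere")
    # An empty prompt simulates a generation that the verifier should reject.
    yield ("Which object is closer to the camera?", "the blue sphere", "")


def stub_render(prompt):
    return prompt.encode("utf-8")


def stub_verify(image, question, answer):
    return len(image) > 0


samples = build_samples("Depth Order", stub_propose, stub_render, stub_verify)
print(len(samples), samples[0].answer)
```

Passing the three stages in as callables keeps the sketch independent of any particular model provider. A real setup would swap the stubs for actual LLM, T2I, and VLM clients.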
Creating the VisionFoundry-10K Dataset
Using VisionFoundry, the team constructed VisionFoundry-10K, a synthetic visual question answering (VQA) dataset of 10,000 image-question-answer triples spanning ten distinct tasks. The dataset is used as training data for VLMs.
Performance Improvements
Models trained on VisionFoundry-10K show substantial improvements on visual perception benchmarks:
- A +7% performance increase on the MMVP benchmark.
- A +10% performance increase on the CV-Bench-3D benchmark.
These gains come without degrading broader capabilities, and performance scales favorably as the dataset size increases.
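The summary does not say whether these figures are absolute percentage points or relative gains, and the two readings differ. The sketch below shows how accuracy on a multiple-choice benchmark and both kinds of improvement could be computed. The function names and the toy predictions are illustrative assumptions, not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def absolute_gain(base: float, tuned: float) -> float:
    """Improvement in percentage points."""
    return (tuned - base) * 100


def relative_gain(base: float, tuned: float) -> float:
    """Improvement as a percentage of the baseline accuracy."""
    if base == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (tuned - base) / base * 100


# Toy data for illustration only (not from the paper).
labels = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]
base_preds = ["A", "B", "A", "B", "A", "A", "B", "A", "B", "A"]   # 5 of 10 correct
tuned_preds = ["A", "B", "A", "B", "A", "B", "B", "A", "B", "A"]  # 6 of 10 correct

base = accuracy(base_preds, labels)
tuned = accuracy(tuned_preds, labels)
print(f"absolute: {absolute_gain(base, tuned):.1f} points")
print(f"relative: {relative_gain(base, tuned):.1f}%")
```

On this toy data, the same one-question improvement yields a relative gain twice as large as the absolute one, which is why the reporting convention matters when comparing numbers across papers.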
Implications for the Future
These findings suggest that limited task-targeted supervision is a significant contributor to the visual perception weaknesses of VLMs. They also point to synthetic supervision, as implemented in VisionFoundry, as a promising path toward more systematic training of VLMs.
Conclusion
VisionFoundry shows that synthetic data generated from a task name alone can supply the targeted supervision that natural image datasets lack. It offers a way past an existing bottleneck in VLM visual perception that requires neither reference images nor human annotations.
