Gen-n-Val: Agentic Image Data Generation and Validation
Data scarcity, label noise, and long-tailed category imbalance remain significant challenges in computer vision. They are particularly pronounced in object detection and instance segmentation, especially on large-vocabulary benchmarks such as LVIS, where many categories are represented by only a handful of images, making effective training difficult. Recent synthetic data generation methods attempt to address these challenges but often fall short because of multiple objects per mask, inaccurate segmentation, and incorrect category labels.
To overcome these limitations, the Gen-n-Val framework combines Layer Diffusion (LD) with a Large Language Model (LLM) and a Vision Large Language Model (VLLM) to generate high-quality, diverse instance masks and images for object detection and instance segmentation.
Key Features of Gen-n-Val
- Dual-Agent Framework: Gen-n-Val operates through two key agents:
  - LD Prompt Agent: This LLM is responsible for optimizing prompts that guide the LD to produce high-quality foreground single-object images and their corresponding segmentation masks.
  - Data Validation Agent: The VLLM filters out low-quality synthetic instance images, ensuring that only the best outputs are retained.
- TextGrad Optimization: The prompts used for both agents are optimized using TextGrad, which refines text through natural-language feedback from an LLM, so that both generation and validation improve.
- Significant Performance Improvements: Compared with existing synthetic data generation methods, Gen-n-Val reduces the share of invalid synthetic data from 50% to 7%.
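The dual-agent flow described above can be sketched as a generate-then-filter loop. The snippet below is a minimal illustration, not the authors' implementation: `ld_prompt_agent`, `layer_diffusion`, `validation_agent`, the `Instance` fields, and the 0.5 mask-quality threshold are all hypothetical stand-ins, and random sampling replaces the real LLM, Layer Diffusion, and VLLM calls.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """A synthetic foreground instance (hypothetical fields for illustration)."""
    category: str
    prompt: str
    num_objects: int      # objects visible in the generated foreground
    mask_quality: float   # 0..1 stand-in for segmentation accuracy
    label_correct: bool   # whether the image matches its category label


def ld_prompt_agent(category: str) -> str:
    """Stand-in for the LLM prompt agent: ask for one clearly visible object."""
    return f"a single {category}, full view, centered, plain background"


def layer_diffusion(prompt: str, category: str, rng: random.Random) -> Instance:
    """Stand-in for Layer Diffusion. Randomly simulates the failure modes
    named in the text: multiple objects, poor masks, wrong labels."""
    return Instance(
        category=category,
        prompt=prompt,
        num_objects=rng.choice([1, 1, 1, 2]),
        mask_quality=rng.random(),
        label_correct=rng.random() > 0.1,
    )


def validation_agent(inst: Instance, min_mask_quality: float = 0.5) -> bool:
    """Stand-in for the VLLM validator: keep single-object, well-segmented,
    correctly labeled instances only. The threshold is an assumption."""
    return (
        inst.num_objects == 1
        and inst.mask_quality >= min_mask_quality
        and inst.label_correct
    )


def generate_and_validate(
    categories: List[str], per_category: int, seed: int = 0
) -> Tuple[List[Instance], int]:
    """Generate instances per category and return (kept, rejected_count)."""
    rng = random.Random(seed)
    kept: List[Instance] = []
    rejected = 0
    for category in categories:
        prompt = ld_prompt_agent(category)
        for _ in range(per_category):
            inst = layer_diffusion(prompt, category, rng)
            if validation_agent(inst):
                kept.append(inst)
            else:
                rejected += 1
    return kept, rejected


if __name__ == "__main__":
    kept, rejected = generate_and_validate(["zebra", "kettle"], per_category=50)
    total = len(kept) + rejected
    print(f"kept {len(kept)} of {total}; rejected share {rejected / total:.0%}")
```

In this sketch the rejected share is purely a property of the simulated generator; in the real system the validator's own prompt would also be tuned, which the stub does not model.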
Performance Metrics
When evaluated on rare classes in LVIS instance segmentation using Mask R-CNN, the framework achieved a performance improvement of 7.6%. Similarly, it demonstrated a 3.6% improvement in mean Average Precision (mAP) for rare classes in COCO instance segmentation using YOLOv9c and YOLO11m.
Furthermore, Gen-n-Val significantly outperforms previous models in open-vocabulary object detection benchmarks, achieving a 7.1% mAP improvement over YOLO-Worldv2-M when using YOLO11m.
Scalability and Accessibility
Gen-n-Val is also scalable: the framework is designed to accommodate increased model capacity and larger dataset sizes, making it a versatile tool for researchers and practitioners. The code is available on GitHub at https://github.com/aiiu-lab/Gen-n-Val.
In conclusion, Gen-n-Val addresses long-standing problems in synthetic data generation for computer vision, namely multi-object masks, inaccurate segmentation, and incorrect category labels, by pairing optimized generation prompts with automated validation of the generated instances.
