CRAFT: Video Diffusion for Bimanual Robot Data Generation
arXiv:2604.03552v1
Abstract
Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels.
Introduction
Bimanual manipulation policies need diverse, high-quality training data, but collecting it through real-world demonstrations is costly and time-consuming. CRAFT addresses this by using a video diffusion framework to generate synthetic training data that is both diverse and photorealistic.
Methodology
CRAFT conditions a video diffusion model on edge-based structural cues (Canny edges, as the method's name indicates) extracted from simulator-generated trajectories. Conditioning on structure in this way allows the model to produce physically plausible variations in robot trajectories. The framework supports a unified augmentation pipeline that includes:
- Object pose changes
- Camera viewpoint adjustments
- Lighting and background variations
- Cross-embodiment transfer
- Multi-view synthesis
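To make the edge-based conditioning concrete, the sketch below turns each frame of a simulated rollout into a binary edge map. It is an illustration only, not the authors' code: it thresholds Sobel gradient magnitude as a simplified stand-in for Canny (no smoothing, non-maximum suppression, or hysteresis), and the names `edge_map`, `edge_video`, and `threshold` are hypothetical. In practice an image library routine such as OpenCV's `cv2.Canny` would be the usual choice.

```python
from typing import List

# Assumption: a frame is a grayscale image stored row-major with values in [0, 1].
Frame = List[List[float]]


def edge_map(frame: Frame, threshold: float = 0.5) -> List[List[int]]:
    """Binary edge map from Sobel gradient magnitude.

    Simplified stand-in for Canny edge detection; border pixels are left at 0.
    """
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal Sobel response: right column minus left column.
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            # Vertical Sobel response: bottom row minus top row.
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


def edge_video(frames: List[Frame], threshold: float = 0.5) -> List[List[List[int]]]:
    """Apply edge_map to every frame of a simulated video."""
    return [edge_map(f, threshold) for f in frames]
```

The resulting per-frame edge maps carry the scene and robot geometry but none of the simulator's textures or lighting, which is the kind of structural signal a video diffusion model could be conditioned on.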
Implementation
CRAFT uses a pre-trained video diffusion model to convert simulated videos into action-consistent demonstrations. The action labels come from the simulation trajectories, so the framework can build a large, visually diverse dataset starting from only a few real-world demonstrations.
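The labeling idea can be sketched as a small data structure: several generated visual variants of one simulated rollout all reuse that rollout's action sequence. This is a hedged illustration, not the paper's implementation. The names `SyntheticDemo`, `label_generated_video`, and `expand_dataset` are hypothetical, frames are represented by placeholder string identifiers, and the one-action-per-frame alignment is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SyntheticDemo:
    frames: List[str]           # placeholder identifiers for generated video frames
    actions: List[List[float]]  # per-step action vector copied from the simulator


def label_generated_video(generated_frames: Sequence[str],
                          sim_actions: Sequence[Sequence[float]]) -> SyntheticDemo:
    """Attach simulator actions to a generated video.

    Assumption: the generated video keeps the simulated rollout's timing,
    so there is exactly one action per frame.
    """
    if len(generated_frames) != len(sim_actions):
        raise ValueError("frame count and action count must match")
    # Copy the actions so each demonstration owns its own label list.
    return SyntheticDemo(list(generated_frames), [list(a) for a in sim_actions])


def expand_dataset(variants: Sequence[Sequence[str]],
                   sim_actions: Sequence[Sequence[float]]) -> List[SyntheticDemo]:
    """Build one labeled demonstration per visual variant of the same rollout."""
    return [label_generated_video(v, sim_actions) for v in variants]
```

For example, calling `expand_dataset` with two variants of a three-step rollout yields two demonstrations with identical action labels but different frames, which is how a single simulated trajectory could contribute many visually distinct training samples.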
Results
In experiments on both simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and conventional data scaling. The results indicate that diffusion-based video generation can substantially increase demonstration diversity and improve generalization for dual-arm manipulation.
Conclusion
CRAFT offers a scalable way to generate diverse training data for robot learning from demonstrations. Synthesizing high-quality manipulation videos reduces the reliance on real-world data collection and gives bimanual policies more varied data to learn from. More information is available on the CRAFT project website.
