CRAFT: Scalable Video Diffusion for Bimanual Robot Data


Source: arXiv:2604.03552v1 (cross-listed)

Abstract

Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels.

Introduction

Bimanual manipulation policies need diverse, high-quality training data, but collecting it through real-world demonstrations is costly and time-consuming. CRAFT addresses this by using a video diffusion framework to generate synthetic training data that is both diverse and photorealistic.

Methodology

CRAFT conditions a video diffusion model on edge-based structural cues (Canny edges, as the name indicates) extracted from simulator-generated trajectories. Conditioning on simulator-derived structure allows the model to produce physically plausible variations in robot trajectories. The framework supports an augmentation pipeline that covers:

Object pose changes
Camera viewpoint adjustments
Lighting and background variations
Cross-embodiment transfer
Multi-view synthesis
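
To make the edge-based conditioning signal concrete, here is a minimal sketch of turning simulator frames into binary edge maps. It is not the authors' code: it uses a plain Sobel gradient threshold as a simplified stand-in for Canny (no Gaussian smoothing, non-maximum suppression, or hysteresis), and the names `sobel_edge_map`, `edge_video`, and the `threshold` parameter are hypothetical.

```python
from typing import List

# A grayscale frame: rows of pixel intensities in [0, 1].
Frame = List[List[float]]
EdgeMap = List[List[int]]


def sobel_edge_map(frame: Frame, threshold: float = 0.5) -> EdgeMap:
    """Return a binary edge map (1 = edge) for one grayscale frame.

    Simplified stand-in for Canny: only the gradient-magnitude step
    is implemented. Border pixels are left at 0.
    """
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient (Sobel x kernel).
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            # Vertical gradient (Sobel y kernel).
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges


def edge_video(frames: List[Frame], threshold: float = 0.5) -> List[EdgeMap]:
    """Apply the edge extractor to every frame of a simulated rollout."""
    return [sobel_edge_map(f, threshold) for f in frames]
```

The edge maps keep the scene's structure (arm and object outlines) while discarding texture, color, and lighting, which is what leaves the diffusion model free to vary appearance. A real pipeline would more likely call an existing Canny implementation than hand-roll the gradient step.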

Implementation

By leveraging a pre-trained video diffusion model, CRAFT converts simulated videos into action-consistent demonstrations. The action labels come from the simulation trajectories, so the framework can build a large, visually diverse dataset starting from only a few real-world demonstrations.
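
The pairing of generated videos with simulator action labels can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: `generate_video` is a placeholder for the diffusion model, the `Augmentation` and `Demonstration` types are hypothetical, and the sketch assumes appearance-only variations (lighting, background) so that one simulated trajectory's labels can be shared across its visual variants.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Augmentation:
    """Hypothetical description of one appearance variation."""
    kind: str   # e.g. "lighting" or "background"
    seed: int


@dataclass
class Demonstration:
    frames: List[str]                # stand-ins for generated video frames
    actions: List[Sequence[float]]   # per-step action labels from the simulator
    augmentation: Augmentation


def generate_video(edge_frames: Sequence[object], aug: Augmentation) -> List[str]:
    """Placeholder for the video diffusion model (assumption).

    Returns one tagged frame per conditioning edge frame so the output
    stays aligned step-by-step with the simulated trajectory.
    """
    return [f"{aug.kind}-{aug.seed}-frame{i}" for i in range(len(edge_frames))]


def build_dataset(
    edge_frames: Sequence[object],
    sim_actions: Sequence[Sequence[float]],
    augmentations: Sequence[Augmentation],
) -> List[Demonstration]:
    """Create one labeled demonstration per augmentation of a single rollout."""
    if len(edge_frames) != len(sim_actions):
        raise ValueError("each conditioning frame needs exactly one action label")
    return [
        Demonstration(
            frames=generate_video(edge_frames, aug),
            actions=[tuple(a) for a in sim_actions],  # labels copied from simulation
            augmentation=aug,
        )
        for aug in augmentations
    ]
```

In this sketch, variations that change the motion itself (for example a new object pose) would need a fresh simulated trajectory, with its own edge frames and action labels, before calling `build_dataset`.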

Results

In experiments on both simulated and real-world bimanual tasks, CRAFT improved success rates over existing augmentation strategies and conventional data scaling. The results indicate that diffusion-based video generation can substantially increase demonstration diversity and improve generalization for dual-arm manipulation.

Conclusion

CRAFT offers a scalable and efficient way to generate diverse training data for robot learning from demonstrations. Synthesizing high-quality manipulation videos reduces reliance on real-world data collection and improves policy learning for bimanual robots. More information is available on the CRAFT project website.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
