CRAFT: Scalable Video Diffusion for Bimanual Robot Data

arXiv:2604.03552v1 (cross-listed)

Abstract

Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels.

Introduction

Bimanual manipulation policies need diverse, high-quality training data, but collecting it through real-world demonstrations is costly and time-consuming. CRAFT addresses this by using a video diffusion framework to generate synthetic training data that is both diverse and photorealistic.

Methodology

CRAFT conditions a video diffusion model on edge-based structural cues (Canny edges, as the name indicates) extracted from simulator-generated trajectories. Conditioning on these cues lets the model produce physically plausible variations of robot trajectories. The framework supports an augmentation pipeline that covers:

  • Object pose changes
  • Camera viewpoint adjustments
  • Lighting and background variations
  • Cross-embodiment transfer
  • Multi-view synthesis
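
To make the idea of edge-based structural cues concrete, here is a minimal sketch of turning simulator-rendered frames into binary edge maps. It is not the authors' code. It swaps in a simplified detector (Sobel gradient magnitude with a single threshold) in place of full Canny, which additionally applies smoothing, non-maximum suppression, and hysteresis thresholding. The function names, the threshold value, and the toy frame are all assumptions made for illustration.

```python
# Simplified stand-in for Canny edge extraction (assumption: Sobel + threshold only).

def sobel_edge_map(frame, threshold=100.0):
    """Return a binary edge map (rows of 0/1) for one grayscale frame.

    frame: list of rows, each a list of intensities in [0, 255].
    Border pixels stay 0 because the 3x3 kernel does not fit there.
    """
    height, width = len(frame), len(frame[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal Sobel response (right column minus left column).
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            # Vertical Sobel response (bottom row minus top row).
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges


def edge_video(frames, threshold=100.0):
    """Apply the edge detector to every frame of a rendered trajectory."""
    return [sobel_edge_map(frame, threshold) for frame in frames]


# Toy 6x6 frame: dark left half, bright right half (hypothetical data).
toy_frame = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
toy_edges = sobel_edge_map(toy_frame)
flat_edges = sobel_edge_map([[128] * 6 for _ in range(6)])
```

In the toy frame the detector marks only the two columns on either side of the dark-to-bright boundary, and the uniform frame produces no edges. The edge map keeps scene structure but discards color and texture, which is the kind of signal that leaves appearance open to variation.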

Implementation

CRAFT builds on a pre-trained video diffusion model to convert simulated videos into action-consistent demonstrations. Action labels come directly from the simulation trajectories, so each generated video is paired with the actions that produced it. This lets the framework build a large, visually diverse dataset from only a few real-world demonstrations.
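
The pairing of generated videos with simulator action labels can be sketched as follows. This is an illustrative outline under stated assumptions, not the CRAFT implementation: `SimTrajectory`, `Appearance`, `Demonstration`, `build_dataset`, and `placeholder_generator` are hypothetical names, the generator is a stand-in for the video diffusion model, and the one-frame-per-action alignment is assumed for simplicity.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SimTrajectory:
    edge_frames: list  # per-frame edge maps rendered from the simulator
    actions: list      # per-step action vectors recorded in simulation


@dataclass(frozen=True)
class Appearance:
    lighting: str
    background: str


@dataclass(frozen=True)
class Demonstration:
    video: list
    actions: list
    appearance: Appearance


def build_dataset(trajectories, appearances, generate_video):
    """Pair every generated video with the actions of its source trajectory."""
    dataset = []
    for traj, look in product(trajectories, appearances):
        video = generate_video(traj.edge_frames, look)
        # Assumption: one generated frame per recorded action step.
        if len(video) != len(traj.actions):
            raise ValueError("video length must match number of action steps")
        dataset.append(Demonstration(video, list(traj.actions), look))
    return dataset


def placeholder_generator(edge_frames, look):
    """Hypothetical stand-in for the diffusion model: tags frames with the look."""
    return [(look.lighting, look.background, i) for i, _ in enumerate(edge_frames)]


# Toy inputs: one two-step trajectory with 2-D actions, three appearances.
trajs = [SimTrajectory(edge_frames=[[[0]], [[1]]], actions=[[0.0, 0.0], [0.1, -0.1]])]
looks = [Appearance("warm", "kitchen"), Appearance("cool", "lab"), Appearance("dim", "lab")]
dataset = build_dataset(trajs, looks, placeholder_generator)
```

Here one simulated trajectory yields three demonstrations that differ in appearance but share identical action labels. No labels are inferred from the generated pixels in this sketch; they are copied from the simulator.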

Results

Across both simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and conventional data scaling. The results indicate that diffusion-based video generation can substantially increase demonstration diversity and improve generalization for dual-arm manipulation tasks.

Conclusion

CRAFT offers a scalable way to generate diverse training data for robot learning from demonstrations. Synthesizing high-quality manipulation videos reduces reliance on real-world data collection and broadens the data available for training bimanual policies. More information is available on the CRAFT project website.


Lazarus Omolua
https://richlyai.com/blog