E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion
arXiv:2511.21542v2
Introduction
Vision-Language-Action (VLA) models serve as a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Despite their promise, existing VLA systems still face significant challenges in generalizing across diverse tasks, scenes, and camera viewpoints. Furthermore, they often produce actions that are either coarse or unstable.
Challenges in Current VLA Systems
These limitations stem from several structural properties of actions in VLA settings:
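- Multi-Peaked Nature of Action Distributions: The same observation and instruction often admit several distinct valid actions, so the action distribution has multiple peaks. A policy that averages over those peaks can output an action that lies between them and matches none.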
- Token-Based Symbolic Reasoning: The pretrained vision-language models (VLMs) and VLA backbones utilize token-based reasoning, which may not effectively capture continuous action spaces.
- Finite Resolution in Robotic Control: Real-world robotic control imposes a finite resolution on actions, leading to challenges in executing precise control commands.
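The first and third properties above can be illustrated with a small sketch. It assumes a uniform binning scheme, 256 bins, and actions normalized to [-1, 1]; none of these choices come from the source, and the helper names `quantize` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize(actions, low, high, num_bins):
    """Map continuous actions in [low, high] to integer tokens in [0, num_bins - 1]."""
    actions = np.clip(actions, low, high)
    scaled = (actions - low) / (high - low)           # values in [0, 1]
    tokens = np.floor(scaled * num_bins).astype(int)
    return np.minimum(tokens, num_bins - 1)           # upper edge falls in the last bin

def dequantize(tokens, low, high, num_bins):
    """Map each token back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (tokens + 0.5) * width

# Finite resolution: the round trip loses at most half a bin width.
rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(1000, 7))      # assumed 7-dimensional actions
tokens = quantize(actions, -1.0, 1.0, num_bins=256)
recovered = dequantize(tokens, -1.0, 1.0, num_bins=256)
max_error = np.abs(recovered - actions).max()

# Multi-peaked targets: half the demonstrations go left, half go right.
demos = np.array([-0.5] * 50 + [0.5] * 50)
mean_action = demos.mean()                            # 0.0, an action no demonstration took
token_counts = np.bincount(quantize(demos, -1.0, 1.0, 256), minlength=256)
modes = np.flatnonzero(token_counts)                  # two separate tokens keep both peaks
```

A categorical distribution over tokens can place mass on both peaks at once, whereas a single regressed value cannot.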
Introducing E0: A Tweedie Discrete Diffusion Framework
To address these challenges, we introduce E0, a Tweedie discrete diffusion framework for action generation in VLA models. E0 formulates action generation as an iterative denoising process over quantized action tokens. Because it operates in a discrete action space, E0 aligns naturally with token-based reasoning, enabling:
- Fine-Grained Control: The framework supports the generation of fine-grained yet executable actions, overcoming the limitations of existing models.
- Avoidance of Distributional Mismatch: E0 avoids the distributional mismatch associated with masking-based discrete diffusion methods.
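To make "iterative denoising over quantized action tokens" concrete, here is a toy sketch. It is not E0's algorithm, which the source does not spell out. It assumes Gaussian noise added to one-hot token vectors, a fixed categorical prior standing in for a learned denoiser conditioned on images and language, and a hypothetical geometric noise schedule. Under those assumptions the posterior mean of the clean one-hot vector (the quantity Tweedie's formula yields) is simply the vector of token probabilities, which the loop samples from before re-noising at the next, smaller noise level.

```python
import numpy as np

def clean_token_posterior(x_t, sigma, prior):
    """p(x0 = k | x_t) when x_t = onehot(x0) + sigma * Gaussian noise.

    Terms shared by every k cancel, leaving log prior[k] + x_t[k] / sigma^2.
    """
    logits = np.log(prior)[None, :] + x_t / sigma**2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def denoise_tokens(prior, num_samples, sigmas, seed=0):
    """Go from near-pure noise to discrete tokens over a decreasing noise schedule."""
    rng = np.random.default_rng(seed)
    num_tokens = prior.shape[0]
    eye = np.eye(num_tokens)
    x_t = sigmas[0] * rng.normal(size=(num_samples, num_tokens))
    for i, sigma in enumerate(sigmas):
        probs = clean_token_posterior(x_t, sigma, prior)
        # Sample one clean token per row by inverse-CDF sampling.
        u = rng.random((num_samples, 1))
        tokens = (probs.cumsum(axis=1) < u).sum(axis=1)
        tokens = np.minimum(tokens, num_tokens - 1)       # guard against rounding
        if i + 1 < len(sigmas):
            # Re-noise the predicted tokens at the next, smaller noise level.
            x_t = eye[tokens] + sigmas[i + 1] * rng.normal(size=x_t.shape)
    return tokens

prior = np.array([0.5, 0.3, 0.15, 0.05])     # stand-in for a learned denoiser
sigmas = np.geomspace(10.0, 0.05, num=12)    # hypothetical noise schedule
tokens = denoise_tokens(prior, num_samples=20000, sigmas=sigmas)
frequencies = np.bincount(tokens, minlength=4) / tokens.size
```

In this sketch every intermediate state is a noisy version of a full token vector rather than a partially masked sequence, which is one way to see how a noise-based process differs from a masking-based one.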
Robustness through Viewpoint Perturbation Augmentation
We also present a novel spherical viewpoint perturbation augmentation technique. It improves robustness to variations in camera angle and viewpoint without requiring additional training data.
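The source does not describe how the perturbation is computed. One plausible reading, sketched below under that assumption, moves the camera along a sphere centered on a fixed look-at target by jittering its azimuth and elevation, then re-aims it at the target. The function names, the angle range, and the z-up, camera-looks-along-negative-z convention are all illustrative choices, not details from the paper.

```python
import numpy as np

def perturb_viewpoint(cam_pos, target, max_angle_deg, rng):
    """Move the camera on the sphere around `target`, keeping its distance fixed."""
    offset = cam_pos - target
    radius = np.linalg.norm(offset)
    azimuth = np.arctan2(offset[1], offset[0])
    elevation = np.arcsin(offset[2] / radius)
    max_angle = np.deg2rad(max_angle_deg)
    azimuth = azimuth + rng.uniform(-max_angle, max_angle)
    elevation = elevation + rng.uniform(-max_angle, max_angle)
    elevation = np.clip(elevation, -1.5, 1.5)     # stay away from the poles
    new_offset = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    return target + new_offset

def look_at(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation whose rows are the camera's right, up, and backward axes."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up))
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward])

rng = np.random.default_rng(0)
cam_pos = np.array([1.0, 0.0, 0.5])               # hypothetical camera position
target = np.zeros(3)                              # hypothetical workspace center
new_pos = perturb_viewpoint(cam_pos, target, max_angle_deg=15.0, rng=rng)
rotation = look_at(new_pos, target)
```

Rendering or warping the training images to the new pose is a separate step that depends on the simulator or camera model and is left out here.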
Experimental Results
We conducted extensive experiments on the LIBERO, VLABench, and ManiSkill benchmarks, as well as real-world experiments on a Franka arm. E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by an average margin of 10.7%.
Conclusion
E0 advances Vision-Language-Action models by improving generalization and fine-grained control. It does so by combining a Tweedie discrete diffusion framework over quantized action tokens with spherical viewpoint perturbation augmentation.
