DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
arXiv:2603.29844v1
Introduction
Recent progress in Vision-Language-Action (VLA) models has been driven by pre-trained Vision-Language Models (VLMs). Yet most existing end-to-end VLAs use the VLM only as a multimodal encoder: vision-language features are mapped directly to low-level actions, which leaves the VLM's capacity for high-level decision-making largely unused.
Challenges in Existing Approaches
While the integration of VLMs with action execution has shown promise, it introduces several challenges:
- Underutilization of VLMs: Existing systems primarily focus on low-level actions, neglecting the VLM’s rich semantic capabilities.
- Training Instability: The direct mapping of features to actions can lead to instability, diminishing the quality of learned representations.
- Inadequate Decision-Making: High-level decision-making processes are often overlooked, limiting the model’s effectiveness in complex environments.
Introducing DIAL
To tackle these challenges, DIAL bridges high-level decision-making and low-level motor execution through a differentiable latent intent bottleneck.
How DIAL Works
DIAL consists of two primary components:
- System-2: This VLM-based component is responsible for latent world modeling. It synthesizes latent visual foresight within the VLM’s native feature space, effectively encoding intent and acting as a structural bottleneck.
- System-1: A lightweight policy that decodes the predicted intent along with the current observation to generate precise robot actions through latent inverse dynamics.
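The data flow between the two components can be sketched in a few lines. This is a minimal, dependency-free illustration, not the paper's implementation: the class names, the `dial_step` helper, the linear maps, and the tiny dimensions are all assumptions chosen so the example runs as-is. The only property taken from the description above is that System-2 emits an intent in the same feature space as the observation, and System-1 consumes that intent together with the current observation.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


@dataclass
class System2:
    """Hypothetical stand-in for the VLM-based latent world model."""
    weights: List[Vector]  # shape: latent_dim x (latent_dim + text_dim)

    def predict_intent(self, obs_latent: Vector, text_embedding: Vector) -> Vector:
        # Predict a future latent (the "intent") from observation and instruction.
        return matvec(self.weights, obs_latent + text_embedding)


@dataclass
class System1:
    """Hypothetical stand-in for the lightweight inverse-dynamics policy."""
    weights: List[Vector]  # shape: action_dim x (2 * latent_dim)

    def act(self, obs_latent: Vector, intent: Vector) -> Vector:
        # Decode an action from the current latent and the predicted intent.
        return matvec(self.weights, obs_latent + intent)


def dial_step(s2: System2, s1: System1, obs_latent: Vector, text_embedding: Vector) -> Vector:
    intent = s2.predict_intent(obs_latent, text_embedding)
    # The bottleneck lives in the same feature space as the observation latent.
    assert len(intent) == len(obs_latent)
    return s1.act(obs_latent, intent)


# Toy sizes: latent_dim=2, text_dim=1, action_dim=1 (illustrative values only).
s2 = System2(weights=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
s1 = System1(weights=[[-1.0, 0.0, 1.0, 0.0]])
action = dial_step(s2, s1, obs_latent=[0.5, -1.0], text_embedding=[0.25])
```

In this toy configuration System-1 only ever sees the instruction through the intent vector, which is the sense in which the intent acts as a structural bottleneck.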
Training Paradigm
To enhance optimization stability, DIAL employs a two-stage training paradigm:
- Decoupled Warmup Phase: In this initial phase, System-2 focuses on predicting latent futures while System-1 learns motor control under ground-truth future guidance, all within a unified feature space.
- End-to-End Joint Optimization: After the warmup, both systems are optimized jointly, so that action-aware gradients fine-tune the VLM backbone while preserving its pre-trained knowledge.
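The two stages can be made concrete with a toy example in which latents and actions are scalars and gradients are written by hand. Everything here is an assumption for illustration: the synthetic data, the squared-error losses, the learning rates, and in particular the weight `lam` on a world-model term during joint training, which is just one plausible way to keep System-2 anchored and is not taken from the source. The sketch shows only the structural difference: in the warmup, System-1 is fed the ground-truth future and no gradient crosses between the systems; in the joint stage, System-1 is fed System-2's prediction and the action error flows back into System-2.

```python
# Synthetic demonstrations: (z_t, z_future, action).
# Assumed toy world: z_future = 2 * z_t + 1, action = 0.5 * (z_future - z_t).
DATA = [(z, 2 * z + 1, 0.5 * (z + 1)) for z in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def system2(params, z_t):
    """Latent world model: predicts the future latent (the intent)."""
    return params["w"] * z_t + params["b"]


def system1(params, z_t, intent):
    """Latent inverse dynamics: action from current latent and intent."""
    return params["u"] * (intent - z_t)


def action_mse(params):
    """Closed-loop error: System-1 driven by System-2's predicted intent."""
    errs = [(system1(params, z, system2(params, z)) - a) ** 2 for z, _, a in DATA]
    return sum(errs) / len(errs)


def warmup_step(params, z_t, z_future, action, lr):
    """Stage 1: two independent losses, no gradient between the systems."""
    wm_err = system2(params, z_t) - z_future
    act_err = system1(params, z_t, z_future) - action  # ground-truth future as intent
    params["w"] -= lr * 2 * wm_err * z_t
    params["b"] -= lr * 2 * wm_err
    params["u"] -= lr * 2 * act_err * (z_future - z_t)


def joint_step(params, z_t, z_future, action, lr, lam=1.0):
    """Stage 2: the action error backpropagates through the predicted intent."""
    intent = system2(params, z_t)
    act_err = system1(params, z_t, intent) - action
    wm_err = intent - z_future
    grad_u = 2 * act_err * (intent - z_t)
    # Gradient reaching System-2: action-aware term plus the assumed anchor term.
    grad_intent = 2 * act_err * params["u"] + lam * 2 * wm_err
    params["u"] -= lr * grad_u
    params["w"] -= lr * grad_intent * z_t
    params["b"] -= lr * grad_intent


def train(warmup_epochs=100, joint_epochs=100):
    params = {"w": 0.0, "b": 0.0, "u": 0.0}
    for _ in range(warmup_epochs):
        for z_t, z_future, action in DATA:
            warmup_step(params, z_t, z_future, action, lr=0.1)
    for _ in range(joint_epochs):
        for z_t, z_future, action in DATA:
            joint_step(params, z_t, z_future, action, lr=0.02)
    return params
```

Because the warmup trains both systems against targets in the same latent space, System-1 can be switched from the ground-truth future to System-2's prediction without retraining from scratch, which is the practical point of sharing a unified feature space.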
Experimental Results
In extensive experiments on the RoboCasa GR1 Tabletop benchmark, DIAL sets a new state of the art while using 10 times fewer demonstrations than prior methods. By additionally leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors. This yields robust zero-shot generalization to unseen objects and novel configurations in real-world deployments on humanoid robots.
Conclusion
DIAL decouples intent from action through latent world modeling, which improves both the stability of training and the effectiveness of robotic actions in complex environments. The results suggest that, in a VLA, the VLM can serve as a high-level decision-maker and not only as a multimodal encoder.
