HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
arXiv:2604.18791v1
Abstract
Vision-Language-Action (VLA) models perform well on short-horizon manipulation tasks but fail systematically on long-horizon ones. The failure is not simply a matter of insufficient context length. It stems from three persistent deficiencies in the execution loop: the memory gap, the verification gap, and the recovery gap.
Introduction to HELM
To address these deficiencies, we introduce HELM, a model-agnostic framework for long-horizon manipulation with VLA models. HELM consists of three components:
- Episodic Memory Module (EMM): Retrieves relevant task history from CLIP-indexed keyframes.
- State Verifier (SV): A learned model that predicts action failure before execution from the observation, action, subgoal, and memory-conditioned context.
- Harness Controller (HC): Performs rollback and replanning so the system can adapt to unforeseen issues during task execution.
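The interaction between the three components can be sketched as a single decision step. This is a minimal illustration, not the authors' implementation: the names `EpisodicMemory`, `harness_step`, `toy_policy`, and `toy_verifier` are hypothetical, plain float lists stand in for CLIP embeddings, and a hand-written rule stands in for the learned verifier. Rollback to an earlier state is not modeled; a returned action of `None` only signals that the caller would need to roll back.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class EpisodicMemory:
    """Keyframe store; the embeddings are placeholders for CLIP features."""
    keyframes: List[Tuple[Vector, str]] = field(default_factory=list)

    def add(self, embedding: Vector, note: str) -> None:
        self.keyframes.append((embedding, note))

    def retrieve(self, query: Vector, k: int = 2) -> List[str]:
        # Rank stored keyframes by similarity to the current observation.
        ranked = sorted(self.keyframes, key=lambda kf: cosine(query, kf[0]), reverse=True)
        return [note for _, note in ranked[:k]]


def harness_step(
    obs: Vector,
    subgoal: str,
    memory: EpisodicMemory,
    policy: Callable[[Vector, str, List[str], List[str]], str],
    verifier: Callable[[Vector, str, str, List[str]], float],
    threshold: float = 0.5,
    max_replans: int = 2,
) -> Tuple[Optional[str], int]:
    """Propose an action, check it before execution, and replan if it looks risky.

    Returns (action, replans_used). The action is None when every candidate
    was rejected, which a full controller would treat as a rollback trigger.
    """
    context = memory.retrieve(obs)
    rejected: List[str] = []
    for attempt in range(max_replans + 1):
        action = policy(obs, subgoal, context, rejected)
        p_fail = verifier(obs, action, subgoal, context)  # predicted failure probability
        if p_fail < threshold:
            return action, attempt
        rejected.append(action)
    return None, max_replans + 1


# Toy usage with hypothetical stand-ins for the VLA policy and the learned verifier.
memory = EpisodicMemory()
memory.add([1.0, 0.0], "drawer opened")
memory.add([0.0, 1.0], "mug placed on plate")


def toy_policy(obs, subgoal, context, rejected):
    # Return the first candidate action that has not been rejected yet.
    for candidate in ("grasp", "push", "wait"):
        if candidate not in rejected:
            return candidate
    return "wait"


def toy_verifier(obs, action, subgoal, context):
    # Pretend that grasping is likely to fail once memory says the drawer is open.
    return 0.9 if action == "grasp" and "drawer opened" in context else 0.1


action, replans = harness_step([0.9, 0.1], "close drawer", memory, toy_policy, toy_verifier)
print(action, replans)
```

In this toy run the verifier's judgment depends on the retrieved memory: without the "drawer opened" keyframe in context, the first proposal would have been accepted.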
The State Verifier: A Core Contribution
The State Verifier is the core learning contribution of HELM. In our evaluations, the SV consistently outperforms rule-based feasibility checks and ensemble-uncertainty baselines. Its effectiveness depends critically on access to episodic memory.
Performance Improvements
On the LIBERO-LONG benchmark, HELM improves task success rate by 23.1 percentage points over OpenVLA, from 58.4% to 81.5%. By comparison, extending the context window to H=32 yields only a 5.4-point improvement, and same-budget LoRA adaptation remains 12.2 points below HELM.
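The reported deltas can be turned into absolute success rates with simple arithmetic. The sketch below assumes the 5.4-point context-extension gain is measured against the same 58.4% OpenVLA baseline and that the 12.2-point LoRA gap is measured against HELM's 81.5%. The two derived rates are implied by those assumptions, not quoted from the paper.

```python
# Figures stated in the text (percentage points).
baseline = 58.4      # OpenVLA on LIBERO-LONG
helm_gain = 23.1     # HELM improvement over OpenVLA
context_gain = 5.4   # gain from extending the context window to H=32 (assumed vs. baseline)
lora_gap = 12.2      # same-budget LoRA shortfall relative to HELM

helm = round(baseline + helm_gain, 1)                 # matches the stated 81.5
context_extension = round(baseline + context_gain, 1) # implied, not quoted
lora = round(helm - lora_gap, 1)                      # implied, not quoted

print(f"HELM: {helm}%  context H=32: {context_extension}%  LoRA: {lora}%")
```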
Enhancements Across Various Tasks
Beyond LIBERO-LONG, HELM also improves long-horizon performance on CALVIN and substantially improves recovery success rates under controlled perturbations. Ablations and mechanism analyses further isolate the contribution of each HELM component.
Introducing LIBERO-Recovery
We also release LIBERO-Recovery, a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation tasks.
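The details of the LIBERO-Recovery protocol are not given here, so the following is only a toy illustration of the general idea behind perturbation injection: disturb the scene at a chosen step and measure how often a policy still completes the task. The 1-D task, the function names, and the perturbation itself are all hypothetical.

```python
from typing import Callable, Iterable, Optional

Policy = Callable[[int, int, int], int]  # (step, observed position, target) -> move


def run_episode(policy: Policy, target: int = 5, horizon: int = 12,
                perturb_at: Optional[int] = None, shift: int = -2) -> bool:
    """Toy 1-D task: move a block from position 0 to `target` within `horizon` steps."""
    pos = 0
    for t in range(horizon):
        if perturb_at is not None and t == perturb_at:
            pos += shift  # injected perturbation: the block is knocked back
        pos += policy(t, pos, target)
    return pos == target


def open_loop(t: int, pos: int, target: int) -> int:
    # Replays a fixed plan and ignores the observed position.
    return 1 if t < target else 0


def closed_loop(t: int, pos: int, target: int) -> int:
    # Re-checks the observed position at every step, so it can recover.
    return 1 if pos < target else 0


def recovery_rate(policy: Policy, perturb_steps: Iterable[int]) -> float:
    """Fraction of perturbed episodes that still end in success."""
    outcomes = [run_episode(policy, perturb_at=t) for t in perturb_steps]
    return sum(outcomes) / len(outcomes)


for name, policy in (("open_loop", open_loop), ("closed_loop", closed_loop)):
    print(name, run_episode(policy), recovery_rate(policy, range(1, 5)))
```

Both toy policies succeed without a perturbation; only the one that reacts to the observed state succeeds after one, which is the distinction a recovery metric is meant to expose.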
Conclusion
HELM targets the memory, verification, and recovery gaps that limit VLA models on long-horizon manipulation. By addressing all three within a model-agnostic harness, it improves both task success and failure recovery over the baselines evaluated here.
