Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement
Summary: arXiv:2603.23676v1
Abstract: We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities.
Introduction
Robots that act on natural-language instructions must turn an often incomplete goal description into a concrete sequence of actions. This is harder in 3D environments where the task takes many steps, such as rearranging boxes: the planner has to track many objects, reason about 3D geometry, and respect constraints the instruction leaves implicit. Symbolic planners and 2D vision-language models both struggle under these conditions, which motivates an approach built on 3D vision-language models.
Proposed Solution: Reactive Action Mask Planner (RAMP-3D)
To address these limitations, we introduce the Reactive Action Mask Planner (RAMP-3D). RAMP-3D formulates long-horizon planning as the sequential, reactive prediction of paired 3D masks:
- Which-object mask: This mask indicates the specific object to be manipulated.
- Which-target-region mask: This mask specifies the location where the object should be placed.
At each step, the planner takes the current RGB-D observation and the natural-language task specification and predicts one mask pair, which defines a single pick-and-place action. Repeating this prediction after every action yields a multi-step plan for 3D box rearrangement.
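The reactive loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `reactive_plan`, `predict_masks`, `ToyWorld`, and `stub_predictor` are hypothetical, masks are represented as boolean flags over the observed 3D points, and two choices are assumptions made for the sketch (reducing each mask to its centroid, and treating an empty mask as the stop signal). The stub predictor stands in for a 3D VLM and ignores the instruction text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]
Mask = List[bool]  # one flag per point in the observed point cloud


@dataclass
class PickPlace:
    pick: Point   # centroid of the which-object mask
    place: Point  # centroid of the which-target-region mask


def centroid(points: List[Point], mask: Mask) -> Optional[Point]:
    """Mean of the masked points, or None if the mask selects nothing."""
    selected = [p for p, keep in zip(points, mask) if keep]
    if not selected:
        return None
    n = len(selected)
    return (sum(p[0] for p in selected) / n,
            sum(p[1] for p in selected) / n,
            sum(p[2] for p in selected) / n)


def reactive_plan(observe, predict_masks, execute, instruction, max_steps=50):
    """Re-observe, predict one mask pair, act, and repeat."""
    actions = []
    for _ in range(max_steps):
        points = observe()
        object_mask, target_mask = predict_masks(points, instruction)
        pick = centroid(points, object_mask)
        place = centroid(points, target_mask)
        if pick is None or place is None:
            break  # assumption: an empty mask means nothing is left to do
        action = PickPlace(pick, place)
        execute(action)
        actions.append(action)
    return actions


class ToyWorld:
    """Hypothetical scene: each box and each target slot is a single 3D point."""

    def __init__(self):
        self.boxes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
        self.slots = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]

    def observe(self):
        return self.boxes + self.slots  # boxes first, then slots

    def execute(self, action):
        self.boxes[self.boxes.index(action.pick)] = action.place


def stub_predictor(points, instruction):
    """Stand-in for a 3D VLM: selects the first misplaced box and first free slot."""
    half = len(points) // 2  # toy convention from ToyWorld.observe
    boxes, slots = points[:half], points[half:]
    object_mask = [False] * len(points)
    target_mask = [False] * len(points)
    misplaced = [i for i, b in enumerate(boxes) if b not in slots]
    free = [j for j, s in enumerate(slots) if s not in boxes]
    if misplaced and free:
        object_mask[misplaced[0]] = True
        target_mask[half + free[0]] = True
    return object_mask, target_mask


world = ToyWorld()
plan = reactive_plan(world.observe, stub_predictor, world.execute,
                     "move every box into a free slot")
```

Because the masks are predicted again from a fresh observation after every action, the loop needs no symbolic state: if an action has an unexpected outcome, the next prediction is made from the scene as it actually is.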
Experimental Results
We evaluated RAMP-3D on 11 task variants set in warehouse-style environments, covering scenes with 1 to 30 boxes and diverse natural-language constraints:
- RAMP-3D achieved a success rate of 79.5% on long-horizon rearrangement tasks.
- The system significantly outperformed 2D VLM-based baselines.
Conclusion
These results suggest that mask-based reactive policies are a promising alternative to symbolic pipelines for long-horizon planning. RAMP-3D builds on recent 3D vision-language models, which ground natural-language referents in 3D segmentation masks. Potential real-world applications range from warehouse automation to assistive robotics.
