3D Vision-Language Masks for Long-Horizon Box Rearrangement

Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

Source: arXiv:2603.23676v1

Abstract: We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities.

Introduction

A robot that follows natural-language instructions has to turn an under-specified goal into a sequence of concrete actions. This is harder in 3D environments when the task takes many steps, as box rearrangement does. Symbolic planners depend on brittle relational grounding of states and goals, and 2D VLMs that generate action sequences directly struggle with many objects, rich 3D geometry, and implicit semantic constraints. These limits motivate approaches built on 3D vision-language models, which ground natural-language referents to 3D segmentation masks.

Proposed Solution: Reactive Action Mask Planner (RAMP-3D)

To address these limitations, we introduce the Reactive Action Mask Planner (RAMP-3D). RAMP-3D formulates long-horizon planning as the sequential, reactive prediction of paired 3D masks:

Which-object mask: This mask indicates the specific object to be manipulated.
Which-target-region mask: This mask specifies the location where the object should be placed.

RAMP-3D takes RGB-D observations and a natural-language task specification as input. Each predicted mask pair defines one pick-and-place action, and predicting pairs sequentially yields the multi-step plan needed for 3D box rearrangement.
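The reactive loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `MaskPair`, `mask_centroid`, `run_reactive_planner`, `observe`, `predict`, and `execute` are hypothetical, the point cloud is a plain list of `(x, y, z)` tuples, reducing a mask to its centroid is a simplification chosen here, and using `None` as a "task finished" signal is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class MaskPair:
    """One reactive prediction: two boolean masks over the same point cloud."""
    object_mask: List[bool]  # True for points on the box to pick
    target_mask: List[bool]  # True for points in the placement region


def mask_centroid(points: Sequence[Point], mask: Sequence[bool]) -> Point:
    """Mean of the masked points; a crude stand-in for a pick or place pose."""
    selected = [p for p, m in zip(points, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    n = len(selected)
    return (sum(p[0] for p in selected) / n,
            sum(p[1] for p in selected) / n,
            sum(p[2] for p in selected) / n)


def run_reactive_planner(
    observe: Callable[[], List[Point]],
    predict: Callable[[List[Point], str], Optional[MaskPair]],
    execute: Callable[[Point, Point], None],
    instruction: str,
    max_steps: int = 50,
) -> int:
    """Predict one mask pair per step, act, then re-observe. Returns steps taken."""
    for step in range(max_steps):
        points = observe()                   # fresh observation every step
        pair = predict(points, instruction)  # hypothetical model call
        if pair is None:                     # assumed "task finished" signal
            return step
        pick = mask_centroid(points, pair.object_mask)
        place = mask_centroid(points, pair.target_mask)
        execute(pick, place)
    return max_steps


# Toy stand-ins so the loop can run: each "box" is a single point and the
# last point of every observation marks the target region.
boxes = [(-1.0, 0.0, 0.0), (-2.0, 1.0, 0.0)]
GOAL = (1.0, 0.0, 0.0)


def toy_observe() -> List[Point]:
    return list(boxes) + [GOAL]


def toy_predict(points: List[Point], instruction: str) -> Optional[MaskPair]:
    last = len(points) - 1
    for i, p in enumerate(points[:-1]):
        if p[0] < 1.0:  # this box has not been moved yet
            return MaskPair(
                object_mask=[j == i for j in range(len(points))],
                target_mask=[j == last for j in range(len(points))],
            )
    return None  # every box is at the goal


def toy_execute(pick: Point, place: Point) -> None:
    boxes[boxes.index(pick)] = place


steps = run_reactive_planner(
    toy_observe, toy_predict, toy_execute, "move all boxes to the goal"
)
```

The point of the sketch is the control flow: the planner never commits to a full action sequence up front. It predicts a single which-object and which-target-region pair, acts, and then predicts again from a new observation.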

Experimental Results

We evaluated RAMP-3D on 11 task variants set in warehouse-style environments, covering scenes with 1 to 30 boxes and diverse natural-language constraints:

RAMP-3D achieved a 79.5% success rate on long-horizon rearrangement tasks.
It significantly outperformed 2D VLM-based baselines.

Conclusion

Mask-based reactive policies offer a promising alternative to symbolic pipelines for long-horizon planning. RAMP-3D builds on recent advances in 3D vision-language models, and its results point to potential applications ranging from warehouse automation to assistive robotics.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
