3D Vision-Language Masks for Long-Horizon Box Rearrangement

Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

Source: arXiv:2603.23676v1

Abstract: We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities.

Introduction

Robots that follow natural-language instructions must turn loosely worded goals into concrete actions. The problem gets harder in 3D environments where a task takes many steps, such as rearranging boxes: the planner has to track many objects, reason about their 3D geometry, and respect constraints the instruction leaves implicit. As the abstract notes, symbolic planners and 2D VLM-based action generation both struggle under these conditions.

Proposed Solution: Reactive Action Mask Planner (RAMP-3D)

To address these limitations, the authors introduce the Reactive Action Mask Planner (RAMP-3D). RAMP-3D formulates long-horizon planning as the sequential, reactive prediction of paired 3D masks:

  • Which-object mask: This mask indicates the specific object to be manipulated.
  • Which-target-region mask: This mask specifies the location where the object should be placed.

Each mask pair corresponds to one pick-and-place action. The system takes RGB-D observations and a natural-language task specification as input and reactively generates a multi-step sequence of such actions for 3D box rearrangement.
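The reactive loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: every name here (`MaskPair`, `run_reactive_planner`, `ToyWorld`, `toy_predict`) is hypothetical, the predictor is a hand-written stub standing in for a 3D VLM, the use of mask centroids as pick and place positions is an assumed rule, and returning `None` to signal completion is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class MaskPair:
    """One planner step: per-point boolean masks over the observed scene."""
    which_object: List[bool]   # True for points on the box to pick
    which_target: List[bool]   # True for points in the placement region


def centroid(points: List[Point], mask: List[bool]) -> Point:
    """Mean position of the masked points (an assumed mask-to-pose rule)."""
    chosen = [p for p, keep in zip(points, mask) if keep]
    if not chosen:
        raise ValueError("mask selects no points")
    return tuple(sum(p[axis] for p in chosen) / len(chosen) for axis in range(3))


def run_reactive_planner(
    observe: Callable[[], List[Point]],
    predict: Callable[[List[Point], str], Optional[MaskPair]],
    execute: Callable[[Point, Point], None],
    instruction: str,
    max_steps: int = 50,
) -> List[Tuple[Point, Point]]:
    """Re-observe, predict one mask pair, act, and repeat until done."""
    actions = []
    for _ in range(max_steps):
        points = observe()                    # fresh observation each step
        pair = predict(points, instruction)   # stand-in for the 3D VLM
        if pair is None:                      # assumed "task finished" signal
            break
        pick = centroid(points, pair.which_object)
        place = centroid(points, pair.which_target)
        execute(pick, place)
        actions.append((pick, place))
    return actions


class ToyWorld:
    """Toy scene: two box points (rows 0-1) and two free slots (rows 2-3)."""

    def __init__(self) -> None:
        self.points: List[Point] = [
            (0.0, 0.0, 0.0), (0.0, 1.0, 0.0),   # boxes
            (2.0, 0.0, 0.0), (2.0, 1.0, 0.0),   # target slots
        ]

    def observe(self) -> List[Point]:
        return list(self.points)

    def execute(self, pick: Point, place: Point) -> None:
        # Move whichever box is closest to the pick point onto the place point.
        def dist2(i: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(self.points[i], pick))
        self.points[min(range(2), key=dist2)] = place


def toy_predict(points: List[Point], instruction: str) -> Optional[MaskPair]:
    """Stub policy: send the first unmoved box (x < 1) to its matching slot."""
    for i in range(2):
        if points[i][0] < 1.0:
            obj = [j == i for j in range(len(points))]
            tgt = [j == i + 2 for j in range(len(points))]
            return MaskPair(obj, tgt)
    return None


world = ToyWorld()
actions = run_reactive_planner(
    world.observe, toy_predict, world.execute, "move the boxes to the slots"
)
print(len(actions), world.points[:2])
```

The point of the sketch is the control flow: the planner never commits to a full action sequence up front. It predicts one mask pair per fresh observation, so a step that goes wrong is visible in the next observation rather than baked into a stale plan.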

Experimental Results

RAMP-3D was evaluated on 11 task variants in warehouse-style environments, with 1 to 30 boxes and diverse natural-language constraints:

  • RAMP-3D achieved a 79.5% success rate on long-horizon rearrangement tasks.
  • It significantly outperformed 2D VLM-based baselines.

Conclusion

These results suggest that mask-based reactive policies are a promising alternative to symbolic pipelines for long-horizon planning. RAMP-3D builds on recent 3D vision-language models, which ground natural-language referents in 3D segmentation masks. Possible real-world applications range from warehouse automation to assistive robotics.


Lazarus Omolua
https://richlyai.com/blog