3D Instruction Ambiguity Detection
arXiv:2601.05991v2
Abstract: In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like “Pass me the vial” in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene.
To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity.
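To make the task definition concrete, here is a minimal toy sketch in Python. It assumes a heavily simplified setting in which a scene is a list of labeled objects and an instruction has already been reduced to a target category plus required attributes. The names SceneObject and classify_instruction, the three output labels, and the treatment of a missing referent are illustrative assumptions, not the Ambi3D data format or the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical, simplified stand-in for an object in a 3D scene."""
    object_id: int
    category: str
    attributes: frozenset = frozenset()


def classify_instruction(scene, category, required_attributes=frozenset()):
    """Label an instruction by how many scene objects it could refer to.

    Assumption: the instruction is already parsed into a category and a set
    of required attributes. Real instructions need language understanding.
    """
    matches = [
        obj for obj in scene
        if obj.category == category and required_attributes <= obj.attributes
    ]
    if len(matches) == 1:
        return "unambiguous"
    if len(matches) > 1:
        return "ambiguous"
    return "no_referent"  # assumption: treated as its own case in this toy


scene = [
    SceneObject(1, "vial", frozenset({"red"})),
    SceneObject(2, "vial", frozenset({"blue"})),
    SceneObject(3, "scalpel"),
]

print(classify_instruction(scene, "vial"))           # two vials match
print(classify_instruction(scene, "vial", {"red"}))  # only the red vial matches
```

In this toy, "Pass me the vial" is ambiguous because two vials are present, while adding the attribute "red" singles out one object. The same instruction can therefore be clear in one scene and ambiguous in another, which is why the task is defined relative to a given 3D scene.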
Key Findings
- Linguistic ambiguity can have severe consequences in safety-critical settings, yet most embodied AI research assumes instructions are clear.
- Ambi3D is a large-scale benchmark with over 700 3D scenes and approximately 22,000 instructions.
- Current state-of-the-art 3D LLMs have difficulty in accurately determining instruction ambiguity.
- AmbiVer, the proposed two-stage framework, collects explicit visual evidence from multiple views and uses it to guide a VLM's ambiguity judgment.
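The two-stage structure described for AmbiVer (collect evidence from multiple views, then let a VLM judge) can be sketched as a small pipeline. This is only a structural sketch under stated assumptions: the Evidence type, the function names, and the stub callables standing in for the evidence collector and the VLM are all hypothetical, and the paper's actual components are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Evidence:
    """Hypothetical record of what one view contributes."""
    view_index: int
    description: str


def collect_evidence(views: Sequence[str], instruction: str,
                     describe_view: Callable[[str, str], str]) -> List[Evidence]:
    """Stage 1 (sketch): gather explicit evidence from each view."""
    evidence = []
    for index, view in enumerate(views):
        description = describe_view(view, instruction)
        if description:  # keep only views that contribute something
            evidence.append(Evidence(index, description))
    return evidence


def detect_ambiguity(views: Sequence[str], instruction: str,
                     describe_view: Callable[[str, str], str],
                     vlm_judge: Callable[[str, List[Evidence]], bool]) -> bool:
    """Stage 2 (sketch): hand the collected evidence to a judge."""
    evidence = collect_evidence(views, instruction, describe_view)
    return vlm_judge(instruction, evidence)


# Stubs for illustration only: views are text captions, not rendered images.
def stub_describe(view: str, instruction: str) -> str:
    target = instruction.rstrip(".").split()[-1].lower()
    return view if target in view.lower() else ""


def stub_judge(instruction: str, evidence: List[Evidence]) -> bool:
    # Stand-in for a VLM call: ambiguous if several views show a candidate.
    return len(evidence) > 1


views = ["a red vial on the tray", "a blue vial near the sink", "an empty shelf"]
print(detect_ambiguity(views, "Pass me the vial", stub_describe, stub_judge))
```

Passing the collector and the judge as callables keeps the sketch independent of any particular model API. A real system would replace the stubs with a view renderer, a perception module, and a VLM.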
Importance of the Research
This research matters most in fields such as healthcare, manufacturing, and autonomous systems, where precise communication is crucial. Linguistic ambiguity can cause misunderstandings with potentially harmful outcomes. By defining 3D Instruction Ambiguity Detection as a task, the work aims to support AI systems that confirm what an instruction means before executing it.
Future Implications
The findings point to a need for models that interpret nuanced language against its physical context. AmbiVer addresses one existing limitation by grounding the ambiguity judgment in explicit multi-view evidence, and it offers a reference point for later work in embodied AI. As these methods are refined, they could improve the safety, efficiency, and trustworthiness of AI-assisted environments.
Accessing the Research
The code and dataset are available at the Ambi3D Official Site.
