Enhancing VLMs with 3D Primitives for Spatial Reasoning

3D Primitives are a Spatial Language for VLMs

A recent arXiv paper, “3D Primitives are a Spatial Language for VLMs,” examines a paradox in Vision-Language Models (VLMs): they can generate executable code that reconstructs a 3D scene from an image, yet they struggle with simpler spatial questions about the same image. The authors argue that 3D geometric primitives, such as cubes, spheres, and cylinders, make a robust intermediate representation for improving spatial understanding in VLMs.

Key Contributions of the Study

The study presents three major contributions that highlight the utility of 3D primitives in improving VLMs:

Introduction of SpatialBabel: This benchmark evaluates fourteen VLMs based on their performance in primitive-based 3D scene reconstruction. The study spans six different scene-code languages, which include various programming languages and declarative formats tailored for 3D primitive scenes. An intriguing finding is that a single model’s object-detection F1 score can fluctuate by as much as 5.7 times depending on the language used.
Development of Code-CoT: The research introduces an inference strategy called Code Chain-of-Thought (Code-CoT), which routes spatial reasoning through primitive-based code generation. Code-CoT has been shown to improve the SpatialBabel-QA-Score by as much as 6.4% on primitive scenes. It also raises accuracy on the real-photo CV-Bench-3D benchmark by 5.0% for VLMs that are proficient at coding tasks.
Proposal of S³-FT: Self-Supervised Spatial Fine-Tuning (S³-FT) distills primitive spatial knowledge into general visual reasoning without requiring human labels or a teacher model. It works by parsing the model’s own Three.js primitive reconstructions into structured annotations. S³-FT improves the Qwen3-VL-8B model across several benchmarks: 4.6% to 8.6% on SpatialBabel-Primitive-QA, 9.7% on CV-Bench-2D, and 17% on HallusionBench. Notably, these gains transfer across different model families.
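To make the S³-FT parsing step more concrete, here is a minimal Python sketch of turning Three.js-style primitive code into structured annotations. It is an illustration only, not the authors' pipeline: the sample snippet, the variable names in it, the regular expressions, and the annotation fields (name, shape, position) are all assumptions made for this example.

```python
import re

# Hypothetical Three.js-style reconstruction, as a VLM might emit it.
# The object names and code layout are assumed for illustration.
THREE_JS_SNIPPET = """
const cube = new THREE.Mesh(new THREE.BoxGeometry(1, 1, 1), redMaterial);
cube.position.set(-2, 0.5, 0);
const ball = new THREE.Mesh(new THREE.SphereGeometry(0.5), blueMaterial);
ball.position.set(1.5, 0.5, -3);
"""

# Matches `const <name> = new THREE.Mesh(new THREE.<Shape>Geometry(`.
MESH_RE = re.compile(
    r"const\s+(\w+)\s*=\s*new\s+THREE\.Mesh\(\s*new\s+THREE\.(\w+)Geometry\("
)
# Matches `<name>.position.set(x, y, z)`.
POS_RE = re.compile(r"(\w+)\.position\.set\(\s*([^)]*)\)")


def parse_primitives(code: str) -> list[dict]:
    """Extract one annotation per mesh that has an explicit position."""
    shapes = {name: kind.lower() for name, kind in MESH_RE.findall(code)}
    annotations = []
    for name, args in POS_RE.findall(code):
        if name not in shapes:
            continue  # position set on something that is not a known mesh
        x, y, z = (float(v) for v in args.split(","))
        annotations.append(
            {"name": name, "shape": shapes[name], "position": (x, y, z)}
        )
    return annotations


if __name__ == "__main__":
    for annotation in parse_primitives(THREE_JS_SNIPPET):
        print(annotation)
```

A regex pass like this only handles the simple code layout assumed here; real model output would likely need a proper JavaScript parser.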

Implications for Future Research

The findings establish that 3D geometric primitives expressed in code serve both as an effective diagnostic tool and as a transferable spatial vocabulary for VLMs. Routing reasoning through primitive-based reconstruction improves the models’ spatial reasoning, which addresses the paradox that they can reconstruct a scene in code yet falter on simpler spatial questions about it.
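One way to see why primitives in code can act as a spatial vocabulary: once a scene is written down as positioned primitives, many spatial questions reduce to comparing coordinates. The Python sketch below illustrates that idea under stated assumptions. The Primitive class, the helper functions, the axis convention (+x points right), and the camera position are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Primitive:
    name: str
    shape: str
    # (x, y, z) in scene units; assumed convention: +x points to the right.
    position: tuple[float, float, float]


def is_left_of(a: Primitive, b: Primitive) -> bool:
    """True if `a` sits further left than `b` along the assumed x axis."""
    return a.position[0] < b.position[0]


def closer_to_camera(
    a: Primitive,
    b: Primitive,
    camera: tuple[float, float, float] = (0.0, 0.0, 5.0),  # assumed camera
) -> Primitive:
    """Return whichever primitive is nearer to the camera (Euclidean)."""
    if math.dist(a.position, camera) <= math.dist(b.position, camera):
        return a
    return b


# Hypothetical scene with two primitives.
cube = Primitive("cube", "box", (-2.0, 0.5, 0.0))
ball = Primitive("ball", "sphere", (1.5, 0.5, -3.0))

if __name__ == "__main__":
    print("cube left of ball:", is_left_of(cube, ball))
    print("closer to camera:", closer_to_camera(cube, ball).name)
```

In this toy scene the cube is both to the left of the ball and nearer to the assumed camera, so both questions are answered by arithmetic on the reconstructed coordinates rather than by direct visual judgment.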

The research team plans to release all related artifacts upon publication. The results could inform further work on the spatial understanding of VLMs, with potential applications in robotics, augmented reality, and complex visual reasoning tasks.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
