Enhancing VLMs with 3D Primitives for Spatial Reasoning

3D Primitives are a Spatial Language for VLMs

A recent arXiv study, “3D Primitives are a Spatial Language for VLMs,” examines a paradox in Vision-Language Models (VLMs): they can generate executable code that reconstructs a 3D scene from an image, yet they struggle with simpler spatial questions about that same image. The research proposes 3D geometric primitives, such as cubes, spheres, and cylinders, as a robust intermediate representation for improving spatial understanding in VLMs.
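To make the idea of primitives as an intermediate representation concrete, here is a minimal Python sketch. The `Primitive` schema, the axis convention (+x to the right, +z toward the viewer), and the helper names are illustrative assumptions, not the paper's format. It shows how a spatial question becomes a simple coordinate comparison once a scene is written down as primitives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    """One object in a primitive scene (hypothetical schema, not the paper's)."""
    shape: str                             # e.g. "cube", "sphere", "cylinder"
    color: str
    position: tuple[float, float, float]   # (x, y, z); assumed +x right, +z toward viewer


def left_of(a: Primitive, b: Primitive) -> bool:
    """True if a sits to the left of b under the assumed axis convention."""
    return a.position[0] < b.position[0]


def closer_to_camera(a: Primitive, b: Primitive) -> bool:
    """True if a is nearer the viewer than b (larger z under the assumed convention)."""
    return a.position[2] > b.position[2]


# A two-object scene written as primitives (made-up values for illustration).
cube = Primitive("cube", "red", (-1.0, 0.5, 0.0))
sphere = Primitive("sphere", "blue", (1.5, 0.5, -2.0))

print(left_of(cube, sphere))           # True
print(closer_to_camera(cube, sphere))  # True
```

Once the scene is explicit, questions such as "which object is on the left?" no longer depend on reading pixels directly, which is the intuition behind using primitives as a spatial vocabulary.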

Key Contributions of the Study

The study presents three major contributions that highlight the utility of 3D primitives in improving VLMs:

  • Introduction of SpatialBabel: This benchmark evaluates fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages, spanning programming languages and declarative formats suited to 3D primitive scenes. One notable finding is that a single model’s object-detection F1 score can vary by as much as 5.7 times depending on the language used.
  • Development of Code-CoT: Code Chain-of-Thought (Code-CoT) is an inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT has been shown to raise the SpatialBabel-QA-Score by as much as 6.4% on primitive scenes, and it improves accuracy on the real-photo CV-Bench-3D benchmark by 5.0% for VLMs that are proficient at coding.
  • Proposal of S3-FT: Self-Supervised Spatial Fine-Tuning (S3-FT) distills primitive spatial knowledge into general visual reasoning without requiring human labels or a teacher model. By parsing the model’s own Three.js primitive reconstructions into structured annotations, S3-FT improves Qwen3-VL-8B across several benchmarks: 4.6% to 8.6% on SpatialBabel-Primitive-QA, 9.7% on CV-Bench-2D, and 17% on HallusionBench. The study reports that these gains transfer across model families.
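The parsing step described for S3-FT can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the regular expressions, the `parse_scene` and `relation_qa` helpers, the annotation fields, and the sample snippet are all hypothetical. The only Three.js facts relied on are that `BoxGeometry`, `SphereGeometry`, and `CylinderGeometry` are geometry classes and that `position.set(x, y, z)` places a mesh.

```python
import re

# Map Three.js geometry class names to primitive labels (assumed label set).
GEOMETRY_TO_SHAPE = {
    "BoxGeometry": "cube",
    "SphereGeometry": "sphere",
    "CylinderGeometry": "cylinder",
}

# Matches: const NAME = new THREE.Mesh(new THREE.SomeGeometry(
MESH_RE = re.compile(r"const\s+(\w+)\s*=\s*new\s+THREE\.Mesh\(\s*new\s+THREE\.(\w+)\(")
# Matches: NAME.position.set(x, y, z)
POS_RE = re.compile(
    r"(\w+)\.position\.set\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)"
)


def parse_scene(code: str) -> list[dict]:
    """Turn generated Three.js code into a list of structured annotations."""
    shapes = {name: GEOMETRY_TO_SHAPE.get(geom, geom) for name, geom in MESH_RE.findall(code)}
    positions = {
        name: (float(x), float(y), float(z)) for name, x, y, z in POS_RE.findall(code)
    }
    # Meshes without an explicit position stay at the origin, the Three.js default.
    return [
        {"name": name, "shape": shape, "position": positions.get(name, (0.0, 0.0, 0.0))}
        for name, shape in shapes.items()
    ]


def relation_qa(a: dict, b: dict) -> dict:
    """Derive one left/right question-answer pair (ties are labeled 'right' here)."""
    answer = "left" if a["position"][0] < b["position"][0] else "right"
    question = f"Is the {a['shape']} to the left or right of the {b['shape']}?"
    return {"question": question, "answer": answer}


# Made-up model output used only to exercise the parser.
THREE_JS = """
const box = new THREE.Mesh(new THREE.BoxGeometry(1, 1, 1), material);
box.position.set(-1, 0.5, 0);
const ball = new THREE.Mesh(new THREE.SphereGeometry(0.5), material);
ball.position.set(2, 0.5, -1);
"""

annotations = parse_scene(THREE_JS)
qa = relation_qa(annotations[0], annotations[1])
print(qa["question"], "->", qa["answer"])
```

Because the annotations come from the model's own reconstruction, no human labels or teacher model enter the loop in this sketch. A regex parser is only a stand-in; a real pipeline would likely need a proper JavaScript parser to handle varied code.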

Implications for Future Research

The findings indicate that 3D geometric primitives expressed in code serve both as an effective diagnostic tool and as a transferable spatial vocabulary for VLMs. Routing reasoning through primitive-based scene reconstruction improves the models’ spatial reasoning and addresses the paradox of strong reconstruction alongside weak performance on simpler spatial queries.
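As a rough illustration of the diagnostic use, an object-detection F1 score for a reconstructed scene can be computed by comparing the primitives a model emits against the ground-truth objects. The sketch below assumes objects are matched by label only, as a multiset; the benchmark's actual matching criterion is not described in this summary, and the example labels are made up.

```python
from collections import Counter


def detection_f1(predicted: list[str], expected: list[str]) -> float:
    """F1 over object labels, matching predicted to expected as multisets (assumed criterion)."""
    pred, gold = Counter(predicted), Counter(expected)
    true_pos = sum((pred & gold).values())  # labels present in both, counted with multiplicity
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reconstructions of the same scene in two scene-code languages.
gold = ["cube", "sphere", "cylinder"]
f1_language_a = detection_f1(["cube", "sphere", "cylinder"], gold)
f1_language_b = detection_f1(["cube"], gold)

print(round(f1_language_a, 3), round(f1_language_b, 3))
```

Computing the same metric for one model across several output languages is what exposes the language sensitivity the study reports.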

The research team plans to release all related artifacts upon publication. The work could support further progress in the spatial understanding of VLMs, with potential applications in robotics, augmented reality, and complex visual reasoning tasks.

Lazarus Omolua, https://richlyai.com/blog