Discovering Belief Geometries Using Sparse Autoencoders

Finding Belief Geometries with Sparse Autoencoders

Source: arXiv:2604.02685v1 (announce type: cross)

Abstract

Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question.
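To make the notion of a belief state concrete, here is a minimal sketch of the Bayesian filtering update an observer performs over the hidden states of an HMM. The 3-state, 2-token transition tensor `T` is a made-up example, not the process used in the paper. Each belief is a probability vector, so every point the observer visits lies on the simplex whose vertices are the hidden states.

```python
import numpy as np

# Hypothetical HMM (illustration only):
# T[x][i, j] = P(next state j, emit token x | current state i).
T = np.array([
    [[0.6, 0.1, 0.0],
     [0.0, 0.5, 0.1],
     [0.2, 0.0, 0.3]],
    [[0.1, 0.1, 0.1],
     [0.2, 0.1, 0.1],
     [0.1, 0.2, 0.2]],
])
# Summing over tokens and next states must give 1 for every current state.
assert np.allclose(T.sum(axis=(0, 2)), 1.0)


def update_belief(belief, token):
    """One filtering step: condition on the observed token, then renormalize."""
    unnorm = belief @ T[token]
    return unnorm / unnorm.sum()


belief = np.full(3, 1.0 / 3.0)  # start from a uniform prior over hidden states
trajectory = [belief]
for token in [0, 1, 1, 0]:      # an arbitrary observed token sequence
    belief = update_belief(belief, token)
    trajectory.append(belief)
trajectory = np.array(trajectory)  # shape (5, 3); each row is a point on the 2-simplex
```

Each row of `trajectory` is non-negative and sums to one, which is what places belief states inside a simplex.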

Research Overview

We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations. It combines three components:

  • Sparse Autoencoders (SAEs)
  • $k$-subspace clustering of SAE features
  • Simplex fitting using AANet

We first validate the pipeline on a transformer trained on a multipartite hidden Markov model, whose belief-state geometry is known. We then apply it to Gemma-2-9B, where we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$).
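The clustering step can be illustrated with a generic alternating k-subspaces routine. This is a sketch under stated assumptions, not the paper's implementation: the source does not say which variant is used, the function name `k_subspaces` and its parameters are hypothetical, and the synthetic unit vectors below merely stand in for SAE feature directions.

```python
import numpy as np


def k_subspaces(X, n_clusters, dim, n_iter=50, seed=0):
    """Alternate between fitting one linear subspace per cluster (via SVD)
    and reassigning each row of X to the subspace with the smallest residual."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(X))
    for _ in range(n_iter):
        bases = []
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) < dim:  # re-seed a cluster that has too few points
                members = X[rng.choice(len(X), size=dim, replace=False)]
            _, _, vt = np.linalg.svd(members, full_matrices=False)
            bases.append(vt[:dim])  # orthonormal rows spanning the subspace
        # Distance from every point to its projection onto each subspace.
        residuals = np.stack(
            [np.linalg.norm(X - (X @ B.T) @ B, axis=1) for B in bases], axis=1
        )
        new_labels = residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, bases


# Synthetic stand-in for feature directions: two 2-D planes in an 8-D space.
rng = np.random.default_rng(1)
planes = [np.linalg.qr(rng.normal(size=(8, 2)))[0].T for _ in range(2)]
X = np.vstack([rng.normal(size=(40, 2)) @ P for P in planes])
X /= np.linalg.norm(X, axis=1, keepdims=True)

labels, bases = k_subspaces(X, n_clusters=2, dim=2)
```

Like k-means, this procedure can converge to a local optimum, so in practice one would typically run it from several seeds and keep the assignment with the lowest total residual.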

Challenges in Interpretation

A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying any predictive signal beyond that of an individual feature. We therefore adopt barycentric prediction as our primary discriminating test.
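The mixture coordinates referred to above are barycentric coordinates: weights over the simplex vertices that sum to one and reconstruct a point. The sketch below recovers them by least squares for points lying in the affine hull of a set of vertices. It assumes the vertices are already known (in the paper they come from simplex fitting, which is not reproduced here), and the function name `barycentric_coords` is hypothetical.

```python
import numpy as np


def barycentric_coords(points, vertices):
    """Solve for weights W with W @ vertices ~= points and rows summing to 1.

    points: (N, D), vertices: (K, D) -> returns (N, K).
    The sum-to-one condition is added as an extra least-squares row, so it
    holds exactly only for points inside the affine hull of the vertices.
    """
    n_vertices = len(vertices)
    A = np.vstack([vertices.T, np.ones((1, n_vertices))])       # (D + 1, K)
    B = np.hstack([points, np.ones((len(points), 1))]).T        # (D + 1, N)
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    return W.T


# Synthetic check: a K = 3 simplex embedded in 5 dimensions.
rng = np.random.default_rng(0)
vertices = rng.normal(size=(3, 5))
true_weights = rng.dirichlet(np.ones(3), size=10)   # random interior points
points = true_weights @ vertices

recovered = barycentric_coords(points, vertices)
```

A barycentric prediction test would then ask whether these weights predict a target better than any single feature does; that comparison depends on details the source does not give and is not shown here.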

Findings and Results

Among the 13 priority clusters:

  • 3 clusters show a highly significant barycentric-prediction advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$).
  • 4 clusters show a significant advantage on simplex-interior samples.
  • Together, 5 distinct real clusters pass at least one of the two splits (near-vertex or simplex-interior), while no null cluster passes either.
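For readers unfamiliar with the test behind these p-values, the following sketch runs a one-sided Wilcoxon signed-rank test on paired per-sample errors using `scipy.stats.wilcoxon`. Both error arrays are synthetic, and the pairing of a barycentric predictor against a single-feature baseline is an assumption made for illustration; the source does not spell out the exact test setup.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Synthetic paired errors on the same 200 samples (illustration only).
err_single_feature = rng.normal(loc=1.5, scale=1.0, size=200)
err_barycentric = err_single_feature - rng.normal(loc=0.5, scale=1.0, size=200)

# H1: the baseline error exceeds the barycentric error on paired samples.
result = wilcoxon(err_single_feature, err_barycentric, alternative="greater")
print(f"statistic={result.statistic:.1f}, p={result.pvalue:.3g}")
```

The signed-rank test uses only the signs and ranks of the paired differences, so it does not require the errors to be normally distributed.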

Notably, one cluster, 768_596, also achieved the highest causal steering score in the dataset. It is the only case in which passive prediction and active intervention converge, which makes it the strongest candidate for genuine belief-like geometry in the representation space of Gemma-2-9B.

Conclusion

We present these findings as preliminary evidence that genuine belief-like geometry exists in the representation space of Gemma-2-9B. Further structured evaluation is needed to confirm this interpretation.


Lazarus Omolua