Discovering Belief Geometries Using Sparse Autoencoders

Finding Belief Geometries with Sparse Autoencoders

Summary of arXiv:2604.02685v1 (cross-listed)

Abstract

Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question.
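To make "belief state" concrete, here is a minimal sketch of the Bayesian update an ideal observer of a hidden Markov model would perform. The 3-state transition and emission matrices are hypothetical numbers chosen for illustration and are not taken from the paper. Every belief is a probability vector, so the whole trajectory stays inside a simplex whose vertices correspond to certainty about one latent state.

```python
def update_belief(belief, transition, emission, observation):
    """One Bayesian filtering step for a hidden Markov model.

    belief[i]        : P(state = i | observations so far)
    transition[i][j] : P(next state = j | state = i)
    emission[j][o]   : P(observation = o | state = j)
    """
    n = len(belief)
    # Predict: push the current belief through the transition matrix.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: reweight by the likelihood of the new observation.
    unnormalized = [predicted[j] * emission[j][observation] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Hypothetical 3-state HMM with 2 observation symbols (illustrative numbers only).
TRANSITION = [[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]]
EMISSION = [[0.9, 0.1],
            [0.5, 0.5],
            [0.1, 0.9]]

belief = [1 / 3, 1 / 3, 1 / 3]   # start at the centre of the simplex
trajectory = [belief]
for obs in [0, 0, 1, 0]:
    belief = update_belief(belief, TRANSITION, EMISSION, obs)
    trajectory.append(belief)
```

Each entry of `trajectory` is a point on the 2-simplex; repeated observations of symbol 0 pull the belief toward the vertex for state 0.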

Research Overview

We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations. The pipeline combines the following techniques:

Sparse Autoencoders (SAEs)
$k$-subspace clustering of SAE features
Simplex fitting using AANet

We first validate the pipeline on a transformer trained on a multipartite hidden Markov model, for which the belief-state geometry is known. We then apply it to Gemma-2-9B, where we identify 13 priority clusters that exhibit candidate simplex geometry ($K \geq 3$).
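The clustering step listed above can be illustrated with a generic alternating $k$-subspaces routine. This is a sketch under assumptions: the summary does not say which variant the authors use, so the function `k_subspaces`, its parameters, and the toy data (two planes in four dimensions standing in for SAE feature directions) are all hypothetical.

```python
import numpy as np


def k_subspaces(X, k, dim, n_iter=20, seed=0):
    """Cluster rows of X into k linear subspaces of dimension `dim` (through the origin).

    Returns (labels, bases, objective_history).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Start from random orthonormal bases, one (d, dim) matrix per cluster.
    bases = [np.linalg.qr(rng.standard_normal((d, dim)))[0] for _ in range(k)]
    history = []
    for _ in range(n_iter):
        # Assignment step: squared distance from each point to each subspace.
        residuals = np.stack(
            [np.sum((X - X @ B @ B.T) ** 2, axis=1) for B in bases], axis=1
        )
        labels = residuals.argmin(axis=1)
        history.append(float(residuals[np.arange(n), labels].sum()))
        # Update step: refit each basis to its members with a truncated SVD.
        for j in range(k):
            members = X[labels == j]
            if len(members) >= dim:  # otherwise keep the previous basis
                _, _, vt = np.linalg.svd(members, full_matrices=False)
                bases[j] = vt[:dim].T
    return labels, bases, history


# Toy "feature directions": two planes in R^4 (illustrative data only).
rng = np.random.default_rng(1)
plane_a = rng.standard_normal((50, 2)) @ np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
plane_b = rng.standard_normal((50, 2)) @ np.array([[0, 0, 1.0, 0], [0, 0, 0, 1.0]])
X = np.vstack([plane_a, plane_b])
labels, bases, history = k_subspaces(X, k=2, dim=2)
```

Both steps can only lower the total squared residual, so `history` is non-increasing. Like k-means, the routine can still stop at a local optimum, which is why practical use would typically involve several random restarts.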

Challenges in Interpretation

A central challenge is distinguishing genuine belief-state encoding from tiling artifacts. A set of latents can span a simplex-shaped subspace even when the mixture coordinates carry no predictive signal beyond that of any individual feature. To address this, we use barycentric prediction as our primary discriminating test.
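The mixture coordinates mentioned above are barycentric coordinates: weights over the simplex vertices that sum to one and reconstruct a point. The sketch below shows one way to compute them with least squares. It assumes the vertices are already known (in the pipeline they would come from the simplex-fitting step), and the function name and toy triangle are hypothetical.

```python
import numpy as np


def barycentric_coordinates(points, vertices):
    """Affine weights w with sum(w) = 1 such that points ≈ w @ vertices.

    points   : (n, d) array-like
    vertices : (K, d) array of simplex vertices
    Weights are not clipped, so points outside the simplex get negative entries.
    The sum-to-one constraint holds exactly only when a point lies in the
    affine hull of the vertices; otherwise it is satisfied in the
    least-squares sense.
    """
    points = np.atleast_2d(points)
    K = vertices.shape[0]
    # Append a row of ones so the linear system also encodes sum(w) = 1.
    A = np.vstack([vertices.T, np.ones((1, K))])              # (d + 1, K)
    b = np.vstack([points.T, np.ones((1, points.shape[0]))])  # (d + 1, n)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w.T                                                # (n, K)


# Hypothetical triangle (K = 3) in 2-D.
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights = barycentric_coordinates([[0.25, 0.25]], triangle)
# weights is approximately [[0.5, 0.25, 0.25]]
```

A barycentric predictor would then use these weights, rather than any single feature activation, as its input. How the paper turns them into predictions is not described in this summary.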

Findings and Results

Among the 13 priority clusters identified, we found:

3 clusters exhibited a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$).
4 clusters showed significant advantages on simplex-interior samples.
5 distinct real clusters passed at least one of our split tests, while no null cluster achieved this.
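For readers unfamiliar with the Wilcoxon test behind the reported $p$-values, the following is a rough sketch of the signed-rank variant for paired samples, using the normal approximation. The summary does not state how the authors pair their samples or which implementation they use; treating `x` and `y` as paired per-sample scores from two competing predictors is an assumption made here for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    |differences| receive average ranks. No tie or continuity correction is
    applied, so this is only a rough sketch for moderately large samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


# Toy paired scores (illustrative numbers only, not results from the paper).
scores_a = [float(v) for v in range(1, 21)]
scores_b = [0.0] * 20
w_plus, p_value = wilcoxon_signed_rank(scores_a, scores_b)
```

When one predictor wins on every sample, as in this toy data, `w_plus` reaches its maximum of $n(n+1)/2$ and the $p$-value becomes small.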

Notably, one cluster, 768_596, also achieved the highest causal steering score in the dataset. It is the only case in which passive prediction and active intervention converge, making it the strongest evidence in this study for belief-like geometry in the representation space of Gemma-2-9B.

Conclusion

Our findings serve as preliminary evidence supporting the existence of genuine belief-like geometries in the representation space of the Gemma-2-9B model. However, further structured evaluations are essential to confirm this interpretation and deepen our understanding of how large language models encode complex geometric structures in their internal representations.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
