Finding Belief Geometries with Sparse Autoencoders
arXiv:2604.02685v1 (cross-list)
Abstract
Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question.
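To make the notion of a belief state concrete: for a hidden Markov model, the belief after each observed token is the posterior distribution over latent states, so it always lies on the probability simplex whose vertices are the states. The following is a minimal sketch of that update. The three-state transition matrix `T`, the emission matrix `E`, and the function names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

# Hypothetical 3-state HMM; the numbers are chosen only for illustration.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])   # T[i, j] = P(next state j | current state i)
E = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])   # E[j, x] = P(token x | state j)

def update_belief(belief, token):
    """One Bayesian filtering step: predict with T, then condition on the token."""
    predicted = belief @ T                # distribution over the next state
    unnorm = predicted * E[:, token]      # weight by the likelihood of the token
    return unnorm / unnorm.sum()          # renormalise back onto the simplex

def belief_trajectory(tokens, prior=None):
    """Beliefs after each token, starting from a uniform prior by default."""
    n_states = T.shape[0]
    belief = np.full(n_states, 1.0 / n_states) if prior is None else prior
    trajectory = []
    for x in tokens:
        belief = update_belief(belief, x)
        trajectory.append(belief)
    return np.array(trajectory)

beliefs = belief_trajectory([0, 0, 1, 2, 2])
```

Each row of `beliefs` is non-negative and sums to one, so the trajectory traces a path inside a triangle (a 2-simplex). This is the kind of geometry that the prior work cited above reports in the residual stream.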
Research Overview
We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations. The pipeline combines three components:
- Sparse Autoencoders (SAEs)
- $k$-subspace clustering of SAE features
- Simplex fitting using AANet
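The clustering step can be illustrated with a generic k-subspaces routine, which alternates between assigning each vector to its nearest linear subspace and refitting each subspace by SVD. This is only a sketch under stated assumptions: the abstract does not specify the exact variant, subspace dimension, or initialisation used, and `k_subspaces`, `dim`, `n_iter`, and the toy data are hypothetical. In practice the rows of `X` might be SAE feature directions.

```python
import numpy as np

def fit_basis(points, dim):
    """Orthonormal basis (dim x d) of the best-fit linear subspace through the origin."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return vt[:dim]

def residuals(X, bases):
    """Squared distance from every row of X to every subspace; shape (n, k)."""
    total = (X ** 2).sum(axis=1)
    return np.stack([total - ((X @ B.T) ** 2).sum(axis=1) for B in bases], axis=1)

def k_subspaces(X, k, dim, n_iter=50, seed=0):
    """Alternate between subspace refitting and nearest-subspace assignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))      # random initial assignment
    for _ in range(n_iter):
        bases = []
        for c in range(k):
            members = X[labels == c]
            if len(members) < dim:             # reseed a cluster that became too small
                members = X[rng.choice(len(X), size=dim, replace=False)]
            bases.append(fit_basis(members, dim))
        new_labels = residuals(X, bases).argmin(axis=1)
        if np.array_equal(new_labels, labels): # converged: assignments stopped changing
            break
        labels = new_labels
    return labels, bases

# Toy data: two random 2-D planes embedded in 6 dimensions.
rng = np.random.default_rng(1)
plane_a = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 6))
plane_b = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 6))
X = np.vstack([plane_a, plane_b])
labels, bases = k_subspaces(X, k=2, dim=2)
```

Like k-means, this procedure can converge to a local optimum, so several restarts with different seeds are usually advisable.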
We validate the pipeline on a transformer trained on a multipartite hidden Markov model whose belief-state geometry is known. We then apply it to Gemma-2-9B, where we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$).
Challenges in Interpretation
A central challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace even when the mixture coordinates carry no predictive signal beyond that of the individual features. We therefore use barycentric prediction as the primary discriminating test.
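For intuition, the mixture (barycentric) coordinates of a point relative to a set of simplex vertices can be computed by least squares under a sum-to-one constraint. The sketch below is a plain linear-algebra stand-in, not the AANet fitting used in the pipeline; the function name `barycentric_coords` and the example vertices are assumptions made for illustration.

```python
import numpy as np

def barycentric_coords(x, vertices):
    """Affine weights w (summing to 1) such that w @ vertices is closest to x.

    vertices: (K, d) array of simplex corners. Non-negativity is not enforced,
    so points outside the simplex receive some negative weights.
    """
    anchor = vertices[-1]
    # Express x - anchor in terms of the K-1 edge vectors leaving the anchor.
    edges = (vertices[:-1] - anchor).T             # shape (d, K-1)
    w_head, *_ = np.linalg.lstsq(edges, x - anchor, rcond=None)
    return np.append(w_head, 1.0 - w_head.sum())   # last weight closes the sum to 1

# Hypothetical triangle in 3-D and a point built from a known mixture.
vertices = np.array([[0.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
true_w = np.array([0.2, 0.5, 0.3])
x = true_w @ vertices
w = barycentric_coords(x, vertices)
```

A barycentric prediction test would then ask whether coordinates like `w` predict the target better than any single feature does. The precise predictor and baseline are not described in this summary.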
Findings and Results
Among the 13 priority clusters identified, we found:
- 3 clusters exhibited a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$).
- 4 clusters showed significant advantages on simplex-interior samples.
- 5 distinct real clusters passed at least one of our split tests, whereas no null cluster passed any.
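The significance figures above come from a Wilcoxon test. As a rough illustration, the sketch below implements a one-sided paired Wilcoxon signed-rank test with a normal approximation and applies it to synthetic per-sample errors. The summary does not say which Wilcoxon variant was used, so the paired form, the function name, and the synthetic data are all assumptions.

```python
import math
import random

def wilcoxon_signed_rank(a, b):
    """One-sided paired test of H1: a tends to exceed b. Returns (W_plus, p)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1            # average 1-based rank for a run of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # no tie correction
    z = (w_plus - mean) / sd
    return w_plus, 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal p-value

# Synthetic example: a baseline error and a slightly lower "barycentric" error.
random.seed(0)
baseline_err = [random.gauss(1.0, 0.1) for _ in range(200)]
bary_err = [e - random.gauss(0.05, 0.02) for e in baseline_err]
w_plus, p_value = wilcoxon_signed_rank(baseline_err, bary_err)
```

Because the test uses only the ranks of the paired differences, it does not assume normally distributed errors, which makes it a reasonable choice for comparing per-sample prediction losses.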
Notably, one cluster, 768_596, also achieved the highest causal steering score in the dataset. It is the only case where passive prediction and active intervention converge, which makes it the strongest candidate for genuine belief-like geometry in the representation space of Gemma-2-9B.
Conclusion
These findings are preliminary evidence of genuine belief-like geometry in the representation space of Gemma-2-9B. Further structured evaluation is needed to confirm this interpretation and to clarify how large language models encode such geometric structure in their internal representations.
