MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Summary: arXiv:2604.03436v1 Announce Type: cross
Abstract: Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation.
We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE’s decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary. Reconstruction is easy whenever latent directions lie in a subspace spanned by other primary directions. The penalty therefore creates gradient pressure toward more mutually independent decoder directions that resist sparse meta-compression.
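The penalty described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the top-k meta SAE, the "fraction of squared norm explained" form of the penalty, the weight `lam`, and every function and variable name below are assumptions chosen for illustration only.

```python
import numpy as np

def unit_columns(W):
    """Scale each column of W to (approximately) unit L2 norm."""
    return W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-8)

def meta_reconstruct(D, M_enc, M_dec, k):
    """Sparsely reconstruct the columns of D with a top-k meta SAE (assumed form).

    D:     (d_model, n_primary) primary decoder directions, one per column
    M_enc: (n_meta, d_model)    hypothetical meta encoder
    M_dec: (d_model, n_meta)    hypothetical meta dictionary
    """
    acts = np.maximum(M_enc @ D, 0.0)  # ReLU meta activations, (n_meta, n_primary)
    n_meta = acts.shape[0]
    if k < n_meta:
        # Indices of the (n_meta - k) smallest activations in each column.
        drop = np.argpartition(acts, -k, axis=0)[:-k, :]
        np.put_along_axis(acts, drop, 0.0, axis=0)  # keep only the top k
    return M_dec @ acts

def decomposability_penalty(W_dec, M_enc, M_dec, k):
    """Mean fraction of each unit decoder direction explained by the meta SAE.

    Near 1: directions are easy to compress sparsely (heavily penalized).
    Near 0: directions resist sparse meta-compression.
    """
    D = unit_columns(W_dec)
    D_hat = meta_reconstruct(D, M_enc, M_dec, k)
    residual = np.sum((D - D_hat) ** 2, axis=0)  # per-column squared error
    return float(np.mean(np.clip(1.0 - residual, 0.0, 1.0)))

def meta_loss(W_dec, M_enc, M_dec, k):
    """Objective for the meta SAE: reconstruct the decoder columns well."""
    D = unit_columns(W_dec)
    D_hat = meta_reconstruct(D, M_enc, M_dec, k)
    return float(np.mean(np.sum((D - D_hat) ** 2, axis=0)))

def primary_loss(x, x_hat, W_dec, M_enc, M_dec, k, lam):
    """Objective for the primary SAE: reconstruction plus weighted penalty.

    Any sparsity term the primary SAE uses is omitted here for brevity.
    """
    recon = float(np.mean(np.sum((x - x_hat) ** 2, axis=-1)))
    return recon + lam * decomposability_penalty(W_dec, M_enc, M_dec, k)

# Toy case: the third primary direction lies in the span of the first two.
I4 = np.eye(4)
W_blend = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
penalty_k2 = decomposability_penalty(W_blend, I4, I4, k=2)
penalty_k1 = decomposability_penalty(W_blend, I4, I4, k=1)
```

In the toy case the meta dictionary is the standard basis. With two active meta latents every primary direction is reconstructed essentially exactly, so the penalty is close to its maximum. With one active meta latent the blended third direction is only half explained and the penalty drops. In an actual joint training loop the two objectives would be optimized with an autodiff framework, with the meta SAE minimizing `meta_loss` while the primary SAE minimizes `primary_loss`.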
Key Findings
- On GPT-2 large (layer 20), the selected configuration reduces mean $|\varphi|$ by 7.5% relative to an identical solo SAE trained on the same data.
- Automated interpretability (fuzzing) scores improve by 7.6%, providing external validation of the atomicity gain independent of the training and co-occurrence metrics.
- Reconstruction overhead is modest.
- Results on Gemma 2 9B are directional, suggesting potential for broader applicability.
- On not-fully-converged SAEs, the same parameterization yields the best result, a $+8.6\%$ $\Delta$Fuzz (change in fuzzing score), a tentative sign that the approach may carry over to larger models.
Qualitative Analysis
Qualitative analysis confirms that features firing on polysemantic tokens are split into semantically distinct sub-features, each specializing in a distinct representational subspace. This split makes the resulting features easier to interpret and better suited to applications that require precision and reliability.
In summary, MetaSAEs add a decomposability penalty to SAE training through a jointly trained meta SAE, which yields more atomic and coherent latent representations. The reported gains in mean $|\varphi|$ and fuzzing scores support their use in safety-relevant applications. Future research may explore how the approach scales across model architectures and datasets.
