Sparse Auto-Encoders and Holism about Large Language Models
Summary: arXiv:2603.26207v1 Announce Type: cross
Abstract: Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b).
It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation.
In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).
Exploring the Holistic Interpretation of LLMs
The notion that LLMs operate under a holistic framework stems from their reliance on distributional semantics, which posits that the meaning of words is derived from their contextual usage. This framework suggests that words do not have isolated meanings but rather contribute to a network of meanings that are interdependent. Key arguments supporting this view include:
- The interconnectedness of word usage in various contexts enhances the understanding of meaning.
- LLMs generate coherent text by leveraging a holistic understanding of language rather than isolated word definitions.
- The patterns identified by LLMs reflect complex relationships between linguistic expressions.
The Role of Sparse Auto-Encoders
Recent advancements in the field of mechanistic interpretability have introduced sparse auto-encoders, which uncover interpretable latent features in high-dimensional spaces. This research indicates that:
- Sparse auto-encoders can effectively identify and isolate specific features that contribute to meaning.
- These features offer a decompositional perspective on meaning, suggesting that words can be understood through their individual components.
- The findings challenge the holistic view by presenting evidence that meaning can arise from distinct features rather than a unified whole.
Responding to the Challenge
In response to the challenge posed by the discovery of interpretable features, it is essential to delve deeper into their nature. The implications of these features could support a hybrid model of meaning that combines both holistic and decompositional perspectives. By examining:
- The countability of features and their relationship to linguistic expressions.
- The interactions between features and how they contribute to overall meaning.
- The possibility of integrating both perspectives to enrich our understanding of LLMs.
Conclusion
Ultimately, while recent findings challenge the traditional holistic interpretation of LLMs, they also pave the way for a more nuanced understanding that encompasses both holistic and decompositional aspects. As researchers continue to explore these dimensions, the ongoing dialogue will significantly shape the future of LLM interpretability and the nature of meaning in language.
