LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering
arXiv:2604.03532v1
Recent advancements in large language models (LLMs) have showcased their remarkable multilingual capabilities. However, a persistent challenge remains in reliably controlling the language of their outputs. The emerging method of representation-level steering attempts to tackle this issue by incorporating language-specific vectors into model activations during inference. This approach, however, often depends on the availability of multilingual or parallel datasets, which can be costly and time-consuming to obtain.
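To make the idea of representation-level steering concrete, the following is a minimal sketch of adding a language-specific vector to a single activation. Plain Python lists stand in for model tensors, and the function name `steer` and the scaling factor `alpha` are illustrative assumptions rather than anything specified in the paper.

```python
def steer(hidden, direction, alpha=1.0):
    """Return hidden + alpha * direction for one activation vector.

    `hidden` is a toy stand-in for a residual-stream activation and
    `direction` for a language-specific steering vector (both hypothetical).
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy example: a 3-dimensional activation pushed along one direction.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]
steered = steer(hidden, direction, alpha=2.0)
print(steered)
```

In a real model the same addition would be applied to the activations of a chosen layer at every generated position; which layer and what scale to use are choices the sketch leaves open.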
The Role of Sparse Autoencoders (SAEs)
Sparse autoencoders (SAEs) offer a promising alternative: they decompose residual activations into sparse, interpretable feature directions, which provide a basis for identifying language-specific steering directions. However, existing SAE-based methods share the same limitation, since they also depend on multilingual or parallel data.
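The decomposition can be illustrated with a toy SAE. The sketch below assumes a plain ReLU encoder and a linear decoder without a decoder bias, which is one common SAE form and not necessarily the architecture used in the paper; the weights are made-up values chosen so the arithmetic is easy to follow.

```python
def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(w_enc @ x + b_enc), one value per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(features, w_dec):
    """Reconstruction: decoder directions weighted by feature activations."""
    dim = len(w_dec[0])
    return [
        sum(f * row[j] for f, row in zip(features, w_dec))
        for j in range(dim)
    ]


# Toy SAE: 2-dimensional activation, 3 features (all values hypothetical).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]

x = [2.0, 0.0]
features = sae_encode(x, w_enc, b_enc)  # only feature 0 is active
recon = sae_decode(features, w_dec)
print(features, recon)
```

Each row of `w_dec` is one feature direction in activation space, which is what makes a small set of features usable as steering directions.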
Introducing LangFIR
To address these challenges, we introduce LangFIR (Language Feature Identification via Random-token Filtering), a method that discovers language-specific SAE features using only a small amount of monolingual data together with random-token sequences. The key observation is that many SAE features activated by target-language inputs do not actually encode language identity. Random-token sequences surface these language-agnostic features, so LangFIR can filter them out and isolate a sparse set of language-specific features.
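The filtering idea can be sketched as a set difference. The code below is an illustration under stated assumptions, not the paper's procedure: a feature counts as "active" if it fires on at least half of the inputs, and the names `active_features`, `language_specific`, `threshold`, and `min_rate` are hypothetical. The exact selection criterion used by LangFIR is not given in this summary.

```python
def active_features(activations, threshold=0.0, min_rate=0.5):
    """Indices of features firing (> threshold) on at least min_rate of inputs.

    `activations` is a list of per-input feature-activation lists.
    """
    n_inputs = len(activations)
    n_features = len(activations[0])
    active = set()
    for j in range(n_features):
        fired = sum(1 for row in activations if row[j] > threshold)
        if fired / n_inputs >= min_rate:
            active.add(j)
    return active


def language_specific(target_acts, random_acts, **kwargs):
    """Features active on target-language text but not on random tokens."""
    return active_features(target_acts, **kwargs) - active_features(
        random_acts, **kwargs
    )


# Toy activations for 4 features over 3 inputs each (made-up numbers).
target_acts = [[1.2, 0.0, 0.8, 0.0], [0.9, 0.0, 0.5, 0.1], [1.1, 0.3, 0.0, 0.0]]
random_acts = [[0.0, 0.0, 0.7, 0.0], [0.0, 0.0, 0.6, 0.0], [0.0, 0.2, 0.9, 0.0]]

selected = language_specific(target_acts, random_acts)
print(selected)
```

Here features 0 and 2 both fire on the target-language inputs, but feature 2 also fires on random tokens, so only feature 0 survives the filter.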
Key Findings
The features identified by LangFIR are extremely sparse and highly selective for their target language. They are also causally important: directional ablation of these features increases cross-entropy loss only for the corresponding language. This indicates that the features play a direct role in generating that language.
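Directional ablation is commonly implemented by projecting a direction out of the activation, i.e. replacing x with x - (x · d̂) d̂ for a unit vector d̂. The sketch below shows that projection on a toy vector; it assumes this standard formulation and does not reproduce the paper's evaluation, which measures the resulting change in cross-entropy loss.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along `direction` (orthogonal projection)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy example: remove the second coordinate axis from the activation.
hidden = [3.0, 4.0]
direction = [0.0, 2.0]
ablated = ablate_direction(hidden, direction)
print(ablated)
```

After ablation the activation has no component along the chosen direction, while everything orthogonal to it is left unchanged.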
Performance Evaluation
When these language-specific features are used to construct steering vectors for multilingual generation control, LangFIR achieves the best average accuracy and BLEU results across three models (Gemma 3 1B, Gemma 3 4B, and Llama 3.1 8B), three datasets, and twelve target languages. It outperforms the strongest monolingual baseline by a significant margin and also surpasses existing methods that rely on parallel data.
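One simple way to turn a set of selected features into a steering vector is to average their decoder directions. This is an assumption made for illustration; the summary does not state how LangFIR combines features, and `steering_vector` and the toy decoder matrix are hypothetical.

```python
def steering_vector(w_dec, feature_ids):
    """Mean of the decoder directions of the selected features."""
    if not feature_ids:
        raise ValueError("need at least one feature")
    dim = len(w_dec[0])
    return [
        sum(w_dec[i][j] for i in feature_ids) / len(feature_ids)
        for j in range(dim)
    ]


# Toy decoder with 3 features in a 2-dimensional activation space.
w_dec = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0]]
vec = steering_vector(w_dec, [0, 2])
print(vec)
```

The resulting vector lives in the same space as the model's activations, so it can be added to them at inference time with a chosen scale.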
Conclusion
Our findings suggest that language identity in multilingual LLMs is localized in a sparse set of feature directions that can be discovered from monolingual data alone. This improves our understanding of language representation in LLMs and opens a path to more data-efficient language steering.
Availability of Resources
The code for LangFIR is publicly available in the LangFIR Code Repository.
