LangFIR: Efficient Language Steering Using Monolingual Data


LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering

Summary: arXiv:2604.03532v1

Large language models (LLMs) show strong multilingual capabilities, but reliably controlling the language of their outputs remains a persistent challenge. Representation-level steering addresses this by adding language-specific vectors to model activations at inference time. Identifying those vectors, however, usually depends on multilingual or parallel data, which can be costly and time-consuming to obtain.
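To make the steering step concrete, here is a minimal sketch of adding a language vector to activations. The function name `steer`, the scale `alpha`, and the toy numbers are illustrative assumptions, not taken from the paper; in a real model the addition would happen inside a chosen layer during the forward pass.

```python
def steer(hidden, direction, alpha):
    """Add alpha * direction to every token's activation vector.

    hidden:    list of per-token activation vectors (toy stand-in for a layer output)
    direction: hypothetical language-specific steering vector
    alpha:     steering strength (an assumed hyperparameter)
    """
    return [[h + alpha * d for h, d in zip(token, direction)] for token in hidden]


# Toy activations: 2 tokens, hidden width 3.
hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
direction = [1.0, 0.0, -1.0]

steered = steer(hidden, direction, alpha=2.0)
print(steered)
```

The same vector is added at every token position here; which layer to steer and how large `alpha` should be are choices the sketch leaves open.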

The Role of Sparse Autoencoders (SAEs)

Sparse autoencoders (SAEs) decompose residual activations into sparse, interpretable feature directions, which gives a natural basis for identifying language-specific steering directions. Existing SAE-based methods, however, run into the same limitation: they also need diverse multilingual data.
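As a rough illustration of that decomposition, the sketch below implements a tiny ReLU sparse autoencoder in plain Python. The weights are hand-picked toy values and the names (`sae_encode`, `sae_decode`, `w_enc`, `w_dec`) are assumptions for this example; the SAEs used in the paper are trained on model activations and are far larger.

```python
def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(w_enc @ x + b_enc)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(features, w_dec, b_dec):
    """Reconstruction: b_dec + sum_i features[i] * w_dec[i].

    Row w_dec[i] is the direction that feature i writes into the residual space.
    """
    out = list(b_dec)
    for f, direction in zip(features, w_dec):
        for j, d in enumerate(direction):
            out[j] += f * d
    return out


# Toy SAE: 2-dimensional activations, 4 features (hand-picked, not trained).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_dec = [0.0, 0.0]

x = [1.0, -2.0]
features = sae_encode(x, w_enc, b_enc)
recon = sae_decode(features, w_dec, b_dec)
print(features, recon)
```

Only two of the four features fire for this input, and each active feature contributes one decoder direction to the reconstruction. That per-feature direction is what makes SAE features usable as candidate steering directions.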

Introducing LangFIR

To address these challenges, we introduce LangFIR (Language Feature Identification via Random-token Filtering), a method that discovers language-specific SAE features using only a small amount of monolingual data together with random-token sequences. It rests on the observation that many SAE features activated by target-language inputs do not actually encode language identity. Random-token sequences surface these language-agnostic features, so they can be filtered out to isolate a sparse set of language-specific features.
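The filtering idea can be sketched as follows. This is not the paper's exact criterion, which the summary does not specify: the activation-rate statistic, the thresholds `min_target_rate` and `max_random_rate`, and the function names are all assumptions chosen for illustration.

```python
def activation_rate(feature_acts, eps=0.0):
    """Fraction of tokens on which each feature fires (activation > eps)."""
    n_tokens = len(feature_acts)
    n_features = len(feature_acts[0])
    return [
        sum(1 for token in feature_acts if token[i] > eps) / n_tokens
        for i in range(n_features)
    ]


def filter_language_features(target_acts, random_acts,
                             min_target_rate=0.5, max_random_rate=0.1):
    """Keep features that fire often on target-language text but rarely on
    random-token sequences. Both thresholds are assumed values."""
    target_rate = activation_rate(target_acts)
    random_rate = activation_rate(random_acts)
    return [
        i for i, (t, r) in enumerate(zip(target_rate, random_rate))
        if t >= min_target_rate and r <= max_random_rate
    ]


# Toy SAE activations: 4 tokens x 3 features.
# Feature 0 fires on everything (language-agnostic), feature 1 fires only on
# target-language tokens, feature 2 fires rarely on both.
target_acts = [[1.2, 0.8, 0.0], [0.9, 1.1, 0.0], [1.0, 0.7, 0.3], [1.1, 0.0, 0.0]]
random_acts = [[1.0, 0.0, 0.0], [0.8, 0.0, 0.0], [1.3, 0.0, 0.2], [0.0, 0.0, 0.0]]

selected = filter_language_features(target_acts, random_acts)
print(selected)
```

Feature 0 is dropped because it also fires on random tokens, which is the role the random-token sequences play: they expose features that respond to generic input rather than to the language itself.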

Key Findings

The features identified by LangFIR are extremely sparse and highly selective for their target language. They are also causally important: directional ablation of these features increases cross-entropy loss only for the corresponding language, which indicates that they play a critical role in generating that language.
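Directional ablation is commonly implemented by projecting a direction out of the activations. The sketch below shows that projection on a toy vector; it assumes this standard formulation, and the summary does not confirm the exact procedure used in the paper.

```python
import math


def ablate_direction(x, direction):
    """Remove the component of x along `direction`: x - (x . u) u, u = unit vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(xi * ui for xi, ui in zip(x, unit))
    return [xi - coeff * ui for xi, ui in zip(x, unit)]


x = [3.0, 4.0]          # toy activation
direction = [0.0, 2.0]  # hypothetical language-specific feature direction

ablated = ablate_direction(x, direction)
print(ablated)
```

After ablation the activation has no component along the feature direction. Comparing cross-entropy loss before and after this edit, per language, is the kind of test the paragraph above describes.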

Performance Evaluation

When these language-specific features are used to construct steering vectors for multilingual generation control, LangFIR records the highest average accuracy and BLEU scores across three models (Gemma 3 1B, Gemma 3 4B, and Llama 3.1 8B), three datasets, and twelve target languages. It outperforms the strongest monolingual baseline by a significant margin and also surpasses existing methods that rely on parallel data.
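One simple way to turn selected features into a steering vector is to sum their SAE decoder directions and normalize the result. The sketch below does exactly that; the unweighted sum, the normalization, and the name `build_steering_vector` are assumptions, since the summary does not say how the paper weights or combines features.

```python
import math


def build_steering_vector(w_dec, feature_ids):
    """Sum the decoder directions of the selected features and return a unit vector.

    w_dec[i] is the decoder direction of SAE feature i (toy values below).
    """
    width = len(w_dec[0])
    total = [0.0] * width
    for i in feature_ids:
        for j, d in enumerate(w_dec[i]):
            total[j] += d
    norm = math.sqrt(sum(v * v for v in total))
    if norm == 0.0:
        raise ValueError("selected features give a zero vector")
    return [v / norm for v in total]


# Toy decoder with 3 features in a 3-dimensional residual space.
w_dec = [[1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]
language_features = [1, 2]  # hypothetical indices of language-specific features

steering_vector = build_steering_vector(w_dec, language_features)
print(steering_vector)
```

The resulting unit vector would then be scaled and added to the model's activations during generation.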

Conclusion

Our findings suggest that language identity in multilingual LLMs is localized in a sparse set of feature directions that can be discovered using monolingual data. This improves the understanding of language representation in LLMs and opens a path to more data-efficient language steering.

Availability of Resources

The code for LangFIR is publicly available in the LangFIR Code Repository.



Lazarus Omolua
https://richlyai.com/blog
