LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety
arXiv:2604.12710v1
Abstract
Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity.
Introduction
Large language models (LLMs) have transformed natural language processing, but their safety and reliability remain a concern, particularly across languages. Models that refuse harmful requests reliably in high-resource languages are often much easier to attack in low-resource ones. This disparity motivates an investigation into its underlying cause and into ways to close the gap.
The Semantic Bottleneck
We empirically identify what we call the “semantic bottleneck”: an intermediate layer within the model where the geometry of the representations is governed primarily by shared semantic content rather than by the language of the input. At this layer, inputs that mean the same thing are organized by meaning rather than by language. This points to the source of the safety gap: the model's semantic understanding is language-agnostic, while its safety alignment is biased toward high-resource languages.
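One way to make the idea of a semantic bottleneck concrete is to score each layer by how strongly its representations group by meaning versus by language, then pick the layer where meaning dominates most. The sketch below is illustrative only: the source does not state which metric was used, so `separation_score` and `find_semantic_bottleneck` are hypothetical helpers, and the toy vectors stand in for real hidden states of parallel sentences.

```python
import numpy as np

def separation_score(reps, labels):
    """Mean between-group distance divided by mean within-group distance.

    Assumes every label value occurs at least twice.
    """
    reps = np.asarray(reps, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all representations.
    dists = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = dists[same & off_diag].mean()
    between = dists[~same].mean()
    return between / within

def find_semantic_bottleneck(layer_reps, semantic_ids, language_ids):
    """Return the index of the layer where meaning dominates language most."""
    scores = [
        separation_score(reps, semantic_ids) - separation_score(reps, language_ids)
        for reps in layer_reps
    ]
    return int(np.argmax(scores)), scores

# Toy data (assumption): two sentences, each in two languages.
semantic_ids = [0, 0, 1, 1]
language_ids = [0, 1, 0, 1]
meaning = np.array(semantic_ids, dtype=float)
language = np.array(language_ids, dtype=float)

# Early and late layers spread points mainly along the language axis;
# the middle layer spreads them mainly along the meaning axis.
language_dominated = np.stack([10 * language, meaning], axis=1)
meaning_dominated = np.stack([language, 10 * meaning], axis=1)
layer_reps = [language_dominated, meaning_dominated, language_dominated]

best_layer, scores = find_semantic_bottleneck(layer_reps, semantic_ids, language_ids)
print(best_layer, scores)
```

On real models, `layer_reps` would hold one array of pooled hidden states per layer, computed over translations of the same prompts.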
Language-Agnostic Semantic Alignment (LASA)
To address this mismatch between language-agnostic understanding and language-dominant alignment, we propose Language-Agnostic Semantic Alignment (LASA). LASA anchors safety alignment directly in the semantic bottleneck of the LLM. Because representations at this layer are organized by meaning rather than by language, a safety mechanism grounded there depends less on the language of the input and more on its underlying semantics.
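The source does not describe how LASA performs this anchoring, so the following is not the LASA procedure. It is a minimal sketch of the intuition behind it, using a plain logistic-regression probe as a stand-in: if bottleneck representations encode meaning on directions shared across languages, a safety signal fitted on one language should transfer to another. All names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def make_bottleneck_reps(rng, n, language_offset):
    """Synthetic bottleneck vectors: axis 0 carries meaning, axis 1 carries language."""
    labels = rng.integers(0, 2, size=n)            # 1 = harmful, 0 = benign
    semantic = np.where(labels == 1, 2.0, -2.0)    # shared across languages
    reps = np.stack([semantic, np.full(n, language_offset)], axis=1)
    reps = reps + rng.normal(scale=0.3, size=reps.shape)
    return reps, labels

def train_probe(reps, labels, lr=0.5, steps=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    weights = np.zeros(reps.shape[1])
    bias = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(reps @ weights + bias)))
        error = probs - labels
        weights -= lr * reps.T @ error / len(labels)
        bias -= lr * error.mean()
    return weights, bias

def probe_accuracy(weights, bias, reps, labels):
    """Fraction of examples whose predicted label matches the true label."""
    predictions = (reps @ weights + bias) > 0
    return float((predictions == labels).mean())

rng = np.random.default_rng(0)
# Train on a "high-resource" language, evaluate on a different one.
train_reps, train_labels = make_bottleneck_reps(rng, 200, language_offset=0.0)
test_reps, test_labels = make_bottleneck_reps(rng, 200, language_offset=1.0)

weights, bias = train_probe(train_reps, train_labels)
transfer_accuracy = probe_accuracy(weights, bias, test_reps, test_labels)
print(f"cross-language probe accuracy: {transfer_accuracy:.2f}")
```

The probe transfers here only because the toy data places harmfulness on a direction that is the same in both languages, which is exactly the property the semantic bottleneck is claimed to have.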
Experimental Results
Our experimental findings demonstrate the effectiveness of the LASA framework in enhancing safety performance across various languages. Key results include:
- Average attack success rate (ASR) on the LLaMA-3.1-8B-Instruct model decreased from 24.7% to 2.8%.
- ASR for Qwen2.5 and Qwen3 Instruct models (7B-32B) remained consistently low, around 3-4%.
These results indicate that LASA reduces the vulnerabilities seen in low-resource languages and holds attack success rates low across model families and sizes. They support the view that grounding safety alignment in language-agnostic semantic representations is a promising direction for LLM safety.
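For readers unfamiliar with the metric, the sketch below shows how an average attack success rate can be computed from per-language outcomes, and what the reported LLaMA-3.1-8B-Instruct numbers imply as a relative reduction. The per-language outcomes are made up for illustration, and the unweighted average over languages is an assumption, since the source does not say how the average is taken.

```python
def attack_success_rate(outcomes):
    """Percentage of attack attempts judged successful (True)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def average_asr(outcomes_by_language):
    """Unweighted mean of the per-language ASR values (assumed averaging scheme)."""
    rates = [attack_success_rate(o) for o in outcomes_by_language.values()]
    return sum(rates) / len(rates)

def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

# Hypothetical judged outcomes: True means the attack succeeded.
toy_outcomes = {
    "en": [True] * 1 + [False] * 9,   # 10% ASR
    "sw": [True] * 4 + [False] * 6,   # 40% ASR
}
toy_average = average_asr(toy_outcomes)

# Figures reported for LLaMA-3.1-8B-Instruct: 24.7% before, 2.8% after.
reported_drop = relative_reduction(24.7, 2.8)
print(f"toy average ASR: {toy_average:.1f}%")
print(f"relative reduction from reported figures: {reported_drop:.1f}%")
```

The drop from 24.7% to 2.8% is 21.9 percentage points, or roughly an 89% relative reduction.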
Conclusion
In summary, our analysis and the proposed LASA framework provide a representation-level perspective on LLM safety. The findings suggest that effective safety alignment should be grounded in language-agnostic semantic understanding rather than in language-specific surface behavior. As LLMs continue to evolve, frameworks like LASA could support more equitable and robust models that behave safely across languages.
Future Work
Further research will explore how LASA scales to a wider range of languages and how it integrates with different LLM architectures. The implications may extend beyond safety, potentially informing how LLMs understand and generate language in multilingual settings.
