Exploring Cultural Variations in Moral Judgments with Large Language Models
arXiv:2506.12433v2
Abstract
Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center’s Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct).
Methodology
Using log-probability-based moral justifiability scores, we correlate each model’s outputs with survey data covering a broad set of ethical topics. Our research aims to understand the extent to which these models reflect human moral judgments across different cultural contexts.
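As a rough illustration of the scoring-and-correlation step described above, the following Python sketch assumes that a model has already assigned log-probabilities to a "justifiable" and an "unjustifiable" completion of a prompt for each topic. The function names, the difference-of-log-probabilities score, the assumed survey scale, and all numbers are assumptions made for illustration; this summary does not give the paper's exact prompts or scoring formula.

```python
import math

def justifiability_score(logp_justifiable, logp_unjustifiable):
    """Hypothetical score: positive when the model prefers the 'justifiable' completion."""
    return logp_justifiable - logp_unjustifiable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only (NOT data from the paper):
# topic -> (log-prob of "justifiable" completion, log-prob of "unjustifiable" completion)
model_logps = {
    "topic_a": (-1.2, -2.5),
    "topic_b": (-3.0, -1.1),
    "topic_c": (-2.0, -2.1),
}
# Placeholder survey scores, assumed to be rescaled to the range [-1, 1].
survey_scores = {"topic_a": 0.6, "topic_b": -0.7, "topic_c": 0.1}

topics = sorted(model_logps)
model_scores = [justifiability_score(*model_logps[t]) for t in topics]
human_scores = [survey_scores[t] for t in topics]
r = pearson(model_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

In a real pipeline the log-probabilities would come from the language model under evaluation and the survey scores from WVS or PEW responses; only the final correlation step is shown here.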
Key Findings
- Earlier or smaller models often produce near-zero or negative correlations with human judgments.
- In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, indicating a better reflection of real-world moral attitudes.
- A detailed regional analysis reveals that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions.
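The regional comparison in the last finding can be pictured as a per-group correlation. The sketch below is a minimal illustration, assuming each record carries a hypothetical region label, a model score, and a survey score; the labels, grouping, and numbers are placeholders, not the paper's data or its regional taxonomy.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient; assumes non-constant, equal-length inputs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_by_region(records):
    """Group (region, model_score, survey_score) records and correlate within each region."""
    grouped = defaultdict(lambda: ([], []))
    for region, model_score, survey_score in records:
        grouped[region][0].append(model_score)
        grouped[region][1].append(survey_score)
    return {region: pearson(xs, ys) for region, (xs, ys) in grouped.items()}

# Placeholder records only (NOT data from the paper).
records = [
    ("region_1", 0.9, 0.8), ("region_1", -0.4, -0.5), ("region_1", 0.2, 0.1),
    ("region_2", 0.9, -0.2), ("region_2", -0.4, 0.3), ("region_2", 0.2, 0.4),
]

result = correlation_by_region(records)
for region, r in sorted(result.items()):
    print(f"{region}: r = {r:.3f}")
```

Comparing the per-region coefficients side by side is one simple way to surface the kind of gap the finding describes, where a model tracks survey responses closely in some regions and poorly in others.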
Discussion
While scaling model size and employing instruction tuning improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. This disparity raises important questions about training data diversity, potential biases, and the implications of these models for information retrieval.
Implications for Future Research
Our findings suggest several areas for future research and development:
- Bias Analysis: Further investigation into the biases inherent in LLMs is necessary to ensure that they do not perpetuate harmful stereotypes or cultural insensitivity.
- Training Data Diversity: Increasing the diversity of training datasets can enhance the models’ ability to understand and reflect varied cultural moral values.
- Improving Cultural Sensitivity: Strategies should be developed to improve the cultural sensitivity of LLMs, making them applicable and useful across different cultural contexts.
Conclusion
In summary, Large Language Models exhibit varying degrees of alignment with cultural moral norms and align most closely with those of W.E.I.R.D. nations. Our research underscores the importance of ongoing efforts to build models that are more attuned to global moral perspectives, fostering better understanding and interaction across cultures.
