Cultural Differences in Moral Judgments via Large Language Models

Exploring Cultural Variations in Moral Judgments with Large Language Models

Summary of arXiv:2506.12433v2

Abstract

Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center’s Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct).

Methodology

Using log-probability-based moral justifiability scores, we correlate each model’s outputs with survey data covering a broad set of ethical topics. Our research aims to understand the extent to which these models reflect human moral judgments across different cultural contexts.
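To make the scoring step concrete, here is a minimal sketch of how a log-probability-based justifiability score could be computed and then correlated with survey data. It assumes the score is the mean gap between the model's log-probability for a "justifiable" wording and for the opposite wording across several prompt templates; the summary above does not spell out the exact formula, so treat that as an assumption. The function names and all numbers are hypothetical, and the log-probabilities are typed in by hand rather than read from a real model.

```python
import math
from statistics import mean


def justifiability_score(logp_moral, logp_immoral):
    """Mean log-probability gap across paired prompt templates (assumed formula).

    logp_moral[i] is the log-probability of the "justifiable" wording for
    template i; logp_immoral[i] is the same for the opposite wording.
    """
    if len(logp_moral) != len(logp_immoral) or not logp_moral:
        raise ValueError("need two non-empty lists of equal length")
    return mean(m - n for m, n in zip(logp_moral, logp_immoral))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log-probabilities for three topics, two templates each.
model_scores = [
    justifiability_score([-2.0, -2.5], [-1.0, -1.5]),  # leans "not justifiable"
    justifiability_score([-1.5, -1.0], [-1.5, -1.0]),  # neutral
    justifiability_score([-1.0, -0.5], [-2.0, -2.5]),  # leans "justifiable"
]

# Hypothetical survey ratings for the same three topics, rescaled to [-1, 1].
survey_scores = [-0.8, 0.1, 0.6]

r = pearson(model_scores, survey_scores)
print(f"model scores: {model_scores}, correlation with survey: {r:.3f}")
```

In a real pipeline the hand-typed log-probabilities would come from the language model under test, and the survey values from WVS or PEW responses for a given country.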

Key Findings

Earlier or smaller models often produce near-zero or negative correlations with human judgments.
In contrast, advanced instruction-tuned models achieve substantially higher positive correlations, indicating a better reflection of real-world moral attitudes.
A detailed regional analysis reveals that models align better with Western, Educated, Industrialized, Rich, and Democratic (W.E.I.R.D.) nations than with other regions.

Discussion

While scaling model size and employing instruction tuning improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. This disparity raises important questions about training data diversity, potential biases, and the information retrieval implications of these models.

Implications for Future Research

Our findings suggest several areas for future research and development:

Bias Analysis: Further investigation into the biases inherent in LLMs is necessary to ensure that they do not perpetuate harmful stereotypes or cultural insensitivity.
Training Data Diversity: Increasing the diversity of training datasets can enhance the models’ ability to understand and reflect varied cultural moral values.
Improving Cultural Sensitivity: Strategies should be developed to improve the cultural sensitivity of LLMs, making them applicable and useful across different cultural contexts.

Conclusion

In summary, Large Language Models exhibit varying degrees of alignment with cultural moral norms, and they align most closely with W.E.I.R.D. nations. Our research underscores the importance of ongoing efforts to build models that are more attuned to global moral perspectives, fostering better understanding and interaction across cultures.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
