Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation
Published on: October 2023
Source: arXiv:2604.08797v1 (cross-listed)
Abstract
Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. In this article, we introduce multilingual story moral generation as a novel, culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization.
Introduction
The ability of large language models (LLMs) to understand and generate text has advanced significantly in recent years. However, ensuring that these models align with diverse cultural perspectives remains an open challenge. This study investigates how well these models generate moral interpretations of stories that match the interpretations of human readers in different cultural contexts.
Methodology
To explore multilingual story moral generation, we compiled a dataset of human-written morals from stories across 14 different language-culture pairs. Our methodology encompasses the following steps:
- Dataset Creation: We gathered a diverse set of stories that represent various cultures and languages, ensuring a balanced representation of moral values.
- Model Evaluation: We used frontier LLMs such as GPT-4o and Gemini to generate a moral for each input story.
- Comparison Metrics: We employed semantic similarity measures, conducted human preference surveys, and categorized values to assess the models’ outputs against human interpretations.
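As an illustration of the semantic similarity comparison, the following is a minimal sketch and not the procedure used in the study. It assumes that each moral is mapped to a vector and that a model-generated moral is scored by its mean cosine similarity to the human-written morals for the same story. The function names are hypothetical, and `bow_vector` is a toy word-count stand-in for a real multilingual sentence encoder, which this summary does not specify.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for a sentence embedding: lowercase word counts.

    Assumption: a real evaluation would replace this with a multilingual
    sentence encoder; word counts are used here only to stay self-contained.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity_to_humans(model_moral, human_morals):
    """Average similarity of one model moral to all human morals for a story."""
    if not human_morals:
        raise ValueError("at least one human-written moral is required")
    model_vec = bow_vector(model_moral)
    scores = [cosine_similarity(model_vec, bow_vector(h)) for h in human_morals]
    return sum(scores) / len(scores)
```

Averaging over all human morals for a story, rather than comparing against a single reference, reflects the fact that one story can support several human interpretations.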
Findings
Our analysis reveals several important insights regarding the performance of contemporary LLMs in generating culturally relevant morals:
- Morals generated by models such as GPT-4o and Gemini are semantically similar to human-written morals, indicating that the models capture the central themes of the narratives.
- In the preference survey, human evaluators preferred the model-generated morals, suggesting that the outputs read as acceptable interpretations.
- Despite these strengths, the model outputs showed markedly less cross-linguistic variation than the human-written morals, concentrating on a narrower set of widely shared values rather than the diversity found in human interpretations.
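One way to make "cross-linguistic variation" concrete is sketched below. This is an assumption for illustration, not the measure reported in the study: each moral is taken to carry a single value label, each language yields a distribution over labels, and variation is the mean Jensen-Shannon divergence between every pair of languages. The function names, language keys, and value labels are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def distribution(labels, categories):
    """Relative frequency of each value category in one language's morals."""
    counts = Counter(labels)
    total = len(labels)  # assumes a non-empty label list per language
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_pairwise_divergence(labels_by_language):
    """Average divergence between the value distributions of all language pairs."""
    if len(labels_by_language) < 2:
        raise ValueError("at least two languages are required")
    categories = sorted(
        {label for labels in labels_by_language.values() for label in labels}
    )
    dists = [distribution(labels, categories) for labels in labels_by_language.values()]
    pairs = list(combinations(dists, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)
```

Under this sketch, a lower score for model-generated morals than for human-written morals on the same stories would correspond to the reduced cross-linguistic variation described above.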
Discussion
These findings highlight a critical gap in the ability of current LLMs to capture the full spectrum of cultural narratives. While they can approximate common moral interpretations, they often fail to reflect the unique values and perspectives inherent in different cultures. This limitation suggests that further research is needed to enhance the cultural sensitivity of LLMs.
Conclusion
By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models. As the field of AI continues to evolve, understanding the nuances of cultural interpretation will be essential for developing models that can genuinely engage with diverse human experiences.
Current LLMs show promising capabilities in moral generation, but their outputs lack the diversity and cultural richness of human interpretations. Future work should address this gap so that models better reflect the multifaceted nature of human narratives.
