Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
arXiv:2503.18018v2
Abstract: Recent research demonstrates that large language models’ (LLMs) mathematical reasoning is culturally sensitive. Testing 14 models from companies such as Anthropic, OpenAI, Google, Meta, DeepSeek, Mistral, and Microsoft across six culturally adapted variants of the GSM8K benchmark reveals significant accuracy drops when math problems are embedded in unfamiliar cultural contexts. The accuracy drops range from 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B), with results statistically significant (p < 0.01, confirmed through McNemar tests), indicating that mathematical reasoning in LLMs is not culturally neutral.
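The abstract names McNemar tests as the significance check but does not describe how they were run. As a minimal sketch of the general technique, assuming per-question correctness is recorded for the same model on the original and the adapted benchmark, an exact two-sided McNemar test can be written with the standard library alone. The function name `mcnemar_exact` and the toy data are illustrative assumptions, not the authors' code or results.

```python
from math import comb


def mcnemar_exact(original_correct, adapted_correct):
    """Exact two-sided McNemar test on paired per-question outcomes.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (b, c, p_value), where b counts questions answered correctly
    only on the original set and c counts those correct only on the
    adapted set.
    """
    if len(original_correct) != len(adapted_correct):
        raise ValueError("paired lists must have equal length")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for o, a in zip(original_correct, adapted_correct) if o and not a)
    c = sum(1 for o, a in zip(original_correct, adapted_correct) if a and not o)
    n = b + c
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up): 12 paired questions for one model.
original = [True] * 10 + [False] * 2
adapted = [False] * 8 + [True] * 3 + [False] * 1
b, c, p_value = mcnemar_exact(original, adapted)
print(f"b={b}, c={c}, p={p_value:.4f}")
```

The test ignores questions the model gets right (or wrong) in both settings, which is why it suits a paired design where the same questions appear in both variants.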
To create these culturally adapted variants for Haiti, Moldova, Pakistan, Solomon Islands, Somalia, and Suriname, researchers systematically replaced cultural entities such as names, foods, and places in 1,198 GSM8K questions, while preserving all mathematical operations and numerical values. A quantitative error analysis of 18,887 instances reveals that cultural adaptation significantly affects broader reasoning patterns, with mathematical reasoning errors comprising 54.7% and calculation errors 34.5% of overall failures.
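The adaptation step described above (swap cultural entities, keep the math fixed) can be illustrated with a short sketch. The mapping, the sample question, and the helper names `extract_numbers` and `adapt_question` are hypothetical and chosen only for illustration; the paper's actual pipeline may differ.

```python
import re

# Hypothetical entity mapping for one target culture (illustrative only).
SWAPS = {"Janet": "Amina", "muffins": "samosas", "market": "bazaar"}


def extract_numbers(text):
    """Return every integer or decimal literal in the text, in order."""
    return re.findall(r"\d+(?:\.\d+)?", text)


def adapt_question(question, swaps):
    """Replace whole-word cultural entities and verify the numbers are unchanged."""
    # Longest keys first so multi-word entities win over their substrings.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    adapted = pattern.sub(lambda m: swaps[m.group(0)], question)

    # Guard: the substitution must not alter any numerical value.
    if extract_numbers(question) != extract_numbers(adapted):
        raise ValueError("numerical values changed during adaptation")
    return adapted


question = "Janet bakes 12 muffins and sells 5 at the market for $2 each."
adapted = adapt_question(question, SWAPS)
print(adapted)
```

The number check is a cheap safeguard: if a replacement string accidentally contains or removes a digit, the adapted question is rejected instead of silently changing the problem.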
Key Findings
- Performance Variations: Model accuracy varied with the cultural context in which a problem was framed, even though the adapted variants preserve every mathematical operation and numerical value. Reported drops range from 0.3% (Claude 3.5 Sonnet) to 5.9% (LLaMA 3.1-8B).
- Impact of Cultural Context: Accuracy fell when problems were set in culturally unfamiliar contexts, which suggests that surface details such as names, foods, and places influence problem solving.
- Specific Model Performance: Mistral Saba outperformed some larger models on the Pakistan-adapted problems. This result is attributed to the model’s exposure to Middle Eastern and South Asian training data, which suggests that cultural familiarity can improve performance.
- Need for Diverse Training Data: The findings point to a need for more culturally diverse training data so that LLMs perform robustly across global contexts. Without it, their reliability in real-world applications may suffer.
Conclusion
This study highlights a gap in the current understanding of LLM capabilities: mathematical reasoning in culturally diverse settings. The research calls for reevaluating training methodologies to incorporate a wider range of cultural contexts, which could improve the accuracy and reliability of LLMs in global applications. As LLMs evolve, addressing this cultural sensitivity will be important for deploying them across different societies.
