Language Models Entangle Language and Culture

Conference: ACL 2026
arXiv: 2601.15337
Code: None
Area: Multilingual / Cultural Bias
Keywords: Multilingual LLMs, Cultural Bias, Language-Culture Entanglement, LLM Evaluation, Fairness

TL;DR

This paper evaluates multilingual LLMs using general advice-seeking questions constructed from the WildChat dataset. It finds systematic differences in response quality and cultural context depending on the query language: response quality in low-resource languages is significantly lower than in English, and the choice of language implicitly changes the cultural information the response draws on. A translated version of CulturalBench provides further evidence of this entanglement between language and culture in LLMs.

Background & Motivation

Background: LLMs such as ChatGPT are used by hundreds of millions of people for everyday queries (health, finance, education, etc.) in many languages. Existing multilingual evaluations, such as MMMLU and BenchMAX, focus primarily on multiple-choice (MCQ) tasks like knowledge QA and mathematical reasoning. They measure only accuracy and ignore variation in response style and cultural context.

Limitations of Prior Work: (1) Current multilingual benchmarks evaluate "correctness" rather than "quality"—lacking assessment for open-ended advice-seeking responses; (2) existing bias research triggers bias by embedding cultural cues (names, nationalities, etc.) in prompts, which does not reflect actual user behavior; (3) no systematic work has established the relationship between language selection and cultural context.

Key Challenge: LLMs implicitly bind language and culture during the training process—when querying in a specific language, the model may not only produce lower-quality responses but also apply the cultural framework associated with that language. Consequently, the same problem receives fundamentally different advice depending on the language. This creates systematic disadvantages for users of low-resource languages.

Goal: (1) Construct a dataset of general advice-seeking questions to evaluate response quality differences across languages; (2) verify whether language choice alters the cultural context of responses; (3) further validate the language-culture entanglement hypothesis via a translated version of CulturalBench.

Key Insight: Using culturally neutral open-ended questions (containing no cultural cues) allows the observation of whether changing only the query language leads to shifts in cultural context—this reflects real-world user interaction scenarios more accurately than methods using embedded cultural cues.

Core Idea: Language and culture are entangled in LLMs—selecting different languages not only affects response quality but also implicitly activates different cultural information, causing even culturally neutral general questions to yield culturally biased responses.

Method

Overall Architecture

The evaluation is divided into three parts: (1) constructing 20 culturally neutral advice-seeking questions based on WildChat, translated into 6 languages (English, Chinese, Hindi, Brazilian Portuguese, Swahili, Hebrew); (2) generating responses from 5 multilingual LLMs and evaluating quality differences using LLM-as-Judge; (3) performing cultural classification of responses and verifying language-culture entanglement on the translated CulturalBench.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["Real English Dialogues from WildChat"] --> S1
    subgraph S1["Culturally Neutral Question Construction"]
        direction TB
        B["Filtering + Deduplication<br/>(fuzzywuzzy threshold 60)"] --> C["Embedding Clustering<br/>(Qwen3-0.6b + HDBSCAN)"]
        C --> D["Human Condensation of 20 Culturally Neutral Questions"]
        D --> E["Translation into 6 Languages"]
    end
    S1 --> F["Response Generation by 5 Multilingual LLMs"]
    F --> G["LLM-as-Judge Evaluation Config Optimization<br/>(Original + 8 Refs, Translation Control for Bias)"]
    G --> R1["Cross-lingual Quality Difference Conclusions"]
    F --> S3
    subgraph S3["Dual Validation of Cultural Entanglement"]
        direction TB
        H1["Post-translation Cultural Classification<br/>(Categorization into 6 Cultures)"]
        H2["Translated CulturalBench<br/>(Kruskal-Wallis Test)"]
    end
    S3 --> R2["Language-Culture Entanglement Conclusions"]

Key Designs

1. WildChat-based culturally neutral question construction: Ensuring evaluation questions are close to real queries without cultural suggestions

Existing bias studies often embed cultural cues like names or nationalities in prompts to "fish" for bias, but real users do not typically query this way, so their conclusions may not reflect the model's intrinsic tendencies. This work reverses the approach. It starts from English queries in WildChat, removes high-frequency programming questions, keeps items of 40 to 400 characters, deduplicates with fuzzywuzzy (threshold 60), generates embeddings with Qwen3-0.6b, and clusters them with HDBSCAN. After human analysis of the clusters, 20 questions covering health, education, investment, and job searching were condensed. Crucially, these questions are designed to be culturally neutral: they mention no countries, ethnicities, or cultural references, so any cultural coloring in the responses arises from the model itself.
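The length filter and near-duplicate removal described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzywuzzy score (which fuzzywuzzy scorer was used is not stated here), the function names are hypothetical, and the embedding and HDBSCAN clustering steps are omitted.

```python
from difflib import SequenceMatcher

MIN_CHARS, MAX_CHARS = 40, 400   # length window stated in the text
SIMILARITY_THRESHOLD = 60        # dedup threshold stated in the text (0-100 scale)


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (assumed stand-in for a fuzzywuzzy ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_and_dedupe(queries: list[str]) -> list[str]:
    """Keep queries inside the length window, then greedily drop near-duplicates."""
    kept: list[str] = []
    for query in queries:
        query = query.strip()
        # Step 1: discard queries that are too short or too long.
        if not (MIN_CHARS <= len(query) <= MAX_CHARS):
            continue
        # Step 2: discard a query if it is too similar to one already kept.
        if any(similarity(query, k) >= SIMILARITY_THRESHOLD for k in kept):
            continue
        kept.append(query)
    return kept
```

The greedy pass keeps the first member of each group of similar queries, so the result depends on input order. A threshold of 60 is fairly aggressive and will merge queries that share most of their wording.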

2. LLM-as-Judge evaluation config optimization: Eliminating judge bias before assessing cross-lingual quality

Using an LLM to score multilingual responses carries the risk of linguistic bias: if the judge inherently prefers English, the finding of "lower quality in low-resource languages" could be an artifact of the judge rather than of the responses. To address this, six judging configurations (original vs. translated text, varying reference counts) were tested and compared against human annotations using Pearson correlation and Cohen's Kappa. The final setup used "original language query + original language response + 8 random reference responses" with Cohere Command-A as the judge. A critical control experiment showed that English responses translated into Hindi still scored higher than native Hindi responses translated into English, indicating that the score gap stems from content quality rather than from the judge's linguistic preference.
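The two agreement measures used to select a judging configuration can be sketched in plain Python. This is an illustrative implementation of the textbook formulas, assuming both raters score the same items on a shared discrete scale. The function names and the example scores are hypothetical and are not data from the paper.

```python
from collections import Counter
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(1 for p, q in zip(a, b) if p == q) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up scores for six responses (illustration only).
human_scores = [1, 2, 3, 4, 5, 3]
judge_scores = [1, 2, 3, 4, 5, 4]
print(round(pearson(human_scores, judge_scores), 3))
print(round(cohen_kappa(human_scores, judge_scores), 3))
```

Under this kind of comparison, the configuration whose judge scores track the human scores most closely on both measures would be the one to keep. Both functions divide by zero on degenerate input (constant scores, or raters who each use a single label), which a real pipeline would need to guard against.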

3. Dual validation of cultural entanglement: Showing that "switching language = switching culture" rather than just quality degradation

Quality differences alone only show that low-resource languages yield "poorer" responses, which is insufficient to prove that language and culture are entangled. Two independent validations were therefore added. First, post-translation classification: all non-English responses were translated into English and classified by an LLM-as-Judge into six cultures (Western, Indian, Chinese, African, Latin American, Jewish). Even without access to the original language, the judge could identify the cultural source; Hindi queries most frequently resulted in Indian culture labels, and Chinese queries in Chinese labels. Second, Qwen3-14B was evaluated on a translated CulturalBench (750+ questions, 29 regions). Accuracy on the same cultural knowledge questions varied significantly across languages (Kruskal-Wallis \(H=45.52\), \(p=1.14\times10^{-8}\)). To rule out general sensitivity to perturbation, a random string control was used, which showed no significant performance change (\(H=1.02\), \(p=0.80\)). Together, these two lines of evidence, one from response content and one from accuracy, support the claim that language choice alters cultural content.
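The post-translation classification step boils down to tallying the judge's culture labels per query language. A minimal sketch of that tally is shown below, assuming the judge's output is available as `(query_language, culture_label)` pairs. The record format, the function name, and the toy records are hypothetical and are not results from the paper.

```python
from collections import Counter, defaultdict


def label_shares(records):
    """Return, for each query language, the share of responses per culture label."""
    counts = defaultdict(Counter)
    for language, label in records:
        counts[language][label] += 1
    shares = {}
    for language, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[language] = {label: n / total for label, n in label_counts.items()}
    return shares


# Toy records for illustration only.
toy_records = [
    ("Hindi", "Indian"), ("Hindi", "Western"),
    ("Chinese", "Chinese"), ("Chinese", "Chinese"), ("Chinese", "Western"),
]
for language, dist in label_shares(toy_records).items():
    print(language, dist)
```

If language and culture were independent, the label distribution would look similar for every query language. A distribution that shifts toward the culture associated with each language is the pattern the paper reports.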

Loss & Training

This is an evaluation study and does not involve model training. Cross-lingual differences were validated for statistical significance using the Kruskal-Wallis non-parametric test.
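For reference, the Kruskal-Wallis H statistic can be computed from pooled ranks as sketched below. This is the textbook formula with the standard tie correction, written in plain Python for illustration. It is not the authors' code, and the grouping of scores by language is an assumption about how the test was applied.

```python
def kruskal_wallis_h(groups: list[list[float]]) -> float:
    """Kruskal-Wallis H statistic with the standard tie correction.

    Each inner list holds the scores of one group (e.g., one query language).
    """
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                                # size of this tie block
        rank_of[pooled[i]] = (i + 1 + j) / 2     # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    rank_sum_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)

    # Tie correction; it equals 1 when all values are distinct.
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction
```

Because the test works on ranks, it makes no normality assumption about judge scores, which suits bounded, discrete ratings. The function is undefined when every pooled value is identical, since the tie correction is then zero.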

Key Experimental Results

Main Results

Kruskal-Wallis Test for Cross-lingual Quality Differences

| Model | H-statistic | p-value | Significance |
|---|---|---|---|
| Cohere-Aya-32B | 712.80 | \(8.39\times10^{-152}\) | Highly Significant |
| Cohere-Aya-8B | 721.13 | \(1.33\times10^{-153}\) | Highly Significant |
| Magistral-Small | 610.81 | \(9.33\times10^{-130}\) | Highly Significant |
| Qwen3-14B | 928.91 | \(1.48\times10^{-198}\) | Highly Significant |
| Sarvam-m | 899.84 | \(2.89\times10^{-192}\) | Highly Significant |

All models performed best in English, while Hindi, Swahili, and Hebrew consistently showed poorer performance.

Ablation Study

Translated CulturalBench vs. Random Perturbation (Qwen3-14B)

| Condition | H-statistic | p-value | Conclusion |
|---|---|---|---|
| Cross-lingual | 45.52 | \(1.14\times10^{-8}\) | Significant Difference |
| Random String | 1.02 | 0.80 | No Significant Difference |
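A Kruskal-Wallis p-value is usually obtained from the chi-square approximation with \(k-1\) degrees of freedom for \(k\) groups. The sketch below implements the chi-square upper-tail probability for integer degrees of freedom using closed-form series, and applies it to the cross-lingual row under the assumption of six language groups (df = 5). The function name is hypothetical, and the group count is an assumption rather than something this summary states for the CulturalBench test.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2)
    #         * sum_{r=1}^{(df-1)/2} x^(r - 1/2) / (1 * 3 * ... * (2r - 1))
    term, total = sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(x / 2)) + sqrt(2 / pi) * exp(-x / 2) * total


# Cross-lingual row: H = 45.52, assumed df = 5 (six language groups).
print(f"{chi2_sf(45.52, 5):.3g}")
```

Under the df = 5 assumption, the computed value comes out at roughly \(1.14\times10^{-8}\), in line with the reported p-value. The random-string row is not checked here because the number of groups in that test is not stated in this summary.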

Key Findings

  • All 5 models performed significantly worse in at least one language; English was consistently superior.
  • Cohere-Aya-32B showed better cross-lingual consistency than Cohere-Aya-8B, suggesting larger models are more stable across languages.
  • Although Sarvam-m and Magistral share the same base (Mistral-small-3.1-24B), different fine-tuning strategies led to varying language strengths—Sarvam-m performed better in English and Hindi, while Magistral was stronger in Chinese and Portuguese.
  • Cultural classification experiments showed that Hindi queries led to the highest proportion of Indian culture classifications and Chinese queries to Chinese culture; cultural traits remained identifiable even after translation to English.

Highlights & Insights

  • Using culturally neutral questions to reveal language-culture entanglement is an ingenious experimental design—it eliminates confounding factors like manually injected cultural cues, making the findings more persuasive.
  • The control experiment for judge bias (evaluating translated responses) is a methodological strength, addressing a potential confounder often overlooked in multilingual evaluations.
  • The discovery of language-culture entanglement has direct practical implications for LLM deployment: users may receive culturally biased advice simply by using their native language, such as investment advice implicitly favoring habits associated with that language's culture.

Limitations & Future Work

  • Evaluation was limited to small-to-medium-scale open-source models (up to 32B); larger models may exhibit different behaviors.
  • The 20 questions have limited coverage; although based on real distributions, the sample size is small.
  • Reliance on LLM-as-Judge remains a factor; despite validation, systematic biases may persist.
  • Only 6 languages were covered; the performance of more low-resource languages needs exploration.
  • The mechanism is not fully explored: identifying the root cause of language-culture entanglement (training data distribution? tokenizer?) requires interpretability analysis.

Related Work

  • vs MMMLU/BenchMAX: These benchmarks evaluate MCQ accuracy, whereas this work assesses open-ended response quality and cultural context, a dimension that existing benchmarks miss.
  • vs Bąk et al. / Schlicht et al.: They evaluate multilingual bias in specific domains (email/medical), whereas this work covers broader general queries.
  • vs IndQA (OpenAI): Similar focus but limited to Indian languages; this work covers multiple regions and languages and draws a general conclusion about language-culture entanglement.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematically reveals entanglement via neutral questions for the first time, though the method focuses on evaluation rather than a solution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive approach involving multiple models, languages, statistical tests, judge bias controls, and random perturbation controls.
  • Writing Quality: ⭐⭐⭐⭐ Logical flow with step-by-step experimental progression.
  • Value: ⭐⭐⭐⭐ Direct guidance for multilingual LLM fairness and deployment, though the lack of a proposed solution slightly reduces practical utility.