Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon
Conference: ACL 2025
arXiv: 2506.01675
Code: GitHub
Area: Multilingual Translation
Keywords: Cross-lingual transfer, cultural knowledge, low-resource languages, language adaptation, frequency hypothesis
TL;DR
Using an interpretable experimental framework, this study investigates how cultural knowledge transfers across languages while an LLM is adapted to a new language. Transfer is bidirectional between English and the high-resource languages (Chinese, Korean), but asymmetric for the low-resource languages (Tibetan, Mongolian): knowledge flows mainly from the low-resource language into English, with limited flow in the reverse direction. The authors propose a frequency hypothesis to explain this phenomenon.
Background & Motivation
Large Language Models (LLMs) struggle to handle global cultural diversity. Existing research primarily evaluates whether LLMs possess cultural knowledge of non-English communities, but little is known about how that knowledge is acquired, especially in multilingual settings. This paper focuses on the following core problems:
Opacity of cross-lingual transfer: The training data and processes of LLMs are opaque, making it difficult to conduct interpretable experiments to trace the sources and influencing factors of knowledge transfer.
Factor confounding issue: When the performance of cultural knowledge question-answering in a language improves, it is difficult to distinguish whether this is due to enhanced language proficiency or cross-lingual knowledge transfer.
Underrepresented culture of low-resource languages: Existing cultural evaluations rarely focus on communities using low-resource languages (such as Tibetan and Mongolian), leaving their cultural knowledge severely underrepresented in LLMs.
The authors point out that understanding the dynamics of cross-lingual cultural knowledge transfer is of great significance for building culture-aware models, especially for serving low-resource language communities.
Method
Overall Architecture
A three-step research framework is proposed: English Pre-training → Controlled Continued Pre-training → Bilingual Evaluation. The framework has three core characteristics:
- Transparent training data: The model is trained from scratch using filtered English Wikipedia data.
- Decoupled transfer effects: Two continued pre-training settings are designed to distinguish cross-lingual transfer from language proficiency improvement.
- Bilingual parallel evaluation: Bilingual parallel probing questions are used to analyze bidirectional transfer.
Key Designs
1. Transparent Pre-training
- Function: Trains a 0.5B-parameter model from scratch, instead of using closed-source LLMs.
- Mechanism: The model is pre-trained on English Wikipedia (5B tokens) with all non-Latin characters filtered out, so that knowledge transfer can be observed cleanly and any acquired knowledge can be traced back to its source corpus.
- Design Motivation: If pre-trained large models are used, the training data is opaque, making it impossible to know where the cultural knowledge originates.
- The model architecture adopts Qwen-2.5-0.5B.
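The character filtering described above can be illustrated with a minimal sketch. The source does not say how the authors implemented the filter, so the rule here (drop every letter or combining mark whose Unicode name does not start with `LATIN`, and keep digits, punctuation, and whitespace) and the function names `keep_char` and `filter_non_latin` are assumptions made for illustration.

```python
import unicodedata


def keep_char(ch: str) -> bool:
    """Keep non-letters as-is; keep letters and marks only if they are Latin."""
    category = unicodedata.category(ch)
    # Categories starting with "L" are letters, "M" are combining marks.
    if category[0] not in ("L", "M"):
        return True
    # unicodedata.name() returns the default "" for unnamed code points.
    return unicodedata.name(ch, "").startswith("LATIN")


def filter_non_latin(text: str) -> str:
    """Remove non-Latin letters from `text` (hypothetical helper)."""
    # NFC composes sequences such as "e" + combining acute into a single
    # Latin letter, so accented Latin text survives the filter.
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if keep_char(ch))


print(filter_non_latin("Seoul (서울) is the capital of South Korea."))
```

Under this rule punctuation is kept regardless of script, so script-specific punctuation would survive; a stricter filter would need a script-aware check.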
2. Continued Pre-training with Decoupled Transfer Effects
- Function: Designs two continued pre-training settings: one promoting cross-lingual transfer, and the other minimizing transfer.
- Mechanism:
    - With bridges setting: Parallel sentence pairs are added to the continued pre-training data, where each parallel pair is concatenated and mixed with other monolingual data.
    - Without bridges setting: The same data is used, but the co-occurrence of parallel sentence pairs is deliberately prevented by splitting the two sentences in each pair into independent documents before mixing them into the training data.
- Design Motivation: The performance gap between the two settings can serve as an estimate of the cross-lingual transfer effect. Non-Latin script languages (Korean, Chinese, Tibetan, Mongolian) are selected to maximize isolation from English.
- Continued pre-training runs for 1500 steps with a batch size of 0.5M tokens, i.e. about 0.75B tokens per run.
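The difference between the two settings can be sketched as follows. This is a toy illustration, not the authors' data pipeline: the function name `build_corpus`, the newline separator used to join a pair, and the seeded shuffle are all assumptions.

```python
import random


def build_corpus(parallel_pairs, monolingual_docs, with_bridges, seed=0):
    """Return a shuffled list of training documents (hypothetical helper).

    parallel_pairs: list of (english_sentence, non_english_sentence) tuples.
    monolingual_docs: list of monolingual documents in either language.
    """
    docs = list(monolingual_docs)
    for english, non_english in parallel_pairs:
        if with_bridges:
            # With bridges: both sides of the pair share one document,
            # so the model can see them in the same context.
            docs.append(english + "\n" + non_english)
        else:
            # Without bridges: same sentences, but as independent documents.
            docs.extend([english, non_english])
    random.Random(seed).shuffle(docs)
    return docs


pairs = [("Hello", "你好"), ("Thanks", "谢谢")]
mono = ["An English document.", "一个中文文档。"]
print(build_corpus(pairs, mono, with_bridges=True))
print(build_corpus(pairs, mono, with_bridges=False))
```

Both settings contain exactly the same sentences, so any performance gap can be attributed to co-occurrence rather than to extra data. Note that if documents are later packed into fixed-length training sequences, split sentences could still land in the same context by chance; this sketch does not guard against that.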
3. Bilingual Parallel Evaluation
- Function: Evaluates the model using English and non-English versions of cultural probing questions, respectively.
- Mechanism:
    - Non-English question testing: The performance difference between the "with bridges" and "without bridges" settings represents the amount of "English → non-English" transfer.
    - English question testing: The performance difference between the "with bridges" and "without bridges" settings represents the amount of "non-English → English" transfer.
- Evaluation adopts a cloze-style format, which is suitable for small models trained from scratch with limited capability.
- Korean cultural questions are derived from the CLIcK dataset; Chinese ethnic minority cultural questions are generated by GPT-4o from the book Chinese Ethnic Minorities and then verified and translated by native speakers.
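The two transfer estimates described above reduce to a pair of accuracy gaps. The sketch below shows that arithmetic; the function name, the dictionary keys, and the accuracy values are placeholders chosen for illustration and are not taken from the paper.

```python
def transfer_estimates(accuracy):
    """Estimate transfer in both directions from four accuracy values.

    `accuracy` maps (setting, question_language) to an accuracy in [0, 1].
    The key names used here are illustrative assumptions.
    """
    def gap(question_language):
        return (accuracy[("with_bridges", question_language)]
                - accuracy[("without_bridges", question_language)])

    return {
        # Gain on non-English questions: knowledge that came from English.
        "english_to_non_english": gap("non_english"),
        # Gain on English questions: knowledge that came from the other language.
        "non_english_to_english": gap("english"),
    }


# Placeholder accuracies for illustration only; not results from the paper.
example = {
    ("with_bridges", "english"): 0.40,
    ("without_bridges", "english"): 0.30,
    ("with_bridges", "non_english"): 0.25,
    ("without_bridges", "non_english"): 0.25,
}
print(transfer_estimates(example))
```

With these placeholder values the gap is positive only on the English questions, which is the pattern the paper reports for the low-resource languages.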
Frequency Hypothesis
- Hypothesis content: Cultural knowledge that appears with higher frequency in the training corpus is more likely to undergo cross-lingual transfer.
- Verification method:
    - For each cultural probing question, the top 50 most relevant documents are retrieved from the corpus.
    - Llama-3.1-70B is used to judge whether the retrieved documents contain the cultural knowledge in the question.
    - A "cultural density" metric is introduced: number of knowledge occurrences / total documents in the corpus.
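A minimal sketch of the cultural density computation follows. It assumes the retrieval and LLM judging steps have already produced one boolean per retrieved document. The source only gives the ratio "occurrences / total documents"; averaging that ratio over a community's questions is an assumption, as are the names `cultural_density` and `TOP_K`.

```python
TOP_K = 50  # number of documents retrieved per question, as stated above


def cultural_density(judgments, corpus_size):
    """Average per-question density for one community and one corpus.

    judgments: dict mapping question id -> list of booleans, one per
        retrieved document, True if the judge model says the document
        contains the knowledge asked about.
    corpus_size: total number of documents in the corpus.
    """
    if not judgments:
        raise ValueError("no questions provided")
    per_question = []
    for question_id, flags in judgments.items():
        if len(flags) > TOP_K:
            raise ValueError(f"{question_id}: more than {TOP_K} judged documents")
        # Occurrences of this piece of knowledge, normalised by corpus size.
        per_question.append(sum(flags) / corpus_size)
    # Assumption: aggregate by taking the mean over questions.
    return sum(per_question) / len(per_question)


toy_judgments = {"q1": [True, False, True], "q2": [False, False]}
print(cultural_density(toy_judgments, corpus_size=1000))
```

Because only the top 50 retrieved documents are judged, the occurrence count is a lower bound on the true frequency, which is one reason the paper lists the retrieval system as a limitation.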
Key Experimental Results
Main Results: Direction of Cross-Lingual Transfer
| Cultural Community | Associated Language | English → Non-English Transfer | Non-English → English Transfer |
|---|---|---|---|
| Korean | Korean (High-resource) | Significant ✓ | Significant ✓ |
| Han Chinese | Chinese (High-resource) | Significant ✓ | Significant ✓ |
| Tibetan | Tibetan (Low-resource) | Insignificant ✗ | Significant ✓ |
| Mongolian | Mongolian (Low-resource) | Insignificant ✗ | Significant ✓ |
Key Findings: High-resource languages exhibit bidirectional transfer, while low-resource languages show asymmetric transfer (mainly in the non-English → English direction).
Ablation Study: Cultural Density Analysis
| Cultural Community | Density in English Corpus | Density in Non-English Corpus |
|---|---|---|
| Korean | 2.86e-7 | 5.21e-7 |
| Han Chinese | 2.97e-7 | 2.84e-7 |
| Tibetan | 1.49e-7 | 9.19e-6 |
| Mongolian | 1.55e-7 | 3.72e-6 |
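For the low-resource languages, cultural density in the non-English corpus is far higher than in the English corpus: roughly 62 times for Tibetan and 24 times for Mongolian. For the high-resource languages the two densities are comparable (about 1.8 times for Korean and nearly equal for Han Chinese). Under the frequency hypothesis, this explains the asymmetry: Tibetan and Mongolian cultural knowledge is frequent enough in the non-English corpus to transfer into English, but too sparse in the English corpus to transfer the other way.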
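The density ratios can be recomputed directly from the table. The snippet below is a small arithmetic check using the table's values; the variable names are illustrative.

```python
# (density in English corpus, density in non-English corpus), from the table.
density = {
    "Korean": (2.86e-7, 5.21e-7),
    "Han Chinese": (2.97e-7, 2.84e-7),
    "Tibetan": (1.49e-7, 9.19e-6),
    "Mongolian": (1.55e-7, 3.72e-6),
}

# Ratio > 1 means the knowledge is denser in the non-English corpus.
ratios = {name: non_en / en for name, (en, non_en) in density.items()}

for name, ratio in ratios.items():
    print(f"{name}: non-English / English density = {ratio:.1f}x")
```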
Instance-level Analysis
| Transfer Direction | Average Occurrences of Successfully Transferred Knowledge in Source Corpus | Overall Average Occurrences |
|---|---|---|
| English → Non-English | 9.0 (English Corpus) | 4.2 |
| Non-English → English | 4.7 (Non-English Corpus) | 2.2 |
In both directions, successfully transferred knowledge appears in the source-language corpus about twice as often as the overall average (9.0 vs. 4.2 and 4.7 vs. 2.2), further supporting the frequency hypothesis.
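The comparison in this table amounts to two means over per-question records. The sketch below assumes each question has an occurrence count in the source corpus and a flag for whether it was successfully transferred; how the paper decides that a question counts as transferred is not stated here, so the flag is simply taken as given. The function name and the toy records are illustrative, not data from the paper.

```python
from statistics import mean


def occurrence_summary(records):
    """Return (mean occurrences of transferred items, overall mean).

    records: list of (occurrences_in_source_corpus, transferred) tuples.
    The first value is None when no item was transferred.
    """
    overall = mean(count for count, _ in records)
    transferred = [count for count, was_transferred in records if was_transferred]
    return (mean(transferred) if transferred else None), overall


# Toy records for illustration only; not data from the paper.
toy_records = [(10, True), (8, True), (1, False), (1, False)]
print(occurrence_summary(toy_records))
```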
Key Findings
- Transfer is asymmetric for low-resource languages: Their cultural knowledge transfers readily to English, but cultural knowledge in English rarely transfers back to them.
- Frequency is a key factor: Cultural knowledge with higher frequency in the corpus is more likely to undergo cross-lingual transfer.
- Divergent roles of cross-lingual bridges: For high-resource languages, parallel sentence pairs are effective in both directions; for low-resource languages, they are primarily effective in the non-English → English direction.
- Bridge settings can compensate for forgetting: For most languages, English performance under the with-bridges setting continues to improve, indicating that cross-lingual transfer can mitigate the forgetting of English capability caused by continued pre-training.
Highlights & Insights
- Interpretable experimental design: Training models from scratch with transparent data and controlled settings offers a "clean" experimental environment for studying knowledge transfer, which is of clear methodological value.
- First systematic study on the cross-lingual transfer mechanism of cultural knowledge: Fills an important gap regarding "how LLMs acquire cultural knowledge."
- Concise and explanatory frequency hypothesis: Connects cross-lingual transfer with frequency effects in monolingual knowledge acquisition, forming a unified explanation.
- Attention to genuinely low-resource cultures: The study selects Tibetan and Mongolian, communities that have been historically overlooked in NLP cultural research.
Limitations & Future Work
- Model scale limitations: Only a 0.5B model is used (a computational cost trade-off for 16 settings); whether the conclusions generalize to larger models remains to be verified.
- Limited cultural coverage: The experimental design requires non-Indo-European languages with non-Latin scripts, which significantly restricts the choices; meanwhile, question collection and validation require substantial manual effort.
- Imperfect retrieval system: Cultural density analysis relies on a retrieval system, which may introduce inaccuracies.
- Only 4 cultures/languages: The sample is small, which limits how far the findings can be generalized.
- No exploration of mitigation methods for asymmetric transfer: The study only analyzes the phenomenon and its causes, without proposing solutions on how to improve the transfer of cultural knowledge for low-resource languages.
Related Work & Insights
- The results align with the findings of Kandpal et al. (2023) regarding "LLMs struggling to learn tail knowledge."
- The frequency hypothesis may generalize to other types of knowledge, not only cultural knowledge.
- Provides a new perspective for enhancing the cultural awareness of LLMs in the future: increasing the exposure frequency of cultural knowledge in low-resource language data.
- Offers important references for designing continued pre-training strategies for multilingual models.
Rating
- Novelty: ⭐⭐⭐⭐ (First systematic study of cross-lingual transfer mechanisms of cultural knowledge, ingenious experimental design)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Controlled experiments + frequency analysis + instance-level verification, complete chain of evidence)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear problem definition, precise framework description, rigorous logic)
- Value: ⭐⭐⭐⭐ (Important insights into how LLMs acquire cultural knowledge, highly significant for low-resource language research)