Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning¶
Conference: ACL 2025 (Findings)
arXiv: 2502.11364
Code: GitHub
Area: Multilingual Translation
Keywords: multilingual ICL, cross-lingual transfer, low-resource languages, in-context learning, prompting strategies
TL;DR¶
Through a systematic analysis of multilingual ICL strategies, this study shows that mixing demonstrations from several high-resource languages (HRLs) in the prompt generally outperforms English-only demonstrations, with particularly large improvements on low-resource languages (LRLs) (e.g., gains of 8.9 to 12.6 points in average LRL accuracy on Llama3.1-8B). Intriguingly, even adding context-irrelevant non-English sentences to the prompt yields measurable gains, suggesting that multilingual exposure itself is effective.
Background & Motivation¶
Background: Multilingual LLMs perform acceptably on high-resource languages (HRLs), sometimes even approaching English levels, but lag significantly behind on low-resource languages (LRLs). In-Context Learning (ICL) represents a primary approach for enhancing cross-lingual performance.
Limitations of Prior Work: Two common ICL strategies suffer from critical limitations: (a) translating queries to English and then performing English ICL introduces substantial translation loss, and high-quality translation systems are unavailable for extremely low-resource languages; (b) utilizing demonstrations in the target language is practically unfeasible due to data scarcity.
Key Challenge: While Shi et al. (2023) observed the effectiveness of mixing HRL demonstrations, a systematic understanding of why it works remains lacking—is the multilingual information itself beneficial, or are specific language combinations more critical?
Goal: Systematically analyze the underlying mechanism of multilingual ICL—when it is effective, why it works, and which languages are most beneficial.
Key Insight: Carefully design controlled experiments—varying only the presentation language of semantically equivalent demonstrations—and introduce Context-Irrelevant Sentences (CIS) to decouple the "multilingual exposure effect" from the "semantic information effect."
Core Idea: Multilingual exposure by itself can activate the cross-lingual capabilities of MLLMs, and mixing HRL demonstrations stands as the most robust and practically feasible cross-lingual ICL strategy currently available.
Method¶
Overall Architecture¶
Four ICL paradigms are systematically compared across eight MLLMs on four multilingual benchmarks (MGSM, XCOPA, XL-WiC, and XQuAD), complemented by CIS ablation experiments to isolate the effects of multilingual exposure.
Key Designs¶
1. Controlled Comparison of Four ICL Paradigms
| Paradigm | Description | Feasibility |
|---|---|---|
| English | 6-shot entirely English demonstrations | Always feasible |
| Monolingual | 6-shot using a single non-English HRL (e.g., Chinese/Japanese) | Requires HRL data |
| Multilingual | 6-shot randomly mixed from a list of HRLs | Recommended strategy |
| Native | 6-shot using the target language (ideal upper bound) | Typically unfeasible for LRLs |
Core control variable: Demonstrations of the same index are semantically equivalent across all languages, with only the presentation language varied.
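The four paradigms differ only in which language each demonstration slot is shown in. The following is a minimal sketch of that selection logic, not the authors' code: the toy pool `POOL`, the HRL list `HRLS`, the language code `sw` as a stand-in LRL, and the function names are all hypothetical, and whether English belongs to the mixing list is an assumption here (it is excluded).

```python
import random

LANGS = ["en", "zh", "ja", "fr", "sw"]  # "sw" stands in for a low-resource target
HRLS = ["zh", "ja", "fr"]               # assumed non-English HRL list

# Toy parallel pool: entry i holds the "same" demonstration in every language,
# mirroring the control that same-index demos are semantically equivalent.
POOL = [{lang: f"[{lang}] demo {i}" for lang in LANGS} for i in range(6)]


def select_languages(paradigm, k, target_lang, mono_lang="zh", seed=0):
    """Return the presentation language for each of the k demonstration slots."""
    if paradigm == "english":
        return ["en"] * k
    if paradigm == "monolingual":
        return [mono_lang] * k
    if paradigm == "native":
        return [target_lang] * k
    if paradigm == "multilingual":
        rng = random.Random(seed)  # seeded so the mix is reproducible
        return [rng.choice(HRLS) for _ in range(k)]
    raise ValueError(f"unknown paradigm: {paradigm}")


def build_prompt(paradigm, query, target_lang, k=6, **kwargs):
    """Join k same-index demonstrations (language varies) with the query."""
    langs = select_languages(paradigm, k, target_lang, **kwargs)
    demos = [POOL[i][lang] for i, lang in enumerate(langs)]
    return "\n\n".join(demos + [query])
```

Because the demonstration index is fixed per slot, any accuracy difference between two paradigms in this setup can be attributed to presentation language rather than content.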
2. Context-Irrelevant Sentences (CIS) Ablation Study
- Prepending a completely task-irrelevant non-English sentence (sourced from FLORES-101) to each English demonstration.
- Design Motivation: Decouple the two factors of "semantic information" and "language exposure."
- CIS Variants: CIS-Fr / CIS-Ja / CIS-Zh / CIS-Multi.
- Key Finding: Merely adding irrelevant non-English text is often sufficient to improve LRL accuracy.
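The CIS setup can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the sentences in `CIS_POOL` are made-up placeholders rather than actual FLORES-101 entries, the helper name `add_cis` is hypothetical, and the irrelevant sentence is placed before each demonstration.

```python
from itertools import cycle

# Placeholder sentences standing in for FLORES-101 samples (hypothetical).
CIS_POOL = {
    "en": "The market opens early on Saturday morning.",
    "fr": "Le marché ouvre tôt le samedi matin.",
    "ja": "図書館は駅の近くにあります。",
    "zh": "这座桥建于一百年前。",
}
MULTI_LANGS = ["fr", "ja", "zh"]  # assumed language mix for the CIS-Multi variant


def add_cis(english_demos, variant):
    """Prefix each English demonstration with one task-irrelevant sentence.

    variant is a language code ("en", "fr", "ja", "zh") or "multi",
    which cycles through MULTI_LANGS across demonstrations.
    """
    langs = MULTI_LANGS if variant == "multi" else [variant]
    return [
        f"{CIS_POOL[lang]}\n{demo}"
        for demo, lang in zip(english_demos, cycle(langs))
    ]
```

The demonstrations themselves stay in English in every variant, so only the language of the unrelated sentence changes between conditions.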
3. Exceptional Effectiveness of Non-Latin Script HRLs
- Chinese and Japanese deliver the best cross-lingual transfer effects when utilized as Monolingual demonstrations.
- Chinese outperforms English in 20 out of 30 LRL pairwise comparisons.
- Probable cause: They prompt the model to project inputs into a more "language-agnostic" representation space.
Experimental Setup¶
- Eight MLLMs: Llama3/3.1-8B, Qwen2/2.5-7B, Mistral-NeMo-12B, Aya-Expanse-8B, GPT-3.5, GPT-4o-mini
- 6-shot demonstrations, greedy decoding
- Statistical Testing: McNemar's test (continuity-corrected version, Edwards 1948), reporting p-value significance levels
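For reference, a continuity-corrected McNemar test on paired per-item correctness can be written with the standard library alone. This is a sketch of the textbook statistic, not the authors' evaluation script; the function name and input format are assumptions.

```python
import math


def mcnemar_cc(correct_a, correct_b):
    """Continuity-corrected McNemar test for two systems on the same items.

    correct_a / correct_b are equal-length sequences of booleans.
    Returns (chi-square statistic, two-sided p-value).
    """
    # Discordant pairs: only items where exactly one system is right matter.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0, 1.0
    # Edwards' correction subtracts 1 from |b - c|; clamped at zero here.
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

With 10 items that only system A answers correctly and 2 that only system B answers correctly, the statistic is 49/12 (about 4.08) and the p-value falls just below 0.05.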
Key Experimental Results¶
Main Results: LRL Average Accuracy (Representative Models)¶
| Model | ICL Paradigm | MGSM | XL-WiC | XCOPA |
|---|---|---|---|---|
| Llama3.1-8B | English | 57.10 | 44.42 | 55.91 |
| Llama3.1-8B | Multilingual | 66.00 (+8.9***) | 57.05 (+12.6***) | 66.11 (+10.2***) |
| Llama3.1-8B | Native | 68.50 (+11.4***) | 62.88 (+18.5***) | 71.63 (+15.7***) |
| Qwen2-7B | English | 43.70 | 48.46 | 62.29 |
| Qwen2-7B | Multilingual | 47.50 (+3.8**) | 56.28 (+7.8***) | 63.83 (+1.5*) |
| Qwen2-7B | Native | 55.40 (+11.7***) | 57.76 (+9.3***) | 67.63 (+5.3***) |
| GPT3.5-turbo | English | 44.20 | 53.72 | 63.43 |
| GPT3.5-turbo | Multilingual | 51.00 (+6.8***) | 55.90 (+2.2**) | 62.71 (-0.7) |
CIS Ablation: Effects of Irrelevant Non-English Sentences (Llama3.1-8B LRL Avg)¶
| Setup | MGSM | XL-WiC | XCOPA | XQuAD |
|---|---|---|---|---|
| English + CIS-En (baseline) | 55.90 | 47.88 | 55.46 | 68.45 |
| English + CIS-Fr | 52.10 | 52.82*** | 59.40*** | 68.95 |
| English + CIS-Ja | 58.80 | 55.96*** | 59.86*** | 68.90 |
| English + CIS-Zh | 55.00 | 54.68*** | 64.66*** | 69.10 |
| English + CIS-Multi | 62.50*** | 56.03*** | 64.74*** | 69.60*** |
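To make the size of each CIS effect explicit, the differences from the CIS-En baseline can be computed directly from the table above. The snippet only restates the reported numbers; the variable and function names are illustrative.

```python
BENCHMARKS = ["MGSM", "XL-WiC", "XCOPA", "XQuAD"]

# LRL average accuracy for Llama3.1-8B, copied from the table above.
CIS_RESULTS = {
    "CIS-En": [55.90, 47.88, 55.46, 68.45],  # baseline
    "CIS-Fr": [52.10, 52.82, 59.40, 68.95],
    "CIS-Ja": [58.80, 55.96, 59.86, 68.90],
    "CIS-Zh": [55.00, 54.68, 64.66, 69.10],
    "CIS-Multi": [62.50, 56.03, 64.74, 69.60],
}


def deltas_vs_baseline(results, baseline="CIS-En"):
    """Return the per-benchmark difference (in points) from the baseline row."""
    base = results[baseline]
    return {
        name: [round(v - b, 2) for v, b in zip(row, base)]
        for name, row in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, row in deltas_vs_baseline(CIS_RESULTS).items():
        print(name, dict(zip(BENCHMARKS, row)))
```

The differences show that CIS-Fr and CIS-Zh fall below the baseline on MGSM, while CIS-Ja and CIS-Multi are above it on all four benchmarks; the XQuAD differences are small in every variant.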
Monolingual Paradigm Comparisons¶
- Chinese Monolingual outperforms English in 20 out of 30 LRL comparisons.
- Japanese also frequently outperforms English.
- Multilingual (mixed) is more robust than any single HRL, outperforming Chinese Monolingual in 23 out of 30 sets.
Key Findings¶
- Multilingual ICL outperforms English in 23 out of 30 evaluation groups, with the vast majority of improvements reaching statistical significance.
- The Native paradigm performs best but remains impractical—outperforming Multilingual in 27 out of 30 cases, yet acquisition of LRL data is notoriously difficult.
- Non-Latin script HRLs (Chinese/Japanese) are exceptionally effective in driving cross-lingual transfer.
- Merely introducing irrelevant non-English sentences in the prompt boosts LRL performance, indicating that multilingual exposure itself has an activation effect.
- Neurons activated under the Multilingual paradigm overlap most closely with those activated under Native, which offers a preliminary mechanistic explanation for their comparable performance.
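One simple way to quantify neuron overlap between paradigms is Jaccard similarity over sets of activated neurons. The summary does not state which overlap measure the paper uses, so Jaccard is an assumed stand-in here, and the `(layer, neuron)` ids below are invented toy data rather than measured activations.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of neuron ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical sets of activated (layer, neuron) ids per ICL paradigm.
ACTIVE = {
    "english": {(0, 1), (0, 2), (1, 5)},
    "multilingual": {(0, 2), (1, 5), (2, 7), (2, 9)},
    "native": {(1, 5), (2, 7), (2, 9), (3, 4)},
}


def closest_to(reference, active):
    """Return the paradigm whose active set overlaps most with the reference."""
    scores = {
        name: jaccard(neurons, active[reference])
        for name, neurons in active.items()
        if name != reference
    }
    return max(scores, key=scores.get), scores
```

On this toy data, `closest_to("native", ACTIVE)` selects `"multilingual"` (overlap 0.6, versus about 0.17 for `"english"`), which is the pattern the finding describes.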
Highlights & Insights¶
- The finding that "multilingual exposure itself is effective" is surprising. Demonstrated by the CIS experiments, it is both an empirical result and a hint about the internal cross-lingual mechanisms of MLLMs.
- The experimental design is rigorous: semantically equivalent demonstrations control for content, and McNemar's test is used to assess statistical significance.
- High practical value: The Multilingual ICL strategy does not require LRL data, needing only a mixture of HRL demonstrations, making it both simple and highly effective.
- Neuron overlap analysis provides a mechanistic explanation for multilingual ICL, moving beyond simple performance comparisons.
- Clear recommendations for language selection: Prioritize non-Latin script HRLs (Chinese/Japanese), and prefer multilingual mixtures over any single language.
Limitations & Future Work¶
- Limited model scale: The open-source models evaluated are in the 7B to 12B range; the study does not verify whether the multilingual ICL effect remains significant in larger models (e.g., 70B+).
- Restricted task types: The four benchmarks cover reasoning, common sense, word sense disambiguation, and QA, but do not touch upon generative tasks such as translation or summarization.
- Pre-defined HRL list depends on the language distribution profiles of Llama 2/PaLM, which may not translate perfectly to newer models.
- Noise control in CIS experiments: FLORES-101 sentences are constrained to 10–15 words; different sentence lengths might yield different effects.
- Lack of deep explanation for "why non-Latin scripts are more effective": The neuron overlap analysis serves as a preliminary exploration, and causal relationships remain unclear.
Related Work & Insights¶
- Extends the findings of Shi et al. (2023): the evidence for multilingual ICL is broadened from PaLM/Codex to six open-source MLLMs and two commercial models.
- Extends the insights of Turc et al. (2021): non-English languages are not only highly effective during pre-training and fine-tuning, but also within prompting paradigms.
- Paradigmatic value of the CIS experiments: By inserting irrelevant content to decouple the two effects, this ablation setup is highly generalizable to other prompt engineering studies.
- Implications for practical multilingual deployment: In LRL scenarios, rather than expending resources to acquire target-language demonstrations, practitioners can directly employ mixed HRL demonstrations.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The CIS ablation experiments are cleverly designed, and the main discovery ("multilingual exposure itself is effective") is highly impactful.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely comprehensive coverage spanning eight models, four datasets, four or more ICL paradigms, statistical tests, and neuron analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear diagrams, logically presented progressive experiments, and highly persuasive conclusions.
- Value: ⭐⭐⭐⭐ — Direct practical relevance for deploying multilingual LLMs, offering a simple yet highly efficient Multilingual ICL strategy.
| ICL Paradigm | MGSM (Llama3.1) | XL-WiC (Llama3.1) | XCOPA (Llama3.1) |
|---|---|---|---|
| English | 57.10 | 44.42 | 55.91 |
| Multilingual | 66.00 (+8.9***) | 57.05 (+12.6***) | 66.11 (+10.2***) |
| Native (Upper Bound) | 68.50 (+11.4***) | 62.88 (+18.5***) | 71.63 (+15.7***) |
Ablation: Effects of Irrelevant Multilingual Exposure¶
| Configuration | LRL Acc Change | Description |
|---|---|---|
| English-only demos | baseline | Baseline |
| + Irrelevant Chinese/Japanese sentences | roughly +2 to +5 points | Pure language exposure is also beneficial |
| Multilingual demos (full) | roughly +8 to +13 points | Dual activation of both semantics and language |
Key Findings¶
- Multilingual > English in 23 out of 30 cases—a pervasive phenomenon across MLLMs and datasets.
- Non-Latin script HRLs (Chinese/Japanese) are particularly effective—likely activating more language-agnostic representations.
- Mixed HRLs are more robust than a single HRL—preventing bias toward any single language.
- Irrelevant foreign language texts also yield gains—moving beyond the semantic level, possibly by activating cross-lingual attention patterns.
- Native mode performs best but is impractical—multilingual mode stands as the best feasible alternative.
Highlights & Insights¶
- "Irrelevant foreign languages also help" is the most striking discovery. It suggests a possible "language-switching activation" mechanism within MLLMs, in which encountering non-English text triggers a transition to a more general cross-lingual processing mode. This finding has implications for understanding the internal mechanics of MLLMs.
- Practical value for LRLs: A zero-cost improvement—simply mixing HRL demonstrations into the prompt.
- Carefully controlled experiments: Semantically equivalent parallel demonstrations + McNemar's test + irrelevant language ablation.
Limitations & Future Work¶
- Limited model scale: The open-source models are restricted to the 7B–12B range; does the effect still hold for 70B+ models? Larger models' English capabilities may already be sufficiently robust.
- Only four benchmarks evaluated: Generative tasks (e.g., translation, summarization) are not covered, where language matching might be of greater importance.
- Mechanistic interpretation is incomplete: Why is multilingual exposure effective? Beyond the preliminary neuron overlap analysis, further mechanistic interpretability research is still required (e.g., investigating shifts in attention patterns or representation space probing).
- Subjective HRL list definitions: Different models are pre-trained on different datasets, which means the definition of "high-resource" might vary.
- Demonstration quality not considered: All demonstrations are assumed to be of high-quality; how would noisy or poorly translated demonstrations affect performance?
- Fixed at 6-shot: How does the gain from multilingual ICL change with different shot counts?
- Optimal HRL combinations unexplored: Does there exist a selection strategy for a "best HRL set" (e.g., typologically diverse language groupings)?
Related Work & Insights¶
- vs. Shi et al. (2023): They were the first to identify that multilingual ICL is effective on PaLM; this paper systematically verifies it across eight MLLMs and introduces irrelevant language ablation setups.
- vs. Translate-Test (Ahuja et al., 2023): Translate-test strategies are unfeasible for extremely low-resource languages, whereas multilingual ICL does not rely on intermediate translation steps.
Rating¶
- Novelty: ⭐⭐⭐⭐ The discovery that "irrelevant foreign languages also help" is novel, though multilingual ICL itself is not a new concept.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 MLLMs × 4 benchmarks × 4 modes + multiple ablations + statistical testing.
- Writing Quality: ⭐⭐⭐⭐ Exquisitely designed control experiments with clear, well-supported conclusions.
- Value: ⭐⭐⭐⭐ Offers direct guidance for multilingual LLM deployment, showing maximum benefits in LRL scenarios.