Exploring In-context Example Generation for Machine Translation

Conference: ACL 2025
arXiv: 2506.00507
Code: GitHub
Area: Multilingual Translation
Keywords: Machine Translation, In-context Learning, Low-resource Languages, Example Generation, Maximal Marginal Relevance

TL;DR

Proposes DAT (Demonstration Augmentation for Translation), which enables LLMs to automatically generate relevant and diverse source-target sentence pairs as in-context demonstrations without any external resources. DAT outperforms zero-shot and fixed-demonstration few-shot baselines across five low-resource language translation tasks.

Background & Motivation

Background: In-context learning (ICL) in LLMs has demonstrated powerful capabilities in machine translation. Prior work (e.g., R-BM25, CTQScorer) has focused on selecting the optimal demonstrations from an existing human-annotated corpus pool, achieving excellent results for high-resource languages.

Limitations of Prior Work: The aforementioned methods rely on a crucial assumption—the existence of a large-scale, human-annotated source-target parallel corpus pool. For low-resource languages (such as Nepali, Khmer, Pashto, Zulu, and Swahili), this assumption does not hold because public datasets and human annotators are extremely scarce.

Key Challenge: Low-resource languages need ICL the most to compensate for the lack of data, yet they are precisely the ones lacking the demonstration pools required for ICL. While El Mekki & Abdul-Mageed (2025) attempted to use LLMs to generate synthetic parallel data, their approach still requires bilingual lexicons and unannotated texts in the target language.

Goal: Can high-quality in-context demonstrations be generated dynamically for each translation request, relying solely on the linguistic capabilities of the LLM and no external resources?

Key Insight: Based on two key priors identified in previous research—relevance and diversity—LLMs can be guided to generate source-side sentences, which are then filtered using MMR and translated to obtain demonstration pairs.

Core Idea: Let the LLM "create its own tests"—automatically generating relevant and diverse translation prompt demonstrations for each sentence to be translated.

Method

Overall Architecture

Given a user query \(q\) (the sentence to be translated), the process consists of four steps:
1. Generate \(m\) relevant source-side sentences using the LLM (satisfying both relevance and diversity).
2. Select \(k\) optimal sentences using MMR filtering.
3. Translate these \(k\) sentences using the LLM to obtain the target-side responses.
4. Use the \(k\) source-target pairs as in-context demonstrations to translate \(q\).
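The four steps can be sketched as a small driver function. This is only an illustration of the control flow: the names `dat_translate`, `generate`, `select`, and `translate` are hypothetical, and the stub functions stand in for real LLM calls so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures for this sketch; none of these names come from the paper.
Generate = Callable[[str, int], List[str]]                # (query, m) -> candidates
Select = Callable[[str, List[str], int], List[str]]       # (query, candidates, k) -> chosen
Translate = Callable[[str, List[Tuple[str, str]]], str]   # (sentence, demos) -> translation


def dat_translate(query: str, generate: Generate, select: Select,
                  translate: Translate, m: int = 10, k: int = 4) -> str:
    """Run the four steps in order and return the translation of `query`."""
    candidates = generate(query, m)                   # step 1: m source-side sentences
    chosen = select(query, candidates, k)             # step 2: keep k of them
    demos = [(x, translate(x, [])) for x in chosen]   # step 3: zero-shot translate each
    return translate(query, demos)                    # step 4: translate q with k demos


# Stubs standing in for LLM calls and MMR filtering (assumptions for the demo only).
def fake_generate(query: str, m: int) -> List[str]:
    return [f"{query} variant {i}" for i in range(m)]


def fake_select(query: str, candidates: List[str], k: int) -> List[str]:
    return candidates[:k]


def fake_translate(sentence: str, demos: List[Tuple[str, str]]) -> str:
    return f"<{len(demos)} demos> {sentence.upper()}"


print(dat_translate("hello", fake_generate, fake_select, fake_translate))
# prints: <4 demos> HELLO
```

Passing the three stages in as callables keeps the sketch independent of any particular model API; a real setup would replace the stubs with prompts sent to the LLM.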

Key Designs

Source-side Generation:
- Uses zero-shot prompting to guide the LLM to generate \(m=10\) sentences that are relevant to the query \(q\) yet distinct from each other.
- The relevance and diversity priors are embedded directly into the prompt.
- Design Motivation: Relevant sentences provide translation clues (shared vocabulary/syntax), while diversity avoids redundancy.
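
A minimal sketch of how the two priors might be written into a zero-shot generation prompt, and how the model's reply could be split back into candidate sentences. The prompt wording and the helper names `build_generation_prompt` and `parse_candidates` are assumptions for illustration; the paper's actual prompt is not reproduced here.

```python
import re
from typing import List


def build_generation_prompt(query: str, m: int = 10) -> str:
    """Return a zero-shot prompt asking for m related but mutually distinct sentences."""
    return (
        f"Write {m} English sentences, one per line.\n"
        "Each sentence should be relevant to the sentence below "
        "(for example, sharing vocabulary or structure).\n"
        "The sentences should be diverse and clearly differ from one another.\n\n"
        f"Sentence: {query}\n"
    )


def parse_candidates(raw: str, m: int = 10) -> List[str]:
    """Split model output into lines, strip list markers, drop blanks and duplicates."""
    seen, out = set(), []
    for line in raw.splitlines():
        # Remove leading markers such as "1. ", "2) ", "- " or "* ".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if text and text not in seen:
            seen.add(text)
            out.append(text)
    return out[:m]
```

For example, `parse_candidates("1. The cat sat.\n2) A dog ran.\n\n- The cat sat.\n")` returns `["The cat sat.", "A dog ran."]`.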
Filtering using MMR:
- Relevance metric—n-gram recall: \(\alpha(q, x_i) = \frac{1}{4}\sum_{n=1}^{4} R_n(q, x_i)\)
- Where \(R_n\) is the n-gram recall score between \(q\) and \(x_i\).
- MMR selection formula: \(\arg\max_{x_i \in X \setminus X^*}\left[\alpha(q, x_i) - \frac{\lambda}{|X^*|}\sum_{x_j \in X^*}\alpha(x_j, x_i)\right]\)
- Iteratively selects \(k=4\) sentences, ensuring each chosen sentence is both relevant to \(q\) and distinct from the already selected ones.
- Design Motivation: The \(m\) directly generated sentences may vary in quality; filtering keeps only the candidates that best balance relevance and diversity.
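
The relevance score and the greedy MMR loop above can be sketched as follows. Several details are assumptions made so the example runs: whitespace tokenization with lowercasing, clipped n-gram counts, recall measured against the first argument, a zero penalty while \(X^*\) is still empty (the formula divides by \(|X^*|\)), and a placeholder default for \(\lambda\), which is not the paper's setting.

```python
from collections import Counter
from typing import List


def ngrams(text: str, n: int) -> Counter:
    """Count the n-grams of a lowercased, whitespace-tokenized sentence."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def recall_n(ref: str, cand: str, n: int) -> float:
    """Fraction of ref's n-grams that also occur in cand (clipped counts)."""
    ref_counts, cand_counts = ngrams(ref, n), ngrams(cand, n)
    total = sum(ref_counts.values())
    if total == 0:  # ref is shorter than n tokens
        return 0.0
    return sum((ref_counts & cand_counts).values()) / total


def alpha(ref: str, cand: str) -> float:
    """Average n-gram recall for n = 1..4."""
    return sum(recall_n(ref, cand, n) for n in range(1, 5)) / 4


def mmr_select(query: str, candidates: List[str], k: int = 4,
               lam: float = 0.5) -> List[str]:
    """Greedily pick k candidates, trading relevance against redundancy."""
    selected: List[str] = []
    remaining = list(candidates)

    def score(x: str) -> float:
        if not selected:  # first pick: relevance only
            return alpha(query, x)
        penalty = sum(alpha(s, x) for s in selected) / len(selected)
        return alpha(query, x) - lam * penalty

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the query "the cat sat on the mat", two identical candidates "the cat sat on the mat today", a third candidate "a dog ran in the park", `k=2`, and `lam=1.0`, the second pick is the dog sentence: the duplicate's redundancy penalty cancels its relevance score. With a smaller \(\lambda\) such as 0.5, the duplicate would still win, which shows how \(\lambda\) controls the trade-off.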
Target-side Generation:
- Performs zero-shot translation using the LLM for each selected source sentence.
- This yields \(k\) source-target pairs: \(D^* = \{(x_i^*, \text{LLM}(x_i^*))\}_{i=1}^k\).
Query Translation:
- \(\hat{y} = \text{LLM}(I, D^*, q)\)
- Where \(I\) is the translation instruction.
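
One way the final prompt \((I, D^*, q)\) might be assembled is shown below. The layout (language-labelled lines, blank-line separators) and the function name `build_translation_prompt` are assumptions for illustration, and the target strings are placeholders rather than real translations.

```python
from typing import List, Tuple


def build_translation_prompt(instruction: str, demos: List[Tuple[str, str]],
                             query: str, src_lang: str = "English",
                             tgt_lang: str = "Nepali") -> str:
    """Concatenate the instruction I, the pairs in D*, and the query q."""
    lines = [instruction, ""]
    for src, tgt in demos:
        lines += [f"{src_lang}: {src}", f"{tgt_lang}: {tgt}", ""]
    # The prompt ends with an open target-language slot for the model to fill.
    lines += [f"{src_lang}: {query}", f"{tgt_lang}:"]
    return "\n".join(lines)


demos = [("Good morning.", "<target 1>"), ("See you tomorrow.", "<target 2>")]
prompt = build_translation_prompt(
    "Translate the following sentence from English to Nepali.",
    demos,
    "Good night.",
)
print(prompt)
```

Calling the same function with an empty `demos` list gives a zero-shot prompt, which matches how the selected source sentences are translated in the target-side generation step.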

Extension: Accumulation Setting

Gradually accumulates the generated translation pairs during testing into a demonstration pool.
Subsequent translations can retrieve demonstrations from the pool via R-BM25, reducing real-time generation overhead.
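
The accumulation idea can be sketched with a small in-memory pool. Note the swap: retrieval here ranks by plain unigram overlap with the query, used only as a simple stand-in for R-BM25, and the class name `DemoPool` and its methods are hypothetical.

```python
from typing import List, Tuple


class DemoPool:
    """Grows with generated (source, target) pairs and serves them back on demand."""

    def __init__(self) -> None:
        self.pairs: List[Tuple[str, str]] = []

    def add(self, pairs: List[Tuple[str, str]]) -> None:
        """Store new pairs, skipping exact duplicates."""
        for pair in pairs:
            if pair not in self.pairs:
                self.pairs.append(pair)

    def retrieve(self, query: str, k: int = 4) -> List[Tuple[str, str]]:
        """Return the k pairs whose source side shares the most query words.

        Stand-in for R-BM25: ranks by unigram overlap only.
        """
        q_tokens = set(query.lower().split())

        def overlap(pair: Tuple[str, str]) -> float:
            src_tokens = set(pair[0].lower().split())
            return len(q_tokens & src_tokens) / max(len(q_tokens), 1)

        return sorted(self.pairs, key=overlap, reverse=True)[:k]


pool = DemoPool()
pool.add([("the cat sat", "<target 1>"), ("a dog ran", "<target 2>")])
print(pool.retrieve("the cat ran away", k=1))
# prints: [('the cat sat', '<target 1>')]
```

Once the pool is large enough, a request can be served by `retrieve` alone, avoiding the extra generation and translation calls for that query.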

Key Experimental Results

Main Results (COMET Scores, English \(\rightarrow\) Low-Resource Languages)

Fixed Pairs	Model	Method	Nepali	Khmer	Pashto	Zulu	Swahili
✘	Llama-3.1-8B	Zero-shot	72.1	62.0	53.9	23.3	60.6
✘	Llama-3.1-8B	DAT	74.9	64.4	54.6	22.3	61.8
✘	Llama-3.1-70B	Zero-shot	79.8	72.7	67.5	37.8	72.9
✘	Llama-3.1-70B	DAT	81.1	72.4	68.3	38.3	73.4
✔	Llama-3.1-70B	Few-shot	80.6	51.1	65.7	38.9	71.6
✔	Llama-3.1-70B	DAT	81.5	52.9	68.5	39.2	72.7

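Without fixed pairs: DAT improves on zero-shot in 8 of the 10 model-language settings, with the largest gain on Nepali with the 8B model (+2.8 COMET). The two exceptions are Zulu with 8B (23.3 vs. 22.3) and Khmer with 70B (72.7 vs. 72.4).
With fixed pairs: DAT outperforms the fixed-demonstration few-shot baseline in all five languages, by 0.3 (Zulu) to 2.8 (Pashto) COMET points.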
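
The per-language differences implied by the main results table can be recomputed directly from its rows. The snippet below only copies the table's numbers; the variable and function names are illustrative.

```python
LANGS = ["Nepali", "Khmer", "Pashto", "Zulu", "Swahili"]

# COMET scores copied from the main results table.
ROWS = {
    ("8B", "Zero-shot"): [72.1, 62.0, 53.9, 23.3, 60.6],
    ("8B", "DAT"): [74.9, 64.4, 54.6, 22.3, 61.8],
    ("70B", "Zero-shot"): [79.8, 72.7, 67.5, 37.8, 72.9],
    ("70B", "DAT"): [81.1, 72.4, 68.3, 38.3, 73.4],
    ("70B", "Few-shot (fixed)"): [80.6, 51.1, 65.7, 38.9, 71.6],
    ("70B", "DAT (fixed)"): [81.5, 52.9, 68.5, 39.2, 72.7],
}


def deltas(baseline, method):
    """Per-language score difference (method minus baseline), rounded to 0.1."""
    return [round(m - b, 1) for b, m in zip(baseline, method)]


d_8b = deltas(ROWS[("8B", "Zero-shot")], ROWS[("8B", "DAT")])
d_70b = deltas(ROWS[("70B", "Zero-shot")], ROWS[("70B", "DAT")])
d_fixed = deltas(ROWS[("70B", "Few-shot (fixed)")], ROWS[("70B", "DAT (fixed)")])

for name, d in [("8B", d_8b), ("70B", d_70b), ("70B fixed", d_fixed)]:
    print(name, dict(zip(LANGS, d)))
# 8B:        [2.8, 2.4, 0.7, -1.0, 1.2]
# 70B:       [1.3, -0.3, 0.8, 0.5, 0.5]
# 70B fixed: [0.9, 1.8, 2.8, 0.3, 1.1]
```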

In-Context Demonstration Quality Analysis (Llama-3.1-70B)

Method	Nepali Relevance↑	Nepali Quality↑	Nepali COMET↑	Khmer Relevance↑	Khmer COMET↑
Retrieval(src)	7.5	80.7	80.5	7.5	61.4
Fixed set(pair)	3.9	89.4	80.6	3.9	51.1
DAT	25.9	82.5	81.1	25.9	72.4

DAT's Relevance (25.9) far exceeds Retrieval (7.5) and Fixed set (3.9).
Although the Fixed set has the highest Quality (89.4), its COMET is not the best—relevance is more important than absolute quality.

Ablation Study (Effect of MMR Filtering)

Method	\(m\)	\(k\)	Khmer	Pashto	Swahili
No Filtering₄	4	4	63.8	54.2	61.9
No Filtering₁₀	10	10	63.4	53.6	61.7
DAT	10	4	64.4	54.6	62.3

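On Khmer, 10 unfiltered demonstrations (63.4) perform worse than 4 (63.8), and the same ordering holds for Pashto and Swahili, suggesting that additional unfiltered demonstrations act as noise.
DAT's generate-10-and-select-4 strategy scores highest of the three configurations on all three languages.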

Key Findings

High-quality fixed pairs can "backfire": In English \(\rightarrow\) Khmer translation, adding fixed human-annotated pairs caused Llama-3.1-70B's COMET score to drop by 21.6 points relative to zero-shot (72.7 to 51.1), and the model generated abnormally long, repetitive outputs.
Relevance is the most crucial prior: It is highly predictive of final translation performance, even more so than absolute translation quality.
Self-generated demonstrations are effective: LLMs can create useful translation clues for low-resource languages relying entirely on their own internal capabilities.
Accumulation setting holds potential: Translation quality steadily improves as seed data increases, though it has not yet fully closed the gap with real-time generation.

Highlights & Insights

Resolves the "chicken-or-egg" dilemma: Low-resource languages lack demonstration pools \(\rightarrow\) Let the LLM generate its own demonstration pool.
The discovery of the "backfire phenomenon" is highly valuable: It reveals that high-quality but irrelevant fixed demonstrations can severely impair ICL performance.
Extremely straightforward methodology: The entire pipeline requires only zero-shot prompting + n-gram matching + MMR filtering.
Insight of Relevance > Quality: Challenges the intuition that "higher demonstration quality is always better," emphasizing the match with the query instead.

Limitations & Future Work

Evaluated only on the English \(\rightarrow\) Low-Resource direction: Reverse translation (low-resource \(\rightarrow\) English) was not explored due to expected limitations in source-side generation quality.
Llama-3.1 exclusive: Other multilingual models (e.g., Qwen, Gemma, etc.) were not tested.
Accumulation setting has not converged: An accumulated pool of 500 seed data instances still cannot fully match the performance of on-the-fly generation.
Target side depends entirely on LLM translation: Combining the approach with NMT models to generate higher-quality target-side text is a potential direction.
Computational overhead: Each test input requires multiple LLM calls (source sentence generation + translation + final translation), presenting real-time efficiency challenges.

Relationship to R-BM25 (Agrawal et al., 2023): While R-BM25 retrieves demonstrations from an existing corpus pool, DAT does not require one. The two can be combined: DAT generation \(\rightarrow\) accumulation \(\rightarrow\) R-BM25 retrieval.
Comparison to Self-ICL (Chen et al., 2023): Self-ICL also self-generates demonstrations but targets general tasks; DAT is specifically designed for MT with a dual prior of relevance + diversity.
Insight: In any ICL scenario where demonstrations are scarce, the paradigm of "letting the model generate its own reference first, then using it for assistance" might prove effective.

Rating

Novelty: ⭐⭐⭐⭐ — First to explore in-context example generation in the MT field without relying on any external resources.
Experimental Thoroughness: ⭐⭐⭐⭐ — Five low-resource languages, multi-dimensional analysis (quality/relevance/diversity), and complete ablation studies.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation and in-depth analysis (the discovery and explanation of the "backfire" phenomenon are excellent).
Value: ⭐⭐⭐⭐ — Direct practical value for low-resource translation, with a simple and reproducible method.