GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning¶
Conference: ACL 2025
arXiv: 2410.18702
Code: None
Area: Multilingual Translation
Keywords: Machine Translation, Grammatical Information, In-Context Learning, Low-Resource Languages, Interlinear Glossed Text
TL;DR¶
GrammaMT injects grammatical information from Interlinear Glossed Text (IGT) into few-shot prompts to improve LLM machine translation. It gains 12+ BLEU on average on endangered languages and yields consistent improvements on medium- and high-resource languages.
Background & Motivation¶
Background: LLMs perform exceptionally well in machine translation for high-resource languages, but translation quality for low-resource languages, especially endangered ones, remains poor. A practical way to use existing LLMs for these languages should satisfy three conditions: (i) no training is required; (ii) only a small amount of data is needed; (iii) the data is easy to collect.
Limitations of Prior Work: Existing methods either rely on fine-tuning on large-scale corpora (infeasible for low-resource languages) or require comprehensive resources such as grammar books and dictionaries (e.g., LingoLLM), which are expensive to acquire. Standard few-shot prompting is simple but yields limited performance.
Key Challenge: Demand for translating low-resource and endangered languages is urgent, yet available data and linguistic resources are extremely scarce. The core challenge is to maximize translation quality using the smallest and most easily accessible amount of linguistic information.
Goal: Design a training-free method requiring only a small number of linguistic annotations to enhance the multilingual translation capability of LLMs for low-resource languages.
Key Insight: Use Interlinear Glossed Text (IGT), a common morpheme-level annotation format in linguistics that pairs each source sentence with a gloss line and a target translation, to inject grammatical information into LLM prompts.
Core Idea: Embed IGT grammatical annotations into few-shot prompts, trading a small annotation cost for significant gains in low-resource translation.
Method¶
Overall Architecture¶
GrammaMT proposes three prompting strategies, all training-free and requiring only 21 IGT examples:
Key Designs¶
1. Gloss-shot¶
Add IGT annotation triples (source sentence + gloss + translation) to the few-shot examples, so that the LLM can learn source-language structure from grammatically annotated demonstrations. The test sentence itself is translated directly, without a gloss.
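A minimal sketch of how such a prompt could be assembled. The `IGTExample` class, the function name, the field labels, and the instruction wording are illustrative assumptions, not the paper's template.

```python
from dataclasses import dataclass


@dataclass
class IGTExample:
    """One interlinear glossed text triple."""
    source: str       # sentence in the source language
    gloss: str        # morpheme-level gloss line
    translation: str  # target-language translation


def build_gloss_shot_prompt(examples, source_sentence, src_lang, tgt_lang="English"):
    """Build a few-shot prompt whose demonstrations include the gloss line.

    Labels and instruction text are assumptions made for illustration.
    """
    lines = [f"Translate the following {src_lang} sentence into {tgt_lang}.", ""]
    for ex in examples:
        lines += [
            f"{src_lang}: {ex.source}",
            f"Gloss: {ex.gloss}",
            f"{tgt_lang}: {ex.translation}",
            "",
        ]
    # The test sentence carries no gloss: the model goes straight to the translation.
    lines += [f"{src_lang}: {source_sentence}", f"{tgt_lang}:"]
    return "\n".join(lines)


demo = [IGTExample("(yeye) alimwona (yeye)", "3SG-PST-see-FV 3SG", "S/he saw him/her")]
prompt = build_gloss_shot_prompt(demo, "alimwona", "Swahili")
print(prompt)
```

In the setting described here, `examples` would hold the 21 IGT triples; a single demonstration is used above only to keep the sketch short.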
2. Chain-gloss¶
Similar to Chain-of-Thought, the LLM is required to first generate the gloss of the input sentence and then produce the translation.
This increases interpretability, but relies on the LLM's own ability to generate glosses.
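The gloss-then-translate flow can be sketched as a prompt that ends on an open gloss slot, plus a parser for the completion. Both function names, the labels, and the assumed completion layout (gloss line followed by a labelled translation line) are hypothetical choices for illustration.

```python
def build_chain_gloss_prompt(examples, source_sentence, src_lang, tgt_lang="English"):
    """examples: list of (source, gloss, translation) tuples."""
    lines = [
        f"First gloss the {src_lang} sentence, then translate it into {tgt_lang}.",
        "",
    ]
    for source, gloss, translation in examples:
        lines += [
            f"{src_lang}: {source}",
            f"Gloss: {gloss}",
            f"{tgt_lang}: {translation}",
            "",
        ]
    # End on an open "Gloss:" slot so the model writes the gloss before the translation.
    lines += [f"{src_lang}: {source_sentence}", "Gloss:"]
    return "\n".join(lines)


def parse_chain_gloss_output(completion, tgt_lang="English"):
    """Split a completion shaped like: gloss line, then '<tgt_lang>: translation'.

    Returns (gloss, translation); translation is None if the label is missing.
    """
    marker = f"{tgt_lang}:"
    gloss, found, rest = completion.partition(marker)
    if not found:
        return gloss.strip(), None
    # Keep only the first line after the marker in case the model keeps generating.
    stripped = rest.strip()
    translation = stripped.splitlines()[0].strip() if stripped else ""
    return gloss.strip(), translation


chain_prompt = build_chain_gloss_prompt(
    [("(yeye) alimwona (yeye)", "3SG-PST-see-FV 3SG", "S/he saw him/her")],
    "alimwona",
    "Swahili",
)
```

Returning the generated gloss alongside the translation is what makes the intermediate step inspectable, which is the interpretability benefit noted above.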
3. Model-gloss¶
Use an external gloss generation model (such as GlossLM) to produce the gloss for the input sentence, avoiding the inaccuracies of the LLM's self-generated glosses. The predicted gloss is placed in the prompt alongside the input sentence.
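A sketch of the same idea with the gloss supplied from outside the LLM. The `glosser` argument is any callable mapping a sentence to a gloss line; `toy_glosser` below is a hypothetical lookup-table stand-in, not the GlossLM interface, and the prompt wording is again an assumption.

```python
from typing import Callable, Sequence, Tuple


def build_model_gloss_prompt(
    examples: Sequence[Tuple[str, str, str]],
    source_sentence: str,
    glosser: Callable[[str], str],
    src_lang: str,
    tgt_lang: str = "English",
) -> str:
    """Build a prompt where the test sentence comes with an externally predicted gloss."""
    lines = [f"Translate the following {src_lang} sentence into {tgt_lang} using its gloss.", ""]
    for source, gloss, translation in examples:
        lines += [
            f"{src_lang}: {source}",
            f"Gloss: {gloss}",
            f"{tgt_lang}: {translation}",
            "",
        ]
    # The gloss of the test sentence comes from the external model, not from the LLM.
    predicted_gloss = glosser(source_sentence)
    lines += [
        f"{src_lang}: {source_sentence}",
        f"Gloss: {predicted_gloss}",
        f"{tgt_lang}:",
    ]
    return "\n".join(lines)


def toy_glosser(sentence: str) -> str:
    """Hypothetical stand-in for a gloss model: a one-entry lookup table."""
    table = {"(yeye) alimwona (yeye)": "3SG-PST-see-FV 3SG"}
    return table.get(sentence, "???")


model_prompt = build_model_gloss_prompt([], "(yeye) alimwona (yeye)", toy_glosser, "Swahili")
print(model_prompt)
```

Passing the glosser as a parameter keeps the prompt builder independent of whichever gloss model is available for the language.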
IGT Format Description¶
IGT annotation is a standard format for linguistic description. For example, in Swahili:
- Source: (yeye) alimwona (yeye)
- Gloss: 3SG-PST-see-FV 3SG (uppercase = grammatical morpheme, lowercase = lexical morpheme)
- Translation: S/he saw him/her
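The uppercase/lowercase convention can be made concrete with a small sketch that splits a gloss line into morpheme labels and tags each one. The function name is hypothetical, and the rule is a simplification: only hyphen boundaries are handled, and a lexical gloss written entirely in capitals (such as English "I") would be mislabelled.

```python
def classify_gloss(gloss_line):
    """Return, per word, a list of (label, kind) pairs with kind 'gram' or 'lex'."""
    result = []
    for word in gloss_line.split():
        morphemes = []
        for label in word.split("-"):
            # Grammatical tags are written in capitals, possibly with digits (e.g. 3SG).
            letters = [c for c in label if c.isalpha()]
            is_gram = bool(letters) and all(c.isupper() for c in letters)
            morphemes.append((label, "gram" if is_gram else "lex"))
        result.append(morphemes)
    return result


for word in classify_gloss("3SG-PST-see-FV 3SG"):
    print(word)
```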
Experimental Setup¶
- Models: Meta-Llama-3-70B-Instruct (4-bit quantized) is the primary model; the 8B Llama variant, Mixtral-8x22B, and GPT-4o are also tested
- N-shot: All strategies use 21 examples (found to be optimal in the ablation study)
- Evaluation Metrics: BLEU (SacreBLEU), chrF++, xCOMET-XXL
- Decoding: Greedy decoding, temperature=1
Key Experimental Results¶
Main Results — Endangered/Unseen Languages (SIGMORPHON 2023)¶
| Method | BLEU Avg. | chrF++ Avg. | xCOMET Avg. |
|---|---|---|---|
| NLLB-200 | 0.55 | 13.80 | 12.82 |
| zero-shot | 0.88 | 18.05 | 15.21 |
| few-shot | 3.94 | 21.85 | 16.76 |
| gloss-shot | 3.41 | 22.50 | 18.21 |
| chain-gloss | 4.25 | 20.84 | 16.78 |
| model-gloss | 15.97 | 41.45 | 40.83 |
| LingoLLM (GPT-4) | 14.1 | — | — |
model-gloss improves by an average of 12.03 BLEU compared to few-shot, and outperforms LingoLLM, which utilizes more linguistic resources.
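The gain quoted above follows directly from the BLEU column of the table. The sketch below recomputes the difference of each method from the few-shot baseline; the scores are copied from the table, and the variable names are arbitrary.

```python
# Average BLEU on SIGMORPHON 2023, copied from the table above.
bleu = {
    "NLLB-200": 0.55,
    "zero-shot": 0.88,
    "few-shot": 3.94,
    "gloss-shot": 3.41,
    "chain-gloss": 4.25,
    "model-gloss": 15.97,
}

baseline = bleu["few-shot"]
# Difference of every other method from the few-shot baseline, in BLEU points.
gains = {m: round(score - baseline, 2) for m, score in bleu.items() if m != "few-shot"}

for method, gain in gains.items():
    print(f"{method:12s} {gain:+.2f}")
```

On this metric gloss-shot comes out slightly below few-shot (-0.53 BLEU), even though the same table shows it ahead of few-shot on chrF++ and xCOMET.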
Main Results — Low-Resource Languages (GlossLM)¶
| Method | BLEU Avg. (5 lang) | chrF++ Avg. | xCOMET Avg. |
|---|---|---|---|
| few-shot | 16.69 | 36.96 | 34.52 |
| gloss-shot | 16.39 | 36.88 | 35.65 |
| chain-gloss | 17.06 | 37.10 | 35.77 |
Most significant improvement on Yoruba: few-shot 11.98 → gloss-shot 16.32 (+4.34 BLEU).
Main Results — Medium-to-High Resource Languages¶
| Method | BLEU Avg. (7 lang) |
|---|---|
| few-shot | 18.95 |
| gloss-shot | 18.61 |
| chain-gloss | 19.75 |
chain-gloss outperforms few-shot by 2.5+ BLEU on Urdu and Russian.
Ablation Study¶
| Analysis | Key Findings |
|---|---|
| N-shot count | N=21 is optimal; excessive increases lead to a performance plateau |
| Gloss Accuracy | Llama achieves only 21% word accuracy on Tsez, whereas GlossLM reaches 88% → directly explaining the advantage of model-gloss |
| Oracle Experiment | Using gold glosses improves performance by an average of 17.46 BLEU (±6.6); zero-gloss also outperforms few-shot → indicating that the gloss itself is highly valuable |
| Effect of Grammatical Annotations | Performance degrades when grammatical annotations are removed and only lexical morphemes are kept → indicating that glosses contribute more than word-by-word translation |
| Cross-model Generalization | model-gloss achieves 18.69 average BLEU on SIGMORPHON with GPT-4o; the smaller Llama-8B model is also effective |
| Out-of-Domain Generalization (FLORES) | gloss-shot performs best (avg 21.64 vs few-shot 20.69); chain-gloss performs unstably on out-of-domain data |
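Two of the analyses above lend themselves to a short sketch: word-level gloss accuracy, and the ablation that keeps only lexical morphemes. Both functions are assumptions about how such measurements could be implemented (position-wise exact match, and dropping every label without a lowercase letter); the paper's exact definitions may differ.

```python
def gloss_word_accuracy(predicted, gold):
    """Fraction of gold gloss words reproduced exactly at the same position."""
    pred_words, gold_words = predicted.split(), gold.split()
    if not gold_words:
        return 0.0
    # zip stops at the shorter sequence, so missing words count as errors.
    correct = sum(p == g for p, g in zip(pred_words, gold_words))
    return correct / len(gold_words)


def lexical_only(gloss_line):
    """Drop grammatical tags (labels with no lowercase letter), keep lexical morphemes."""
    words = []
    for word in gloss_line.split():
        kept = [m for m in word.split("-") if any(c.islower() for c in m)]
        if kept:
            words.append("-".join(kept))
    return " ".join(words)


gold = "3SG-PST-see-FV 3SG"
print(gloss_word_accuracy("3SG-PST-see-FV 2SG", gold))  # one of two words matches
print(lexical_only(gold))
```

For the Swahili example, stripping the grammatical tags leaves only `see`, which illustrates how much information the ablation removes from the gloss line.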
Key Findings¶
- model-gloss performs best on endangered languages: relies on the accuracy of an external gloss model but offers massive improvements
- chain-gloss is more practical for low/medium/high-resource languages: does not rely on an external model and yields improvements in most languages
- gloss-shot is the most robust in out-of-domain settings: only uses glosses as prompt examples without generating them, offering the widest applicability
- Oracle experiments reveal the upper bound: accurate glosses can bring an improvement of 17+ BLEU, showing that the development of automatic gloss generation models is highly worthwhile
Highlights & Insights¶
- Extremely low-threshold linguistic enhancement: requires only 21 IGT triples, and such annotations are very common in linguistic descriptions, making them much easier to obtain than grammar books and dictionaries
- Linguistic instantiation of Chain-of-Thought: chain-gloss is a natural variant of CoT in translation tasks, and grammatical annotations provide a more explicit structure than simply "letting the model think step-by-step"
- Three strategies covering different scenarios: model-gloss is suitable for languages with gloss models, chain-gloss is suitable for general scenarios, and gloss-shot is suitable for out-of-domain transfer
Limitations & Future Work¶
- Primarily evaluates translation into English (→en); the reverse direction (en→) is only explored preliminarily
- Interpretability of gloss-shot is limited; it remains unclear how glosses in unrelated examples affect the translation
- chain-gloss is unstable on out-of-domain data, likely because of a distribution mismatch between the short glossed sentences in GlossLM and the long sentences in FLORES
- Limited by the availability of IGT data, it does not cover all language families
Related Work & Insights¶
- LingoLLM (Zhang et al., 2024): uses grammar books + dictionaries + morphological analyzers → resource requirements are significantly higher than GrammaMT
- GlossLM (Ginn et al., 2024): gloss corpus of 250K sentences across 1800 languages → directly supporting the model-gloss strategy
- Tanzer et al. (2024): grammar book-assisted translation → similar idea but high resource requirements
- Chain-of-Thought (Wei et al., 2022): GrammaMT's chain-gloss is essentially CoT in the linguistic domain
Rating¶
- Novelty: ⭐⭐⭐⭐ - Using IGT, a standard linguistic tool, to enhance LLM translation is a novel and natural idea
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ - 3 datasets × 16 languages × 4 models × multiple ablations, highly comprehensive
- Writing Quality: ⭐⭐⭐⭐ - Clear structure and thorough, well-designed ablation analysis
- Value: ⭐⭐⭐⭐⭐ - Significant practical value for endangered language translation; the method is simple, effective, and has a very low barrier to entry