GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning

Conference: ACL 2025
arXiv: 2410.18702
Code: None
Area: Multilingual Translation
Keywords: Machine Translation, Grammatical Information, In-Context Learning, Low-Resource Languages, Interlinear Glossed Text

TL;DR

GrammaMT leverages grammatical information from Interlinear Glossed Text (IGT) to enhance few-shot machine translation with LLMs, achieving an average improvement of 12+ BLEU on endangered languages and consistent improvements on medium- and high-resource languages.

Background & Motivation

Background: LLMs perform exceptionally well in machine translation for high-resource languages, but translation quality for low-resource languages, especially endangered ones, remains poor. A practical way to adapt existing LLMs to low-resource translation should satisfy three conditions: (i) no training is required; (ii) only a small amount of data is needed; (iii) the data is easy to collect.

Limitations of Prior Work: Existing methods either rely on fine-tuning on large-scale corpora (infeasible for low-resource languages) or require comprehensive resources such as grammar books and dictionaries (e.g., LingoLLM), which are expensive to acquire. Standard few-shot prompting is simple but yields limited performance.

Key Challenge: Demand for translating low-resource and endangered languages is urgent, yet available data and linguistic resources are extremely scarce. The core challenge is to maximize translation quality using as little linguistic information as possible, in the form that is easiest to obtain.

Goal: Design a training-free method requiring only a small number of linguistic annotations to enhance the multilingual translation capability of LLMs for low-resource languages.

Key Insight: Use Interlinear Glossed Text (IGT), a common morpheme-level annotation format in linguistics consisting of triples (source sentence, gloss line, target translation), to inject grammatical information into LLM prompts.

Core Idea: Embed IGT grammatical annotations into few-shot prompts, trading a small linguistic annotation cost for significant improvements in low-resource translation.

Method

Overall Architecture

GrammaMT proposes three prompting strategies (gloss-shot, chain-gloss, and model-gloss), all training-free and requiring only 21 IGT examples.

Key Designs

1. Gloss-shot

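Add IGT triples (source sentence + gloss + translation) as the few-shot examples, so that the LLM can pick up source-language structure from glossed demonstrations. Here each \(\mathbf{g}_i\) is one glossed example, \(\mathbf{x}\) is the input sentence, and \(\mathbf{y}\) is its translation: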

\[(\mathbf{g}_1, \cdots, \mathbf{g}_N, \mathbf{x}) \rightarrow \mathbf{y}\]
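The mapping above can be sketched as a prompt builder. This is a minimal illustration, not the authors' template: the `IGTExample` class, the `build_gloss_shot_prompt` function, and the exact field labels ("Gloss:", the instruction line) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IGTExample:
    """One IGT triple: source sentence, gloss line, target translation."""
    source: str
    gloss: str
    translation: str


def build_gloss_shot_prompt(examples: List[IGTExample], source: str,
                            src_lang: str, tgt_lang: str = "English") -> str:
    """Format N glossed demonstrations followed by the unglossed test sentence."""
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for ex in examples:
        lines += [
            "",
            f"{src_lang}: {ex.source}",
            f"Gloss: {ex.gloss}",
            f"{tgt_lang}: {ex.translation}",
        ]
    # The test input carries no gloss: the model goes straight to the translation.
    lines += ["", f"{src_lang}: {source}", f"{tgt_lang}:"]
    return "\n".join(lines)


# Demonstration taken from the Swahili IGT example in this summary.
demo = [IGTExample("(yeye) alimwona (yeye)", "3SG-PST-see-FV 3SG", "S/he saw him/her")]
prompt = build_gloss_shot_prompt(demo, "alimwona", "Swahili")
print(prompt)
```

In the paper's setting the `examples` list would hold 21 triples; only the demonstrations are glossed, never the test sentence.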

2. Chain-gloss

Analogous to Chain-of-Thought prompting, the LLM first generates the gloss \(\mathbf{y}_g\) of the input sentence and then produces the translation:

\[(\mathbf{g}_1, \cdots, \mathbf{g}_N, \mathbf{x}) \rightarrow (\mathbf{y}_g, \mathbf{y})\]

This increases interpretability, but relies on the LLM's own ability to generate glosses.
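A rough sketch of how chain-gloss could be wired up: the prompt ends on an open gloss slot, and the completion is split into the generated gloss and the translation. The function names, prompt wording, and the assumption that the model echoes the "English:" label are all illustrative choices, not details confirmed by the paper.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (source, gloss, translation)


def build_chain_gloss_prompt(examples: List[Triple], source: str,
                             src_lang: str, tgt_lang: str = "English") -> str:
    """Glossed demonstrations, then the test sentence with an open gloss slot."""
    blocks = [f"Gloss each {src_lang} sentence, then translate it to {tgt_lang}."]
    for src, gloss, tgt in examples:
        blocks.append(f"{src_lang}: {src}\nGloss: {gloss}\n{tgt_lang}: {tgt}")
    # Ending on "Gloss:" asks the model to write y_g before y.
    blocks.append(f"{src_lang}: {source}\nGloss:")
    return "\n\n".join(blocks)


def parse_chain_gloss_output(completion: str,
                             tgt_lang: str = "English") -> Tuple[str, str]:
    """Split a completion into (generated gloss, translation)."""
    gloss_part, found, rest = completion.partition(f"{tgt_lang}:")
    if not found:
        raise ValueError("completion has no translation line")
    rest = rest.strip()
    # Keep only the first line, in case the model keeps generating examples.
    translation = rest.splitlines()[0].strip() if rest else ""
    return gloss_part.strip(), translation


demo = [("(yeye) alimwona (yeye)", "3SG-PST-see-FV 3SG", "S/he saw him/her")]
prompt = build_chain_gloss_prompt(demo, "alimwona", "Swahili")
# A made-up completion standing in for real model output.
gloss, translation = parse_chain_gloss_output(
    " 3SG-PST-see-FV\nEnglish: S/he saw him/her\n\nSwahili: ...")
print(gloss, "|", translation)
```

Because the gloss is produced by the same LLM, any glossing error propagates directly into the translation step, which is the weakness model-gloss is meant to address.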

3. Model-gloss

Use an external gloss generation model (such as GlossLM) to produce the gloss \(\mathbf{y}_{ge}\) of the input sentence, avoiding the inaccuracies of the LLM's self-generated glosses:

\[(\mathbf{g}_1, \cdots, \mathbf{g}_N, \mathbf{x}, \mathbf{y}_{ge}) \rightarrow \mathbf{y}\]
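A minimal sketch of model-gloss, assuming the external glosser is exposed as a plain callable from sentence to gloss line. The `toy_glosser` below is a hypothetical dictionary lookup used only so the example runs; it does not reflect GlossLM's actual interface, and the prompt wording is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (source, gloss, translation)


def build_model_gloss_prompt(examples: List[Triple], source: str,
                             glosser: Callable[[str], str],
                             src_lang: str, tgt_lang: str = "English") -> str:
    """Glossed demonstrations plus the test sentence with an external gloss."""
    blocks = [f"Translate from {src_lang} to {tgt_lang} using the gloss."]
    for src, gloss, tgt in examples:
        blocks.append(f"{src_lang}: {src}\nGloss: {gloss}\n{tgt_lang}: {tgt}")
    # y_ge comes from the external model, so the LLM only has to translate.
    external_gloss = glosser(source)
    blocks.append(f"{src_lang}: {source}\nGloss: {external_gloss}\n{tgt_lang}:")
    return "\n\n".join(blocks)


# Hypothetical stand-in for a real gloss generation model.
_TOY_GLOSSES: Dict[str, str] = {"alimwona": "3SG-PST-see-FV"}


def toy_glosser(sentence: str) -> str:
    """Look each word up in a tiny table; unknown words become '???'."""
    return " ".join(_TOY_GLOSSES.get(word, "???") for word in sentence.split())


demo = [("(yeye) alimwona (yeye)", "3SG-PST-see-FV 3SG", "S/he saw him/her")]
prompt = build_model_gloss_prompt(demo, "alimwona", toy_glosser, "Swahili")
print(prompt)
```

Keeping the glosser behind a callable makes it easy to swap in gold glosses for an oracle run or a different gloss model without touching the prompt code.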

IGT Format Description

IGT annotation is a standard format for linguistic description. A Swahili example:

  • Source: (yeye) alimwona (yeye)
  • Gloss: 3SG-PST-see-FV 3SG (uppercase = grammatical morpheme, lowercase = lexical morpheme)
  • Translation: S/he saw him/her
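To make the uppercase/lowercase convention concrete, the sketch below splits a gloss line into words and morphemes and labels each morpheme. It is a simplification: it assumes morphemes are separated only by hyphens and treats any morpheme whose letters are all uppercase as grammatical. The function names and the "GRAM"/"LEX" labels are illustrative.

```python
from typing import List, Tuple


def is_grammatical(morpheme: str) -> bool:
    """True when every letter in the morpheme is uppercase (e.g. 3SG, PST)."""
    letters = [c for c in morpheme if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def classify_gloss(gloss_line: str) -> List[List[Tuple[str, str]]]:
    """Return, per gloss word, a list of (morpheme, 'GRAM' or 'LEX') pairs."""
    result = []
    for word in gloss_line.split():
        tagged = [(m, "GRAM" if is_grammatical(m) else "LEX")
                  for m in word.split("-")]
        result.append(tagged)
    return result


tagged = classify_gloss("3SG-PST-see-FV 3SG")
for word in tagged:
    print(word)
```

For the Swahili gloss, only `see` is labeled lexical; the person, tense, and final-vowel markers are all grammatical.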

Experimental Setup

  • Models: Meta-Llama-3-70B-Instruct (4-bit quantized) is the primary model; the 8B Llama variant, Mixtral-8x22B, and GPT-4o are also tested
  • N-shot: All strategies use 21 examples (found to be optimal in the ablation study)
  • Evaluation Metrics: BLEU (SacreBLEU), chrF++, xCOMET-XXL
  • Decoding: Greedy decoding, temperature=1

Key Experimental Results

Main Results — Endangered/Unseen Languages (SIGMORPHON 2023)

| Method | BLEU Avg. | chrF++ Avg. | xCOMET Avg. |
|---|---|---|---|
| NLLB-200 | 0.55 | 13.80 | 12.82 |
| zero-shot | 0.88 | 18.05 | 15.21 |
| few-shot | 3.94 | 21.85 | 16.76 |
| gloss-shot | 3.41 | 22.50 | 18.21 |
| chain-gloss | 4.25 | 20.84 | 16.78 |
| model-gloss | 15.97 | 41.45 | 40.83 |
| LingoLLM (GPT-4) | 14.1 | - | - |

model-gloss improves by an average of 12.03 BLEU compared to few-shot, and outperforms LingoLLM, which utilizes more linguistic resources.
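The headline gain can be checked directly from the average BLEU scores reported for SIGMORPHON 2023. The snippet below only redoes that arithmetic; the dictionary simply restates the reported averages, and the LingoLLM figure is the single BLEU value given for it (note that it uses a different base model, GPT-4).

```python
# Average BLEU on SIGMORPHON 2023, as reported in this summary.
bleu_avg = {
    "NLLB-200": 0.55,
    "zero-shot": 0.88,
    "few-shot": 3.94,
    "gloss-shot": 3.41,
    "chain-gloss": 4.25,
    "model-gloss": 15.97,
}

baseline = bleu_avg["few-shot"]
# Difference of each method from the few-shot baseline, rounded to 2 decimals.
deltas = {method: round(score - baseline, 2)
          for method, score in bleu_avg.items() if method != "few-shot"}

for method, delta in deltas.items():
    print(f"{method:12s} {delta:+.2f} BLEU vs few-shot")

# Gap between model-gloss and the reported LingoLLM (GPT-4) score of 14.1.
lingo_gap = round(bleu_avg["model-gloss"] - 14.1, 2)
print(f"model-gloss vs LingoLLM: {lingo_gap:+.2f} BLEU")
```

On these averages, only model-gloss moves BLEU substantially; gloss-shot is slightly below few-shot on BLEU while being ahead on chrF++ and xCOMET.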

Main Results — Low-Resource Languages (GlossLM)

| Method | BLEU Avg. (5 lang) | chrF++ Avg. | xCOMET Avg. |
|---|---|---|---|
| few-shot | 16.69 | 36.96 | 34.52 |
| gloss-shot | 16.39 | 36.88 | 35.65 |
| chain-gloss | 17.06 | 37.10 | 35.77 |

The largest improvement is on Yoruba: few-shot 11.98 → gloss-shot 16.32 (+4.34 BLEU).

Main Results — Medium-to-High Resource Languages

| Method | BLEU Avg. (7 lang) |
|---|---|
| few-shot | 18.95 |
| gloss-shot | 18.61 |
| chain-gloss | 19.75 |

chain-gloss outperforms few-shot by 2.5+ BLEU on Urdu and Russian.

Ablation Study

| Analysis | Key Findings |
|---|---|
| N-shot count | N=21 is optimal; adding more examples beyond that leads to a performance plateau |
| Gloss Accuracy | Llama achieves only 21% word accuracy on Tsez, whereas GlossLM reaches 88% → directly explaining the advantage of model-gloss |
| Oracle Experiment | Using gold gloss improves performance by an average of 17.46 BLEU (±6.6); zero-gloss also outperforms few-shot → proving that gloss itself is highly valuable |
| Effect of Grammatical Annotations | Performance degrades when grammatical annotations are removed to keep only lexical morphemes → indicating grammatical information is more than word-to-word translation |
| Cross-model Generalization | model-gloss achieves 18.69 average BLEU on SIGMORPHON with GPT-4o; the smaller Llama-8B model is also effective |
| Out-of-Domain Generalization (FLORES) | gloss-shot performs best (avg 21.64 vs few-shot 20.69); chain-gloss performs unstably on out-of-domain data |
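Two of these analyses are easy to illustrate. The sketch below shows one plausible definition of word-level gloss accuracy (position-wise exact match over gloss words) and one way to build the lexical-only glosses used in the grammatical-annotation ablation. Both are assumptions made for illustration; the paper's exact metric definition and stripping procedure may differ.

```python
def word_accuracy(predicted: str, gold: str) -> float:
    """Fraction of gold gloss words matched exactly at the same position."""
    pred_words = predicted.split()
    gold_words = gold.split()
    if not gold_words:
        raise ValueError("gold gloss is empty")
    # zip stops at the shorter sequence, so missing predictions count as wrong.
    correct = sum(p == g for p, g in zip(pred_words, gold_words))
    return correct / len(gold_words)


def is_grammatical(morpheme: str) -> bool:
    """True when every letter in the morpheme is uppercase (e.g. 3SG, PST)."""
    letters = [c for c in morpheme if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def lexical_only(gloss_line: str) -> str:
    """Drop grammatical morphemes, keeping only lexical ones per word."""
    kept_words = []
    for word in gloss_line.split():
        lexical = [m for m in word.split("-") if not is_grammatical(m)]
        if lexical:  # skip words that were entirely grammatical
            kept_words.append("-".join(lexical))
    return " ".join(kept_words)


gold = "3SG-PST-see-FV 3SG"
acc = word_accuracy("3SG-see 3SG", gold)  # first word wrong, second right
stripped = lexical_only(gold)
print(acc, stripped)
```

With grammatical morphemes removed, the Swahili gloss collapses to the single word `see`, which shows how much of the sentence's structure (person, tense, object) a purely lexical gloss throws away.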

Key Findings

  • model-gloss performs best on endangered languages: relies on the accuracy of an external gloss model but offers massive improvements
  • chain-gloss is more practical for low/medium/high-resource languages: does not rely on an external model and yields improvements in most languages
  • gloss-shot is the most robust in out-of-domain settings: only uses glosses as prompt examples without generating them, offering the widest applicability
  • Oracle experiments reveal the upper bound: accurate glosses can bring an improvement of 17+ BLEU, showing that the development of automatic gloss generation models is highly worthwhile

Highlights & Insights

  • Extremely low-threshold linguistic enhancement: requires only 21 IGT triples, and such annotations are very common in linguistic descriptions, making them much easier to obtain than grammar books and dictionaries
  • Linguistic instantiation of Chain-of-Thought: chain-gloss is a natural variant of CoT in translation tasks, and grammatical annotations provide a more explicit structure than simply "letting the model think step-by-step"
  • Three strategies covering different scenarios: model-gloss is suitable for languages with gloss models, chain-gloss is suitable for general scenarios, and gloss-shot is suitable for out-of-domain transfer

Limitations & Future Work

  • The evaluation focuses on translation into English (→en); the reverse direction (en→) is explored only preliminarily
  • Interpretability of gloss-shot is limited; it remains unclear how glosses in unrelated examples affect the translation
  • chain-gloss is unstable on out-of-domain data, likely because of a distribution mismatch between the short glossed sentences in GlossLM and the long sentences in FLORES
  • Because IGT data is available for only some languages, the evaluation does not cover all language families

Related Work

  • LingoLLM (Zhang et al., 2024): uses grammar books + dictionaries + morphological analyzers → resource requirements are significantly higher than GrammaMT
  • GlossLM (Ginn et al., 2024): gloss corpus of 250K sentences across 1800 languages → directly supporting the model-gloss strategy
  • Tanzer et al. (2024): grammar book-assisted translation → similar idea but high resource requirements
  • Chain-of-Thought (Wei et al., 2022): GrammaMT's chain-gloss is essentially CoT in the linguistic domain

Rating

  • Novelty: ⭐⭐⭐⭐ - Using IGT, a standard linguistic tool, to enhance LLM translation is a novel and natural idea
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ - 3 datasets × 16 languages × 4 models × multiple ablations, highly comprehensive
  • Writing Quality: ⭐⭐⭐⭐ - Clear structure and thorough, well-designed ablation analysis
  • Value: ⭐⭐⭐⭐⭐ - Significant practical value for endangered language translation; the method is simple, effective, and has a very low threshold