Accessible Machine Translation Evaluation For Low-Resource Languages

Conference: ACL 2025
Code: None
Area: Multilingual Translation
Keywords: Machine Translation Evaluation, Low-Resource Languages, Accessibility Evaluation, Translation Quality Estimation, Multilingual NLP

TL;DR

To address the evaluation dilemma of machine translation in low-resource languages, this paper proposes an accessible evaluation framework that does not rely on high-quality reference translations or large-scale annotated data, enabling effective translation quality assessment for resource-constrained languages.

Background & Motivation

Background: Machine translation (MT) quality evaluation is a core problem in NLP. Current mainstream evaluation methods are divided into two main categories: reference-based metrics (such as BLEU, COMET) which require high-quality human reference translations, and reference-free quality estimation (QE) models (such as COMET-QE) which require large amounts of human-annotated quality scores for training. Both methods demand significant human resources.

Limitations of Prior Work: Low-resource languages (such as African languages, Southeast Asian minority languages, and Native American languages) face severe "accessibility" problems in translation evaluation: they lack high-quality reference translations, annotated data for training QE models, and professional evaluators with native-level proficiency. N-gram metrics such as BLEU perform particularly poorly on morphologically rich low-resource languages. Neural metrics such as COMET do better, but they depend on the target language being covered in their training data, and their effectiveness drops significantly on unseen languages.

Key Challenge: There is a fundamental conflict between the performance of evaluation methods and the resources required—the best-performing evaluation methods require the most resources (large-scale annotations, reference translations), whereas the low-resource languages that need evaluation tools the most are precisely those lacking these resources.

Goal: Design an "accessible" translation evaluation scheme for low-resource languages that provides the most accurate quality assessment possible while minimizing the resources required.

Key Insight: The authors argue that "accessibility" should be a core design principle of translation evaluation. They propose lowering the evaluation barrier along three dimensions: reducing reliance on reference translations, on annotated data, and on language experts.

Core Idea: Build a translation evaluation toolkit usable for low-resource languages through a combination of cross-lingual transfer, a lightweight human feedback collection protocol, and multi-granularity automatic evaluation metrics.

Method

Overall Architecture

The overall framework consists of three tiers of evaluation schemes, ordered from lowest to highest resource requirements: (1) Zero-resource evaluation—a cross-lingual transfer scheme that requires absolutely no target language data; (2) Low-resource evaluation—a few-shot adaptation scheme requiring only dozens of annotated samples; (3) Community-engaged evaluation—lowering the professional barrier through gamified and simplified annotation protocols.

Key Designs

  1. Cross-Lingual Zero-Shot Quality Estimation:

    • Function: Evaluate translation quality without using any data from the target low-resource language.
    • Mechanism: Train a QE model on top of a multilingual pretrained model (such as XLM-R or mBERT) using large-scale QE annotations from high-resource language pairs (such as English-German and English-Chinese), then transfer it zero-shot to low-resource language pairs. The key innovation is a language-agnostic feature alignment strategy applied during training: adversarial training (for example, with a gradient reversal layer) encourages the model to learn translation quality representations that do not depend on the specific language, so that it can generalize to unseen target languages.
    • Design Motivation: Most low-resource languages completely lack QE annotated data; cross-lingual transfer is the only viable zero-cost solution.
  2. Few-Shot Adaptation Protocol:

    • Function: Quickly adapt to target low-resource languages using a minimal amount of annotated data (20-50 samples).
    • Mechanism: Design a simplified annotation protocol: instead of giving precise quality scores (such as a DA score of 0-100), evaluators only perform relative pairwise ranking (is translation A better or worse than B?) or binary classification (is the translation acceptable or not?). This significantly lowers both the difficulty of annotation and the expertise required. The small set of collected annotations is then used to adapt the pretrained QE model quickly via prompt-based fine-tuning or adapter layers.
    • Design Motivation: Relative judgments are much easier than absolute scoring, and can be completed even by non-professional bilinguals, greatly expanding the pool of potential annotators.
  3. Multi-Granularity Automatic Metric Combination:

    • Function: Integrate multiple signal sources to provide a robust estimate of translation quality.
    • Mechanism: Fuse word-level, sentence-level, and document-level signals. Word-level metrics focus on lexical coverage and alignment quality, sentence-level metrics use cross-lingual embedding similarity, and document-level metrics evaluate overall fluency and consistency. These are combined through weighted aggregation to form a final evaluation score, where the weights can be learned from a small amount of annotated data or set to predefined default weights.
    • Design Motivation: A single metric exhibits high instability on low-resource languages; combining multiple metrics significantly improves robustness.
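
The weighted aggregation in design 3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `combine_scores`, the assumption that every signal is already scaled to [0, 1], and all weight values are hypothetical. The summary above only states that the weights are either learned from a small annotated set or set to defaults.

```python
from typing import Dict, Optional

# Hypothetical default weights; the source does not give actual values.
DEFAULT_WEIGHTS = {"word": 1.0, "sentence": 1.0, "document": 1.0}


def combine_scores(signals: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-granularity quality signals.

    Assumes every signal is already normalized to the range [0, 1].
    """
    weights = weights or DEFAULT_WEIGHTS
    missing = set(signals) - set(weights)
    if missing:
        raise KeyError(f"no weight for signals: {sorted(missing)}")
    total_weight = sum(weights[name] for name in signals)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * value for name, value in signals.items())
    return weighted_sum / total_weight


# Illustrative signal values for one translated document.
signals = {"word": 0.5, "sentence": 0.8, "document": 0.6}
uniform = combine_scores(signals)
sentence_heavy = combine_scores(
    signals, {"word": 1.0, "sentence": 2.0, "document": 1.0}
)
```

Dividing by the total weight keeps the combined score in the same [0, 1] range as its inputs, so scores stay comparable whether default or learned weights are used.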

Loss & Training

The QE model is trained with a mean squared error (MSE) regression loss to predict human quality scores, supplemented by a domain classification loss for the adversarial training. In the few-shot adaptation phase, a contrastive loss is applied to positive and negative sample pairs constructed from the annotated relative rankings.
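
One common way to turn relative rankings into a training signal is a margin-based pairwise ranking loss. The sketch below uses that as a stand-in for the contrastive loss mentioned above; the exact loss form, the margin value, and the function name `pairwise_margin_loss` are assumptions, since the summary does not specify them.

```python
from typing import List, Tuple


def pairwise_margin_loss(pairs: List[Tuple[float, float]],
                         margin: float = 0.1) -> float:
    """Mean hinge loss over (better_score, worse_score) pairs.

    Each pair holds the model's predicted quality for the translation the
    annotator preferred and for the one they ranked lower. The loss is zero
    once the preferred translation outscores the other by at least `margin`.
    """
    if not pairs:
        raise ValueError("need at least one annotated pair")
    losses = [max(0.0, margin - (better - worse)) for better, worse in pairs]
    return sum(losses) / len(losses)


# Predicted scores for three annotated pairs (illustrative numbers only).
pairs = [
    (0.9, 0.4),   # correctly ordered with a wide gap: no penalty
    (0.55, 0.5),  # correctly ordered but inside the margin: small penalty
    (0.3, 0.6),   # wrongly ordered: largest penalty
]
loss = pairwise_margin_loss(pairs)
```

Because the loss depends only on score differences within a pair, it needs no absolute quality labels, which matches the simplified annotation protocol.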

Key Experimental Results

Main Results

| Method | High-Resource Pearson | Low-Resource Pearson | Extremely Low-Resource Pearson | Average |
| --- | --- | --- | --- | --- |
| Ours (zero-shot) | 0.81 | 0.63 | 0.48 | 0.64 |
| Ours (50-shot) | 0.83 | 0.72 | 0.61 | 0.72 |
| COMET-QE | 0.84 | 0.58 | 0.32 | 0.58 |
| BLEU | 0.62 | 0.41 | 0.29 | 0.44 |
| chrF | 0.68 | 0.49 | 0.38 | 0.52 |
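
The Pearson columns report the linear correlation between a metric's scores and human judgments over the same set of translations. A minimal sketch of that calculation follows; the scores are made up for illustration and none of them come from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


human = [20.0, 45.0, 60.0, 85.0]   # hypothetical human quality scores
metric = [0.30, 0.42, 0.55, 0.80]  # metric outputs for the same segments
r = pearson(human, metric)
```

Pearson correlation is invariant to the scale of either input, which is why a metric in [0, 1] can be compared directly against human scores on a 0-100 scale.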

Ablation Study

| Configuration | Low-Resource Pearson | Description |
| --- | --- | --- |
| Full (zero-shot) | 0.63 | Complete zero-shot scheme |
| w/o Language-Agnostic Alignment | 0.52 | Performance degrades significantly for low-resource languages without alignment |
| w/o Multi-Granularity Combination | 0.57 | Single metric is inferior to combination |
| Word-level Metrics Only | 0.44 | Word-level signals alone are insufficient |
| Sentence-level Embedding Similarity Only | 0.55 | Sentence-level is the most important single signal |

Key Findings

  • Zero-shot cross-lingual transfer provides a reasonable evaluation baseline for low-resource languages, but a clear gap still exists compared to high-resource languages.
  • Few-shot adaptation with only 50 samples can significantly narrow this gap (improving low-resource performance from 0.63 to 0.72), which is highly cost-effective.
  • Language-agnostic feature alignment is particularly helpful for low-resource languages (+0.11 Pearson, from 0.52 to 0.63), indicating that removing language-specific features is crucial for generalization.
  • Morphologically rich languages (such as Finnish and Turkish) are more difficult to evaluate than analytic languages, and BLEU performs exceptionally poorly on these languages.
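
Why word n-gram metrics struggle with rich morphology can be seen in a toy comparison of word-level and character-level overlap. This is a simplified illustration, not BLEU or chrF themselves (both combine several n-gram orders, and BLEU adds a brevity penalty); the helper name `ngram_precision` and the Turkish word pair are chosen here for illustration and do not come from the paper.

```python
from collections import Counter


def ngram_precision(hyp_units, ref_units, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = Counter(tuple(hyp_units[i:i + n]) for i in range(len(hyp_units) - n + 1))
    ref = Counter(tuple(ref_units[i:i + n]) for i in range(len(ref_units) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "evlerimizde"    # Turkish: "in our houses"
hypothesis = "evlerimizden"  # Turkish: "from our houses" (only the case suffix differs)

# Word unigrams: the two inflected forms count as entirely different tokens.
word_match = ngram_precision(hypothesis.split(), reference.split(), 1)
# Character trigrams: the shared stem and suffixes still earn partial credit.
char_match = ngram_precision(list(hypothesis), list(reference), 3)
```

Here the word-level precision is 0.0, while 9 of the 10 character trigrams match (0.9). A near-miss inflection is scored as a complete error at the word level, which is consistent with character-level metrics such as chrF being more forgiving on these languages.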

Highlights & Insights

  • Defining "accessibility" as a design principle for evaluation methods represents an important paradigm shift. Traditional evaluation research pursues higher accuracy, whereas this paper pursues broader coverage, which is of great significance for promoting global equity in NLP.
  • The idea of simplifying annotation protocols (relative ranking instead of absolute scoring) is ingenious and practical, drastically lowering the annotation barrier and potentially extending to other NLP tasks requiring human evaluation.
  • The concept of multi-granularity metric combination can be directly transferred to the evaluation of other generative tasks.

Limitations & Future Work

  • Evaluation performance on extremely low-resource languages (such as minor African languages) remains unsatisfactory and requires more targeted improvements.
  • The current method assumes that the multilingual pretrained model has some coverage of the target language; for completely unseen languages (e.g., those not included in XLM-R training), the performance may drop further.
  • The user study of the gamified annotation protocol was conducted on a small scale and requires larger-scale validation.
  • The possibility of utilizing LLMs (such as GPT-4) as translation evaluators can be explored as another cost-reduction alternative.

Related Work

  • vs COMET/COMET-QE: The COMET family performs excellently on high-resource languages but generalizes poorly to low-resource languages; the cross-lingual alignment strategy in this paper compensates for this shortcoming.
  • vs chrF: As a character-level metric, chrF outperforms BLEU on morphologically rich languages but remains far inferior to neural methods; this paper utilizes it as one of the signals in the multi-granularity combination.
  • This work is directly connected to the QE tasks in the WMT Shared Task, but places greater emphasis on practical deployability under low-resource scenarios.

Rating

  • Novelty: ⭐⭐⭐ The method itself is an incremental combination of existing techniques, but the perspective of "accessibility" is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers languages with various resource constraints, making the evaluation comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clearly formulated problems and thorough discussion on social impact.
  • Value: ⭐⭐⭐⭐⭐ Highly significant for promoting the equitable application of NLP technologies worldwide.