XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics¶
Conference: ACL 2026
arXiv: 2604.14934
Code: GitHub
Area: AI Safety
Keywords: Translation Evaluation Metrics, Cross-Lingual Scoring Bias, MQM Error Injection, Multilingual Benchmark, Metric Calibration
TL;DR¶
This paper constructs XQ-MEval, the first translation evaluation benchmark with cross-lingual parallel quality. Using semi-automated MQM error injection, it generates pseudo-translations of controllable quality, provides the first empirical evidence of cross-lingual scoring bias in automatic evaluation metrics, and proposes an LGN normalization strategy that effectively calibrates multilingual metric evaluation.
Background & Motivation¶
Background: Evaluation of multilingual translation systems typically relies on automatic metrics (COMET, MetricX, etc.), and the standard practice is to average metric scores across language directions to obtain a system-level score. MQM human evaluation achieves cross-lingual comparability through standardized error categories and hierarchical penalty scoring.
Limitations of Prior Work: The averaging strategy implicitly assumes that a metric scores comparable errors on the same scale in every language, but in practice metrics may exhibit cross-lingual scoring bias: translations of the same quality receive different scores in different languages. For example, translations containing the same major error can receive significantly different COMET scores across language directions.
Key Challenge: No benchmark dataset provides cross-lingual parallel quality instances, making it impossible to systematically quantify and verify scoring bias. Expert annotation costs are extremely high, limiting language coverage.
Goal: (1) Construct a cross-lingual parallel quality benchmark; (2) quantify cross-lingual scoring bias; (3) propose calibration strategies to improve fairness in multilingual evaluation.
Key Insight: Automatically inject MQM-defined errors into high-quality reference translations, generating pseudo-translations of controllable quality by varying the number of injected errors, with native-speaker filtering to ensure reliability.
Core Idea: By injecting a controllable number of MQM errors into high-quality Flores translations, the paper constructs cross-lingual quality-parallel triplets (source, pseudo-translation, reference), so that cross-lingual comparisons are grounded in identical errors.
Method¶
Overall Architecture¶
Three-stage pipeline: (1) Phrase-level: GPT-4o injects a single MQM major error into each reference translation, followed by native speaker filtering; (2) Sentence-level: 0–5 errors are merged to generate pseudo-translations at six quality levels; (3) System-level: triplets (source + pseudo-translation + reference) are assembled into pseudo-systems, and automatic metrics are evaluated against the predefined MQM scores. The benchmark covers 9 translation directions.
Key Designs¶
- Semi-Automated Error Injection and Filtering:
- Function: Generate cross-lingual quality-parallel single-error candidates
- Mechanism: For each reference translation in Flores, GPT-4o injects four MQM error types (Addition, Omission, Mistranslation, Untranslated), each in both the first and second half of the sentence, producing up to 8 candidates per instance (see the construction sketch after this list). Two native speakers independently review the candidates, retaining only those approved unanimously. Error types are chosen to be purely semantic, ensuring cross-lingual comparability
- Design Motivation: LLM injection + human filtering balances cost and reliability, being far more economical than pure expert annotation while ensuring semantic equivalence of errors across languages
- Controllable Quality Pseudo-Translation Generation:
- Function: Generate pseudo-translations with predefined MQM scores
- Mechanism: Merges 0–5 non-overlapping error segments from the error pool to generate pseudo-translations, where 0 errors corresponds to a perfect score (0 penalty) and 5 errors corresponds to the worst (-25 points, -5 per major error). Each language has parallel instances at each quality level, with quality tiers aligned across languages
- Design Motivation: Controlling error quantity achieves a controllable quality gradient, enabling direct comparison of "same-quality" translations across languages
- LGN Normalization Calibration Strategy:
- Function: Eliminate cross-lingual scoring bias and equalize multilingual system evaluation
- Mechanism: Language-specific Global Normalization — for each language direction, estimates the metric score distribution (mean and standard deviation) using pseudo-translation scores from XQ-MEval, then applies z-score normalization to actual evaluation scores, mapping all languages to the same scale before averaging
- Design Motivation: When directly averaging scores across languages, high scores in high-resource languages can mask low scores in low-resource languages; LGN makes scores comparable across languages
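A minimal sketch of the construction logic described above: single-error candidates are merged without overlap into a reference, and the predefined MQM score is -5 per major error. The data layout and names (`ErrorCandidate`, `build_pseudo_translation`) are illustrative assumptions; the GPT-4o prompting and native-speaker filtering steps that produce the candidate pool are not shown.

```python
from dataclasses import dataclass
import random

# Hypothetical single-error candidate: a span of the reference corrupted with
# one of the four semantic MQM major error types, in one half of the sentence.
@dataclass
class ErrorCandidate:
    error_type: str        # "Addition" | "Omission" | "Mistranslation" | "Untranslated"
    position: str          # "first_half" | "second_half"
    span: tuple[int, int]  # character span of the reference that was corrupted
    corrupted_text: str    # the erroneous replacement text

MAJOR_ERROR_PENALTY = -5  # MQM penalty per major error

def build_pseudo_translation(reference: str, pool: list[ErrorCandidate], k: int):
    """Merge k non-overlapping error candidates into the reference.

    k = 0 returns the untouched reference (penalty 0); k = 5 returns the worst
    quality level (penalty -25). Returns the pseudo-translation and its
    predefined MQM score (computed from the errors actually placed).
    """
    chosen: list[ErrorCandidate] = []
    for cand in random.sample(pool, len(pool)):          # shuffled candidate pool
        if len(chosen) == k:
            break
        # keep only candidates that do not overlap any already-chosen span
        if all(cand.span[1] <= c.span[0] or cand.span[0] >= c.span[1] for c in chosen):
            chosen.append(cand)
    # apply spans right-to-left so earlier character offsets stay valid
    text = reference
    for cand in sorted(chosen, key=lambda c: c.span[0], reverse=True):
        start, end = cand.span
        text = text[:start] + cand.corrupted_text + text[end:]
    return text, MAJOR_ERROR_PENALTY * len(chosen)
```

Running this with k = 0…5 per reference yields parallel instances at all six quality levels in every language direction, which is exactly the quality alignment the benchmark relies on.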
Loss & Training¶
XQ-MEval is an evaluation benchmark rather than a training method, involving no model training. LGN is a test-time calibration strategy that only requires XQ-MEval data to estimate score distribution parameters for each language.
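A minimal sketch of how LGN could be applied at test time, assuming metric scores on XQ-MEval pseudo-translations are available per language direction; function names and the toy numbers are illustrative, not taken from the paper's code.

```python
import statistics

def fit_lgn(xq_meval_scores: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Estimate per-direction mean/std of the metric on XQ-MEval pseudo-translations
    (all quality levels pooled)."""
    return {
        direction: (statistics.mean(scores), statistics.stdev(scores))
        for direction, scores in xq_meval_scores.items()
    }

def lgn_system_score(system_scores: dict[str, float],
                     params: dict[str, tuple[float, float]]) -> float:
    """Z-normalize each direction's score with its XQ-MEval statistics, then average,
    so every direction contributes on the same scale."""
    zs = [(score - params[d][0]) / params[d][1] for d, score in system_scores.items()]
    return sum(zs) / len(zs)

# Usage sketch with toy numbers (illustrative only):
params = fit_lgn({
    "en-de": [0.92, 0.85, 0.78, 0.70, 0.63, 0.55],  # scores at 0..5 injected errors
    "en-lo": [0.80, 0.76, 0.73, 0.69, 0.66, 0.62],
})
raw = {"en-de": 0.84, "en-lo": 0.71}                 # a real system's per-direction scores
naive = sum(raw.values()) / len(raw)
calibrated = lgn_system_score(raw, params)
```

Because each direction is standardized against its own XQ-MEval score distribution, a direction on which the metric is systematically generous no longer dominates the averaged system score.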
Key Experimental Results¶
Main Results¶
| Metric | Consistency with MQM (direct averaging) | Consistency with MQM (after LGN) | Note |
|---|---|---|---|
| COMET-22 | Low | Significantly improved | One of the regression metrics with worst cross-lingual bias |
| MetricX-23 | Low | Improved | Similar bias issues |
| BLEU | Medium | Improved | Sequence metrics show less bias |
| chrF | Medium | Improved | Character-level metrics relatively robust |
Ablation Study¶
| Analysis Dimension | Finding |
|---|---|
| Bias manifestation 1: Same quality, different scores | With 1 major error, COMET score difference exceeds 0.1 between en-zh and en-ja |
| Bias manifestation 2: Inconsistent quality degradation rates | Score decline slopes vary significantly across languages as errors increase from 0 to 5 |
| LGN vs direct averaging | LGN significantly reduces cross-language score range differences |
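A small sketch of how the two bias manifestations above could be quantified from XQ-MEval, assuming metric scores are indexed by language direction and number of injected errors; the data layout and helper names are hypothetical, not from the released code.

```python
import numpy as np

def mean_score_by_level(scores: dict[str, dict[int, list[float]]]) -> dict[str, list[float]]:
    """Average metric score per direction at each error count 0..5."""
    return {d: [float(np.mean(by_level[k])) for k in range(6)]
            for d, by_level in scores.items()}

def same_quality_gap(means: dict[str, list[float]], level: int) -> float:
    """Bias manifestation 1: score spread across directions at the same quality level."""
    vals = [m[level] for m in means.values()]
    return max(vals) - min(vals)

def degradation_slopes(means: dict[str, list[float]]) -> dict[str, float]:
    """Bias manifestation 2: per-direction slope of score vs. number of errors."""
    levels = np.arange(6)
    return {d: float(np.polyfit(levels, m, 1)[0]) for d, m in means.items()}
```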
Key Findings¶
- First empirical evidence that automatic translation metrics exhibit systematic cross-lingual scoring bias, most severe in regression metrics (COMET, MetricX)
- Bias manifests in two ways: (1) different scores for same quality; (2) inconsistent quality degradation rates across languages
- The direct averaging strategy shows clear inconsistency with MQM human evaluation
- LGN normalization effectively mitigates bias, improving fairness and reliability of multilingual evaluation
- Low-resource languages (lo, si) typically exhibit more severe bias
Highlights & Insights¶
- Raises a previously overlooked but critically important problem — cross-lingual scoring bias in translation metrics directly affects the fairness of multilingual system selection. This has practical implications for NMT competition rankings and product decisions
- The semi-automated construction method (LLM injection + human filtering) is an ingenious compromise, enabling coverage of 9 languages. This pipeline is generalizable to constructing other cross-lingual quality-aligned benchmarks
- The LGN strategy, while simple, is remarkably effective with low implementation cost and can be directly applied to existing evaluation workflows
Limitations & Future Work¶
- Pseudo-translations are synthetic, with distributional differences from real translation system outputs
- Only covers 4 MQM error types (accounting for 46.3% of total errors); fluency and other types are not included
- Some low-resource languages have only 1 reviewer, slightly reducing reliability
- Flores has only 102 instances per direction, limiting scale
- Future work could extend to more languages and error types, and explore more sophisticated calibration methods
Related Work & Insights¶
- vs WMT MQM: WMT MQM is expert annotation for single language directions, unable to provide cross-lingual parallel quality; this paper achieves parallelism through synthesis
- vs COMET/MetricX: This paper reveals systematic bias in these SOTA metrics, showing that direct score averaging may mislead system selection
- vs Von Däniken et al. 2025: Found that metrics may be inconsistent even within a single direction; this paper extends the analysis to the cross-lingual dimension
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic study and quantification of cross-lingual bias in translation metrics
- Experimental Thoroughness: ⭐⭐⭐⭐ 9 languages × 9 metrics — broad coverage
- Writing Quality: ⭐⭐⭐⭐⭐ Clear problem motivation, rigorous pipeline design
- Recommendation: ⭐⭐⭐⭐ Direct practical guidance for NMT evaluation fairness