Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-Based Machine Translation

Conference: AAAI 2026 · arXiv: 2510.00829 · Code: GitHub · Area: Information Retrieval · Keywords: Retrieval-Augmented Translation, Noise Robustness, Context Over-Reliance, Calibration, Multilingual

TL;DR

This work develops a controlled noise-injection framework to systematically evaluate retrieval-augmented machine translation (REAL-MT), introduces two new metrics—Fidelity and CAR (Context Adoption Rate)—and shows, across 10 language pairs × 4 noise types, that models blindly adopt retrieved context even when it is contradictory (CAR remains at 65–78%). Large reasoning models (LRMs) prove even more vulnerable, "rationalizing" erroneous context rather than rejecting it, and a fundamental trade-off holds between noise robustness and clean-context utilization.

Background & Motivation

Background: Retrieval-augmented machine translation (REAL-MT) uses translation memories to assist LLM-based translation, yet retrieval quality cannot be guaranteed in real-world deployments.

Limitations of Prior Work: (a) The vulnerability of REAL-MT to noisy retrieved results remains unclear; (b) metrics specifically designed to evaluate translation–context interaction are lacking; (c) it is unknown whether large reasoning models (e.g., Qwen3-8B) are more robust; (d) vulnerability in low-resource language pairs has not been investigated.

Key Challenge: LLM-based translation relies heavily on retrieved context—beneficial when correct, but over-reliance propagates errors under noise. Furthermore, improving noise robustness degrades utilization of correct context.

Goal: Systematically quantify the noise vulnerability of REAL-MT and expose the fundamental trade-off therein.

Key Insight: Controlled noise injection—four semantic noise types simulating retrieval failures of varying severity.

Core Idea: LLM reliance on retrieved context is a double-edged sword; systematic noise analysis exposes the fundamental robustness–utilization trade-off.

Method

Overall Architecture

Input: 1,200 cross-lingual idiom/proverb translation instances × 10 language pairs. Four noise types are injected. Output: evaluation via Fidelity and CAR.

Key Designs

  1. Four Semantic Noise Types (varying severity; a worked example follows this list):

    • \(\mathbb{N}_{literal}\): Literal translation (preserves surface form but fails to convey idiomatic meaning)
    • \(\mathbb{N}_{semantic}\): Semantic perturbation (related but deviated meaning)
    • \(\mathbb{N}_{opposite}\): Opposite meaning (directly contradictory translation)
    • \(\mathbb{N}_{struct}\): Structural perturbation (correct meaning but altered expression)
    • Design Motivation: Each type simulates a distinct severity level of real-world retrieval failure.
  2. New Metrics:

    • Fidelity: Whether the translation correctly conveys the idiomatic meaning (rather than matching the surface form)
    • CAR (Context Adoption Rate): The proportion of translations that adopt the meaning of the retrieved context (a heuristic scoring sketch follows this list)
    • Noise quality validation: TER = 25.2, Sim(gold, struct) = 0.92, contradiction rate = 0.85
  3. 10 Language Pairs:

    • High-resource: De→En, Fr→En, Zh→En
    • Mid-resource: Hi→En
    • Low-resource: Fi→En, Ja→En, etc.
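
To make the noise taxonomy concrete, here is a minimal Python sketch with an invented example; the idiom and all five variants are illustrative stand-ins, not items from the paper's 1,200-instance dataset:

```python
# Illustrative example (not from the paper's dataset): the German idiom
# "Tomaten auf den Augen haben", whose idiomatic meaning is roughly
# "to be oblivious to what is going on around you".
RETRIEVED_VARIANTS = {
    # Correct retrieval: faithful idiomatic translation.
    "gold":     "to be oblivious to what is going on",
    # N_literal: surface form preserved, idiomatic meaning lost.
    "literal":  "to have tomatoes on one's eyes",
    # N_semantic: related but deviated meaning.
    "semantic": "to be easily distracted",
    # N_opposite: directly contradicts the idiom's meaning.
    "opposite": "to be keenly aware of everything around you",
    # N_struct: correct meaning, altered surface expression.
    "struct":   "to be blind to what is happening around you",
}

def retrieved_context(noise_type: str) -> str:
    """Return the (possibly noisy) 'retrieved' translation-memory entry."""
    return RETRIEVED_VARIANTS[noise_type]
```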

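The paper's precise scoring protocol is not reproduced in these notes. As a rough sketch under stated assumptions, CAR could be approximated with sentence embeddings, counting an output as "adopting" the retrieved context when it sits closer to that context than to the gold meaning. The `margin` threshold and the MiniLM encoder are assumptions, not the paper's choices:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def context_adopted(output: str, context: str, gold: str,
                    margin: float = 0.05) -> bool:
    """Heuristic CAR decision: the output counts as 'adopting' the
    retrieved context if it is embedding-closer to the context than
    to the gold idiomatic meaning by at least `margin`."""
    emb = encoder.encode([output, context, gold], convert_to_tensor=True)
    sim_ctx = util.cos_sim(emb[0], emb[1]).item()
    sim_gold = util.cos_sim(emb[0], emb[2]).item()
    return sim_ctx - sim_gold > margin

def car(outputs, contexts, golds):
    """Context Adoption Rate over a batch, in percent."""
    hits = sum(context_adopted(o, c, g)
               for o, c, g in zip(outputs, contexts, golds))
    return 100.0 * hits / len(outputs)
```
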
Loss & Training

This is an evaluation study—no model training is performed. Open-source and closed-source models are tested with greedy decoding on H800 GPUs.
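
A minimal sketch of what one evaluation call could look like, assuming a Hugging Face Transformers setup; the prompt template is an invented stand-in, not the paper's actual prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the open-source models evaluated
tok = AutoTokenizer.from_pretrained(MODEL)
llm = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def translate(source: str, retrieved: str | None) -> str:
    """Greedy-decoded translation, optionally conditioned on a
    (possibly noisy) retrieved translation-memory entry."""
    ctx = f"\nRetrieved reference translation: {retrieved}" if retrieved else ""
    messages = [{
        "role": "user",
        "content": f"Translate the following idiom into English, "
                   f"conveying its idiomatic meaning.{ctx}\nIdiom: {source}",
    }]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt").to(llm.device)
    out = llm.generate(inputs, max_new_tokens=128, do_sample=False)  # greedy
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
```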

Key Experimental Results

Main Results (Hi→En, Qwen2.5-7B)

| Condition        | Fidelity | CAR (%) | Note                                   |
|------------------|----------|---------|----------------------------------------|
| No context       | 0.8      | 65.8    | Baseline                               |
| Correct meaning  | 2.1      | 78.4    | +1.3 Fidelity / +12.6 CAR vs. baseline |
| Structural noise | 1.9      | 77.5    | Correct meaning, differently expressed |
| Literal noise    | 1.3      | 64.9    | Below no-context baseline              |
| Opposite meaning | Low      | ~70     | Blind adoption of contradiction        |

Large Reasoning Model (LRM) Comparison

| Model Type     | Noise Robustness | Note                                                              |
|----------------|------------------|-------------------------------------------------------------------|
| Standard LLM   | Moderate         | Baseline                                                          |
| LRM (Qwen3-8B) | Worse            | More vulnerable: "rationalizes" errors rather than detecting them |

Mitigation Strategy Comparison

| Strategy                   | Noise Robustness | Clean-Context Performance |
|----------------------------|------------------|---------------------------|
| No mitigation              | Poor             | Optimal                   |
| Training-time mitigation   | Improved         | Degraded                  |
| Inference-time mitigation  | Improved         | Degraded                  |

Key Findings

  • CAR remains 65–78% under contradictory context—models treat retrieved context as authoritative ground truth.
  • LRMs are paradoxically more vulnerable: rather than detecting noise, LRMs construct justifications for why the erroneous context is plausible—a failure of metacognitive calibration.
  • The fundamental trade-off is unavoidable: all mitigation strategies improve noise robustness at the cost of reduced clean-context utilization.
  • Low-resource language pairs are especially vulnerable—limited intrinsic knowledge increases reliance on external context.

Highlights & Insights

  • "LRMs are more vulnerable" is a counterintuitive yet important finding—reasoning capability does not equate to calibration capability. LRMs rationalize clearly contradictory context rather than rejecting it, serving as an important warning for the deployment of large reasoning models.
  • The fundamental trade-off implies there is no free lunch—either trust the context (and fail under noise) or distrust it (and waste correct context). Novel self-verification integration mechanisms are needed.
  • The four-level noise design provides a standardized evaluation framework for future research.

Limitations & Future Work

  • Evaluation is limited to idiomatic/proverbial translation—noise sensitivity in general-domain translation may differ.
  • Noise is synthetically constructed—the distribution of real-world retrieval noise may be more complex.
  • Dynamic noise detection strategies (e.g., having the model assess context credibility before deciding how much to adopt) are not explored; a sketch of what such a strategy might look like follows this list.
  • Confidence calibration methods to mitigate over-reliance warrant further investigation.
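
As a purely hypothetical illustration of that unexplored direction (the paper does not implement it), an assess-then-adopt wrapper might look like the sketch below, reusing `translate`, `tok`, and `llm` from the sketch in the Method section:

```python
def ask_llm(prompt: str) -> str:
    """One greedy-decoded turn, reusing `tok`/`llm` from the earlier sketch."""
    msgs = [{"role": "user", "content": prompt}]
    ids = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                  return_tensors="pt").to(llm.device)
    out = llm.generate(ids, max_new_tokens=8, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def translate_with_credibility_check(source: str, retrieved: str) -> str:
    """Assess-then-adopt: condition on the retrieved entry only if the
    model itself judges it faithful to the idiom's meaning."""
    verdict = ask_llm(
        f"Does this candidate translation faithfully convey the idiomatic "
        f"meaning of '{source}'? Answer yes or no.\nCandidate: {retrieved}"
    )
    keep = verdict.strip().lower().startswith("yes")
    return translate(source, retrieved if keep else None)
```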

Comparison with Related Work

  • vs. kNN-MT and other retrieval-augmented translation methods: Prior work focuses on how to exploit retrieved results effectively; this paper examines what happens when retrieved results are unreliable.
  • vs. RAG robustness studies: Similar noise-robustness challenges exist in the broader RAG literature; the trade-off findings reported here are potentially generalizable.
  • Insight: Any LLM system relying on external knowledge faces an analogous "trust-vs-verify" dilemma.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematic robustness evaluation framework for retrieval-augmented translation
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10 language pairs × 4 noise types × multiple models + mitigation strategy analysis
  • Writing Quality: ⭐⭐⭐⭐ In-depth analysis
  • Value: ⭐⭐⭐⭐ Important cautionary findings for safe deployment of RAG-based translation