
Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection

  • Conference: NeurIPS 2025
  • arXiv: 2601.19871
  • Code: GitHub
  • Area: Machine Translation / Low-Resource Languages / Prompt Engineering
  • Keywords: low-resource translation, self-reflection, structured prompting, LLM translation, isiZulu, isiXhosa, RAKE masking

TL;DR

This paper proposes the Reflective Translation framework, which enables LLMs to perform structured self-critique of their initial translations at inference time—identifying mistranslations, omissions, and semantic distortions—and subsequently generate revised translations based on this critique. The approach requires no fine-tuning or additional annotated data, yet achieves statistically significant improvements in BLEU and COMET on low-resource African languages such as isiZulu and isiXhosa.

Background & Motivation

Background: Machine translation (MT) demands linguistic accuracy, semantic fidelity, and contextual consistency. In recent years, large language models (LLMs) have demonstrated strong translation capabilities without task-specific fine-tuning, with models such as GPT and Claude achieving competitive performance on high-resource language pairs. In low-resource settings, however, where parallel corpora are extremely scarce, LLM translation quality remains substantially weaker: hallucinations, omissions, and semantic distortions occur at far higher rates than in high-resource languages. Robinson et al. (2023) demonstrate that ChatGPT is competitive on high-resource translation but degrades significantly on low-resource languages. isiZulu (Zulu, approximately 12 million native speakers) and isiXhosa (Xhosa, approximately 8 million native speakers), both spoken in South Africa, are representative low-resource languages of the Southern Bantu family. Their complex morphology, including noun class systems and rich verbal morphology, makes translation substantially harder than for European language pairs. In isiZulu, for instance, the noun class system comprises more than 15 classes with distinct prefix patterns, while subject agreement prefixes, object prefixes, tense markers, and aspect markers combine on the verb to encode complete sentence-level semantics. This highly agglutinative, "polysynthetic" character poses a severe challenge to LLM translation competence.

Limitations of Prior Work: Two main approaches currently exist for improving low-resource translation quality. The first is collecting additional parallel corpora for fine-tuning; however, for truly low-resource languages, annotation costs are high, data acquisition is difficult, and even fine-tuned models exhibit limited out-of-domain generalization. The second is leveraging the zero-shot or few-shot capabilities of multilingual pretrained models; yet experiments show that models such as ChatGPT perform far below their high-resource levels on low-resource languages, particularly on morphologically complex African languages. More critically, errors in LLM translations are frequently not a matter of complete ignorance but rather of incorrect details—wrong tense, confused noun class markers, or omitted key semantic components. Such errors fall precisely within the category of mistakes that the model itself is capable of identifying and correcting.

Key Challenge: LLMs possess a degree of translation capability for low-resource languages, yet initial translations frequently contain semantic errors that the model itself could identify and correct. Conventional translation pipelines perform only a single forward generation pass and lack any post-hoc correction mechanism. The central question is: how can a model's existing knowledge be leveraged to repair fine-grained translation errors without incurring additional training costs? In other words, the problem is not that the model "does not know" how to translate, but that it "fails to get it right the first time." If the model is given the opportunity to review and revise its output, can translation quality be substantially improved? This inference-time error-correction paradigm has been validated in reasoning and code generation, but has not been systematically studied in machine translation, particularly for low-resource languages.

Goal: The authors decompose the problem into three questions: (1) How can an LLM be made to systematically identify the error types in its own translations (mistranslation, omission, semantic distortion)? (2) How can error diagnoses be converted into actionable correction guidance? (3) How can we ensure that the model genuinely "re-understands and re-translates" during correction, rather than simply copying content from the reflection text?

Key Insight: Self-reflection mechanisms have recently demonstrated significant effects in LLM reasoning. Reflexion (Shinn et al., 2023) reinforces agents through linguistic (verbal) feedback rather than weight updates; Self-Refine (Madaan et al., 2023) improves generation quality through iterative self-feedback; Chain-of-Verification (Dhuliawala et al., 2023) enhances factual consistency through verification chains. A shared insight across these works is that LLMs are often capable of identifying and improving upon problems in their own outputs. The authors transfer this perspective to translation, framing it as a form of constrained reasoning: the target sentence must preserve the complete semantics of the source sentence. Under this framing, self-reflection becomes an inference-time error-correction mechanism.

Core Idea: Through structured multi-turn prompting, the LLM is guided to first translate, then self-diagnose errors, and finally revise accordingly—achieving significant translation quality improvements on low-resource languages without any fine-tuning.

Method

Overall Architecture

Reflective Translation is a multi-stage prompting pipeline operating at inference time, involving no model parameter updates. The overall process consists of three steps:

Input: A source-language text segment (e.g., an isiZulu or isiXhosa sentence).

Stage 1 — First-Pass Translation: The source sentence is submitted to the LLM with a standard translation prompt to generate an initial English translation. The prompt format is carefully designed to require the model to wrap its output in <START_TRANSLATION> and <END_TRANSLATION> tags, facilitating automated parsing. This step is equivalent to a conventional zero-shot or few-shot translation baseline and serves as the anchor for the subsequent reflection-correction pipeline. Depending on the prompting strategy, the initial translation may be generated via zero-shot, chain-of-thought-style, or few-shot prompting.
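To make the tag convention concrete, here is a minimal parsing sketch (the paper specifies the tags, but the regex helper is my own illustration):

```python
import re

# Tags the prompt instructs the model to wrap its translation in.
TAG_PATTERN = re.compile(r"<START_TRANSLATION>(.*?)<END_TRANSLATION>", re.DOTALL)

def extract_translation(model_output: str) -> str:
    """Pull the translation out of a tag-wrapped LLM response.

    Falls back to the raw output if the model ignored the tags.
    """
    match = TAG_PATTERN.search(model_output)
    return match.group(1).strip() if match else model_output.strip()
```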

Stage 2 — Structured Reflection: This is the core innovation of the framework. Rather than directly revising the translation, the model first generates a structured self-critique report comprising three components: error identification (mistranslation, omission, distortion); high-level correction guidance (general, reusable correction principles); and critical semantic content that must be preserved. The input to this stage is the source text together with the first-pass translation, and the output is structured text organized according to the three sub-components. To prevent the model from directly leaking the correct translation at this stage—thereby reducing the third stage to mere copying—the framework introduces a RAKE keyword masking mechanism that replaces key content words in the reflection text with <MASK> placeholders.

Stage 3 — Second-Pass Translation: The model receives both the original source text and the masked reflection report, and under the joint guidance of these two signals, regenerates an improved translation. Since substantive content words in the reflection report have been masked, the model cannot simply copy answers from the reflection text; instead, it must re-understand the source text and generate a revised translation guided by the identified error types and correction directions. The revised translation is likewise output in tag-wrapped format.

Output: The corrected target-language translation, together with a complete reflection record containing full details of the error diagnosis and correction guidance. As a by-product, the pipeline also produces (source text, first-pass translation, reflection, corrected translation) quadruples, which may be used for subsequent supervised training research.
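To make the data flow concrete, here is a minimal end-to-end sketch of the three stages. The `llm` callable, the prompt wording, and the `rake_mask` helper (sketched under Key Designs below) are illustrative assumptions, not the authors' exact implementation; `extract_translation` is the parsing helper from the sketch above:

```python
def reflective_translate(source: str, llm, rake_mask) -> dict:
    """Three-stage Reflective Translation pipeline (inference only)."""
    # Stage 1: first-pass translation, tag-wrapped for automated parsing.
    first_pass = extract_translation(llm(
        "Translate the following isiZulu sentence into English. Wrap your "
        "answer in <START_TRANSLATION> and <END_TRANSLATION> tags.\n"
        f"Source: {source}"
    ))

    # Stage 2: structured self-critique of the first pass.
    reflection = llm(
        f"Source: {source}\nTranslation: {first_pass}\n"
        "Critique the translation under three headings: (a) Error "
        "Identification, (b) High-level Fixes, (c) Critical Content Constraints."
    )

    # Mask substantive content words so Stage 3 cannot simply copy answers.
    masked_reflection = rake_mask(reflection)

    # Stage 3: second-pass translation guided by the masked reflection.
    second_pass = extract_translation(llm(
        f"Source: {source}\nReflection (key terms masked): {masked_reflection}\n"
        "Re-translate the source into English, guided by the critique. Wrap "
        "your answer in <START_TRANSLATION> and <END_TRANSLATION> tags."
    ))

    # By-product quadruple, usable for later supervised training research.
    return {"source": source, "first_pass": first_pass,
            "reflection": reflection, "second_pass": second_pass}
```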

Key Designs

  1. Structured Reflection Template:

    • Function: Constrains the format of the model's self-critique output, systematizing and standardizing the reflection process.
    • Mechanism: The reflection report is divided into three clearly defined sub-components. (a) Error Identification: The model must identify key mistranslations, omissions, or semantic distortions in the initial translation and specify their locations. This is not a vague assessment of "poor translation" but requires the model to localize errors at the phrase or word-group level. (b) High-level Fixes: The model must provide general, reusable correction principles—e.g., "preserve proper nouns in the original form," "correct tense/aspect markers," "repair subject-verb agreement." These are abstract translation strategies rather than specific rewriting suggestions, intended to prompt the model to approach correction from first principles rather than surface-level phrasing. (c) Critical Content Constraints: Explicitly enumerates the core semantic fragments or constraints from the source sentence that must be preserved in the translation, ensuring that fixing one error does not introduce new omissions.
    • Design Motivation: Unconstrained free-form reflection tends to produce verbose, unstructured self-assessment that can interfere with the correction process. By forcing reflection to decompose into three functionally complementary sub-tasks—each with clearly defined output expectations—the model can perform diagnosis and correction more efficiently. Furthermore, this structured design makes reflection outputs recordable and analyzable as annotation data, supporting subsequent research on reflection behavior.
  2. RAKE Keyword Masking Mechanism:

    • Function: Before the reflection text is passed to Stage 3, automatically extracts and masks substantive content words within it, preventing the model from "taking a shortcut" by copying correct translation fragments already present in the reflection.
    • Mechanism: The RAKE (Rapid Automatic Keyword Extraction) algorithm (implemented via NLTK) is applied to extract key phrases from the reflection text. RAKE identifies the most informative phrases in a text from word frequency and word co-occurrence (degree) statistics, requiring no training data or predefined lexicons. Extracted key phrases are replaced with <MASK> placeholders, so that the reflection text passed to Stage 3 retains its structure and correction intent while the specific target-language vocabulary is concealed. The model must therefore retranslate from the source text rather than transcribe from the reflection; a sketch of the template and masking step follows this list.
    • Design Motivation: A known problem in self-reflection research is "information leakage"—if the correct answer is already written out in the reflection stage, then improvement at the correction stage reflects not genuine understanding of the error but merely the availability of the correct answer in context. RAKE masking is a lightweight, training-free solution that forces the model to apply reflection results at the semantic rather than lexical level. This design is among the most technically elegant in the framework: it preserves information about "error types" and "correction directions" from the reflection while preventing direct copying. The choice of RAKE is also judicious: as an unsupervised, statistically driven method, it requires no domain-specific resources and is particularly well-suited to low-resource settings.
  3. Multi-Strategy Prompt Evaluation:

    • Function: Systematically evaluates three prompting strategies within the framework—zero-shot baseline, chain-of-thought-style (CoT-style), and few-shot prompting—to isolate the contribution of the reflection mechanism from that of the prompting strategy.
    • Mechanism: The three strategies correspond to different degrees of "prior knowledge injection." The zero-shot baseline provides only the translation instruction; the CoT-style prompt adds the requirement to "perform internal reasoning before translating," without requiring the model to display its reasoning; the few-shot prompt provides two isiZulu→English translation examples in context. These three strategies are each combined with the reflection-correction pipeline, yielding \(3 \times 2 = 6\) experimental conditions (first-pass and second-pass for each strategy), enabling analysis of which baseline benefits most from the reflection mechanism and whether few-shot examples help stabilize reflection behavior.
    • Design Motivation: Testing a single prompting strategy makes it impossible to disentangle whether translation improvements stem from the reflection mechanism or from a better prompt. Multi-strategy evaluation ensures the robustness of conclusions—if reflection yields improvements across all strategies, its effect is general and orthogonal to the prompting strategy.
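Below is a minimal sketch of how the structured reflection template and the RAKE masking step might fit together. The template wording paraphrases the paper's three sub-components, and the masking uses the rake-nltk package; both the exact phrasing and the package choice are assumptions rather than the authors' verbatim implementation:

```python
# pip install rake-nltk  (builds on NLTK's stopword lists)
import re
from rake_nltk import Rake

# Paraphrase of the three-part structured reflection template.
REFLECTION_TEMPLATE = """Source (isiZulu): {source}
Initial translation (English): {first_pass}

Critique the translation under exactly three headings:
1. Error Identification: mistranslations, omissions, or semantic distortions,
   localized at the phrase or word-group level.
2. High-level Fixes: general, reusable correction principles
   (e.g., "correct tense/aspect markers"), not concrete rewrites.
3. Critical Content Constraints: source-side semantic fragments that the
   revised translation must preserve."""

def rake_mask(reflection: str, max_phrases: int = 10) -> str:
    """Replace the most informative phrases in the reflection with <MASK>,
    so Stage 3 cannot copy target-language wording out of the critique while
    error types and correction directions remain visible."""
    rake = Rake()  # unsupervised: stopword delimiting + co-occurrence scoring
    rake.extract_keywords_from_text(reflection)
    masked = reflection
    for phrase in rake.get_ranked_phrases()[:max_phrases]:
        masked = re.sub(re.escape(phrase), "<MASK>", masked, flags=re.IGNORECASE)
    return masked
```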

Loss & Training

The proposed method operates entirely at inference time and involves no model training or parameter updates. All improvements are achieved through prompt engineering. Evaluation employs two automatic translation quality metrics:

  • BLEU: A classical metric based on n-gram precision matching, formulated as \(\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)\), where \(BP\) is the brevity penalty, \(p_n\) is the n-gram precision, and \(w_n\) is the weight. BLEU primarily measures lexical-level translation accuracy.
  • COMET: A learned neural translation evaluation metric, \(\text{COMET}(x, \hat{y}, y) = f_\theta(x, \hat{y}, y)\), which scores a hypothesis \(\hat{y}\) against the source \(x\) and reference \(y\) using a pretrained cross-lingual encoder. COMET is better suited to capturing semantic-level translation quality improvements: translations that are semantically correct but lexically divergent from the reference can still receive high scores.

The complementary use of both metrics is meaningful: BLEU is a "hard metric" oriented toward surface-level matching, while COMET is a "soft metric" oriented toward semantics. If the reflection mechanism primarily repairs semantic errors rather than lexical choice issues, COMET improvements would be expected to exceed BLEU improvements—which is precisely what is observed in the experiments.
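For concreteness, both metrics can be computed with off-the-shelf tooling. A sketch assuming the sacrebleu and unbabel-comet packages and the wmt22-comet-da checkpoint (the paper does not specify its exact evaluation stack), with toy data:

```python
# pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["Umfundi ufunda incwadi."]         # isiZulu source (toy example)
hypotheses = ["The student reads a book."]       # system output
references = ["The student is reading a book."]  # human reference

# BLEU: n-gram precision with a brevity penalty (surface-level matching).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET: learned metric over (source, hypothesis, reference) triples.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", model.predict(data, batch_size=8, gpus=0).system_score)
```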

Key Experimental Results

Experimental Setup

  • Models: GPT-3.5 (OpenAI) and Claude Haiku 3.5 (Anthropic), both general-purpose LLMs without translation-specific fine-tuning. These two models were selected to verify the model-agnostic nature of the framework.
  • Language Pairs: English↔isiZulu (OPUS-100 dataset) and English↔isiXhosa (NTREX-African dataset). Both languages belong to the Southern Bantu family, featuring noun class systems and rich verbal morphology, and are representative low-resource, morphologically complex languages.
  • Datasets: OPUS-100 provides broad multilingual parallel data, while NTREX-African is a curated evaluation set specifically for African languages. The former offers broad coverage but variable quality; the latter offers higher quality but smaller scale. The English-isiZulu portion of OPUS-100 contains sentence pairs from multiple sources (e.g., JW300, Ubuntu localization, Wikipedia) spanning diverse domains; NTREX-African consists of news-domain texts translated by native speakers, with higher translation quality and a more focused domain. The complementary use of both datasets strengthens the robustness of the experimental conclusions.

Main Results

Reflective Translation produces consistent translation quality improvements across all prompting strategies and both models. Statistical significance test results are as follows:

| Metric | Sample Size N | Median Gain | p-value | Effect Size (r) |
| --- | --- | --- | --- | --- |
| BLEU | 324 | +0.0788 | \(1.45 \times 10^{-44}\) | 0.95 |
| COMET | 457 | +0.1753 | \(1.10 \times 10^{-65}\) | 0.96 |

Using the Wilcoxon signed-rank test (a non-parametric paired test), improvements from first-pass to second-pass are extremely significant for both BLEU and COMET (p-values far below 0.001), with effect sizes (rank-biserial correlation) approaching 1.0, indicating that nearly all sentences benefit from the correction. This is not merely statistically reliable but practically meaningful.

The median COMET improvement (+0.1753) is approximately 2.2 times the median BLEU improvement (+0.0788), consistent with expectations—the reflection mechanism primarily addresses semantic-level errors (e.g., incorrect tense, confused noun class markers, semantic omissions), which COMET captures sensitively but which do not always manifest in n-gram precision matching.
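The significance test and effect size are straightforward to reproduce from paired sentence-level scores. A sketch with toy arrays, assuming the paper's effect size is the matched-pairs rank-biserial correlation (which its near-1.0 values suggest):

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

# Paired sentence-level scores: first pass vs. second (post-reflection) pass.
first  = np.array([0.41, 0.55, 0.38, 0.62, 0.47])  # toy COMET scores
second = np.array([0.63, 0.71, 0.60, 0.66, 0.69])

stat, p = wilcoxon(second, first)  # non-parametric paired test
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.3g}")

# Matched-pairs rank-biserial correlation: (W+ - W-) / (W+ + W-).
diff = second - first
nz = diff[diff != 0]          # zero differences are discarded
ranks = rankdata(np.abs(nz))  # rank the absolute differences
w_pos, w_neg = ranks[nz > 0].sum(), ranks[nz < 0].sum()
r = (w_pos - w_neg) / (w_pos + w_neg)
print(f"Rank-biserial effect size r = {r:.2f}")  # ~1.0 if nearly all improve
```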

Ablation Study

Prompting Strategy Comparison: The three prompting strategies exhibit distinct characteristics when combined with the reflection mechanism:

| Prompting Strategy | BLEU Gain Stability | COMET Gain Stability | Overall Performance |
| --- | --- | --- | --- |
| Zero-shot | Moderate | High | Baseline effective |
| CoT-style | Moderate | High | Marginal difference from zero-shot |
| Few-shot | Highest | Highest | Most stable, most consistent gains |

The few-shot + reflection combination produces the most stable gains. The authors suggest this is because in-context translation examples help the model establish more accurate translation expectations, leading to more precise self-critique and more reliable correction directions.

Confidence Threshold Ablation: A confidence threshold mechanism is introduced whereby the reflection-correction pipeline is applied only to sentences whose initial translation quality falls below a specified threshold:

| Threshold Level | Coverage (eligible %) | Mean BLEU Gain | Mean COMET Gain |
| --- | --- | --- | --- |
| No threshold (all) | 100% | Baseline | Baseline |
| Moderate threshold | Reduced | Increased | Increased |
| Strict threshold | Substantially reduced | Maximum | Maximum |

Stricter thresholds restrict reflection-correction to the lowest-quality translations; while coverage decreases, the average improvement per corrected sentence increases. This indicates that Reflective Translation is fundamentally a targeted correction mechanism—most effective for cases where the initial translation quality is poor, with diminishing marginal returns for translations that are already of reasonable quality.
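In deployment, this threshold mechanism amounts to a simple gate in front of the reflection pipeline. A sketch reusing the helpers above, where `estimate_quality` stands in for a lightweight quality estimator (e.g., a reference-free COMET-QE model); the helper name and gating logic are my assumptions, not part of the paper:

```python
def translate_with_gate(source: str, llm, rake_mask,
                        estimate_quality, threshold: float = 0.6) -> str:
    """Run the full reflection pipeline only when the first pass looks weak."""
    first_pass = extract_translation(llm(
        "Translate the following isiZulu sentence into English. Wrap your "
        "answer in <START_TRANSLATION> and <END_TRANSLATION> tags.\n"
        f"Source: {source}"
    ))

    # Cheap screen: skip the two extra LLM calls for already-good outputs.
    if estimate_quality(source, first_pass) >= threshold:
        return first_pass

    # Low-quality case: pay the ~3x cost where the gains concentrate.
    return reflective_translate(source, llm, rake_mask)["second_pass"]
```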

Key Findings

  • The effect of the reflection mechanism is model-agnostic: Consistent improvement patterns are observed on both GPT-3.5 and Claude Haiku 3.5—architecturally distinct LLMs—demonstrating that the framework's effectiveness does not depend on a specific model. This is of considerable practical importance for deployment.
  • COMET gains exceed BLEU gains: This strongly suggests that reflection primarily repairs semantic-level errors rather than performing simple lexical substitution. Through self-diagnosis, the model better preserves core source semantics and corrects tense/aspect markers and noun class agreement. COMET, as a learned metric, is more sensitive to such semantic improvements, while BLEU's n-gram matching mechanism is insensitive to synonymous substitution.
  • Few-shot + reflection is the optimal combination: In-context examples not only assist translation itself but also help the model evaluate itself more accurately, producing higher-quality reflection and correction. The hypothesized reason is that few-shot examples provide an implicit reference for "what a good translation looks like," making error identification more precise during the reflection stage.
  • The threshold mechanism reveals the targeted nature of reflection: Reflection yields the greatest improvements for low-quality translations and smaller improvements for already-adequate translations—consistent with the intuition that "the worse the translation, the more reflection is needed." This finding has strong engineering implications: in practical deployment, a lightweight quality assessment model (or even COMET itself) can be used to pre-screen low-quality translations and trigger the reflection pipeline only for those, achieving a better trade-off between translation quality and inference cost.
  • Effect sizes are extremely large (r ≈ 0.95–0.96): This means that not only do average metrics improve, but the overwhelming majority of individual samples—rather than a small subset of extreme cases—show actual improvement. In translation research, effect sizes approaching 1.0 on automatic metrics are exceptionally rare, indicating that the improvements due to the reflection mechanism are highly universal and consistent, rather than being inflated by a small number of dramatically improved cases.

Highlights & Insights

  • Theoretical framework for transferring "self-reflection" from reasoning to translation: The authors articulate a key conceptual bridge—translation is not merely "generation" but rather "constrained reasoning" (the target sentence must preserve source semantics). This framing allows methods such as Self-Refine and Reflexion to be naturally applied to the translation setting without fundamental modification. The abstraction is broadly transferable: any generation task where "output must satisfy certain constraints" (e.g., code generation must pass tests, summarization must preserve key facts) may benefit from structured self-reflection. This theoretical perspective of unifying translation and reasoning is itself a valuable contribution.

  • RAKE masking to prevent information leakage is a particularly elegant design: A frequently overlooked problem in self-reflection research is that the reflection text itself may already contain the correct answer. The authors resolve this issue at near-zero cost—requiring no training and no human annotation—through lightweight keyword extraction and masking. This technique is directly transferable to any LLM-based self-reflection generation task: code repair (mask specific code fragments in the reflection, retain error type descriptions), summarization revision (mask specific phrasings, retain structural improvement suggestions), fact verification (mask specific replacement facts, retain error type annotations), and more.

  • The by-product value of the reflection-augmented dataset: The framework naturally produces (source text, first-pass translation, reflection, corrected translation) quadruples during operation. These data are valuable not only for analyzing patterns and regularities of reflection behavior (e.g., which error types are most frequently identified, which corrections are most effective), but also as supervised training data—distilling the reflection process into a smaller model so that it can generate "post-reflection quality" translations in a single forward pass. This "inference-time method → produce training data → distill" paradigm has become increasingly popular (cf. STaR, WizardMath, etc.) and represents an effective pathway for leveraging large-model reasoning capabilities to teach smaller models.

  • Rigor of the statistical analysis: The use of the Wilcoxon signed-rank test (non-parametric, paired) rather than a simple t-test, and the reporting of effect sizes (rank-biserial correlation) rather than p-values alone, is commendably rigorous against a backdrop of translation papers that typically report only average scores.

Limitations & Future Work

  • Limited language coverage: Evaluation is restricted to two Southern Bantu languages (isiZulu and isiXhosa), which are morphologically highly similar (sharing noun class systems and comparable verbal morphology). Whether conclusions generalize to low-resource languages with greater typological divergence (e.g., Tibeto-Burman, Polynesian, sign language translation) remains unclear. Languages with fundamentally different structures (e.g., SOV word order) may require distinct reflection templates.

  • Conservative model selection: Only two medium-scale commercial models—GPT-3.5 and Claude Haiku 3.5—are evaluated. Open-source models (e.g., LLaMA 3, Qwen 2, Mistral), larger-scale models (GPT-4, Claude Sonnet/Opus), and dedicated translation models (NLLB, SeamlessM4T, MADLAD-400) are not assessed. Of particular interest is whether the reflection mechanism remains effective across stronger or weaker models: intuitively, stronger models may already produce higher-quality initial translations, leaving less room for reflection-based improvement, while weaker models may lack sufficient "self-evaluation capacity" to identify their own errors—suggesting that the effectiveness of reflection may require the model to exceed some capability threshold for self-assessment. Furthermore, dedicated multilingual translation models (e.g., NLLB-200) may already outperform general-purpose LLMs on low-resource languages; whether Reflective Translation yields further improvements over such stronger baselines is an important open question.

  • Absence of human evaluation: Both BLEU and COMET are automatic metrics and may miss sociocultural nuances in translation (e.g., appropriate use of formal vs. informal registers, culturally specific expressions, accurate rendering of religious or folkloric terminology). For Bantu languages in particular, noun class errors can cause fundamental semantic shifts (e.g., applying a human noun class to a non-human entity) that COMET may not fully capture.

  • Reflection is fixed at a single round: The current framework performs only one reflection-correction cycle. Whether multiple rounds of reflection (as in the iterative improvement of Self-Refine) could yield further gains—or whether diminishing returns or degradation would set in—is an open question in the translation setting. In particular, multiple rounds of revision may cause the translation to drift from the source semantics ("over-correction"). Whether successive reflection rounds should target different granularities of error (e.g., round 1: semantic completeness; round 2: grammatical correctness; round 3: stylistic naturalness) is also a promising research direction. If multi-round reflection proves effective, a reliable stopping criterion for determining when "the translation is good enough and no further correction is needed" would also be required.

  • Coarseness of RAKE masking: RAKE is based on statistical co-occurrence and may extract insufficient keywords on short sentences (leaving masking inadequate) or over-mask on long sentences (causing excessive information loss from the reflection). More refined masking strategies—such as selective masking based on semantic role labeling—may yield better results.

  • Tripled inference cost: Each translation requires three LLM calls (initial translation + reflection + correction), resulting in API costs and latency roughly three times those of single-pass translation. At scale, the threshold ablation suggests a practical heuristic: use a lightweight model (or even COMET-style quality estimation) to score the initial translation and trigger the full reflection pipeline only for sentences below a quality threshold, balancing quality and efficiency. Another potential optimization is merging reflection and correction into a single prompt (having the model reflect and revise within the same call), though this may reduce the depth and quality of reflection.

  • Failure modes of reflection are unexplored: Although average metrics improve, it is unclear whether cases exist where post-reflection translations are actually worse—and if so, what causes such degradation (incorrect reflection diagnosis, or correctly identified problem but wrong correction direction). Systematic analysis of failure cases would substantially deepen the paper's insight and practical guidance. Moreover, the quality of the reflection text itself merits analysis: are model-generated reflections consistently accurate, or do "spurious reflections" occur (the model fabricates non-existent errors and revises accordingly)?

Comparison with Related Work

  • vs. Self-Refine (Madaan et al., 2023): Self-Refine proposes a general "generate → feedback → refine" iterative framework, but its feedback consists of unstructured free text. The key improvement in Reflective Translation is the introduction of a structured reflection template (error identification + correction guidance + critical content), making the reflection process more controllable and analyzable. Additionally, while Self-Refine allows multiple iterations, Reflective Translation currently performs only one round but addresses the information leakage problem—not specifically handled by Self-Refine—via RAKE masking.

  • vs. Reflexion (Shinn et al., 2023): Reflexion embeds self-reflection within a reinforcement learning framework, using external environmental feedback (e.g., code execution results) to guide reflection. Reflective Translation relies entirely on internal reflection (no external feedback signal), making the framework more lightweight but also more dependent on the model's own evaluation capability. In the translation setting, incorporating external feedback (e.g., round-trip consistency checking, bilingual dictionary lookup) may represent a valuable enhancement.

  • vs. ReflectionLLMMT (Wang et al., 2024): This is a contemporaneous work on translation reflection with many similarities in pipeline design. Key differences are that the present work (a) focuses on low-resource African languages rather than high-resource language pairs, (b) introduces the RAKE masking mechanism to prevent information leakage, and (c) conducts rigorous statistical significance testing with effect size reporting. The two works are complementary and together point toward a clear trend: inference-time self-reflection is emerging as a standard tool for translation quality improvement. ReflectionLLMMT's reflection emphasizes multi-dimensional scoring (fluency, accuracy, etc.), while Reflective Translation's reflection emphasizes error diagnosis—the latter may be more targeted in low-resource settings.

  • vs. Chain-of-Thought (Wei et al., 2022): CoT requires the model to display its reasoning process before generating a final answer, but the reasoning is "forward"—guiding initial generation without retrospective correction. Reflective Translation can be viewed as a "backward extension" of CoT: not only reasoning before generation, but also reviewing and correcting after generation. The experimental finding that CoT-style prompting combined with reflection yields only moderate improvement—less than few-shot + reflection—suggests some degree of functional overlap between CoT and reflection.

  • Connection to broader low-resource NLP directions: The core philosophy of this paper—leveraging a model's existing knowledge for self-improvement at inference time—aligns with an important shift in the low-resource NLP community: from "acquiring more data" to "better utilizing existing model capabilities." This paradigm is equally applicable to other low-resource tasks beyond translation (e.g., low-resource NER, low-resource text classification, post-editing of low-resource ASR outputs). More broadly, any setting where LLM generation quality is suboptimal but the model possesses self-evaluation capability is a potential application domain for the reflection framework.

Rating

  • Novelty: ⭐⭐⭐ The core idea (self-reflection to improve generation quality) builds on prior work; the contributions lie in systematically applying it to low-resource translation and introducing the RAKE masking mechanism. Incremental innovation.
  • Experimental Thoroughness: ⭐⭐⭐ Statistical analysis is rigorous (non-parametric testing + effect sizes), but language coverage is limited (only 2 languages), model selection is conservative (only 2 models), and human evaluation is absent.
  • Writing Quality: ⭐⭐⭐⭐ Structure is clear and argumentation is logically rigorous; mathematical notation is used appropriately; full prompt templates are provided in the appendix, enhancing reproducibility.
  • Value: ⭐⭐⭐ The method is lightweight, plug-and-play, and model-agnostic, with practical applicability to low-resource translation; the reflection-augmented dataset is a valuable by-product; however, the limited experimental scale constrains the generalizability of the conclusions.