DRT: Deep Reasoning Translation via Long Chain-of-Thought¶

Conference: ACL 2025
arXiv: 2412.17498
Code: https://github.com/krystalan/DRT (data and models also released)
Area: LLM Reasoning
Keywords: Machine Translation, Chain-of-Thought, Literary Translation, Multi-Agent, Metaphor Translation

TL;DR¶

This work brings long chain-of-thought (CoT) reasoning to machine translation. A multi-agent framework (Translator \(\to\) Advisor \(\to\) Evaluator) iteratively refines translations of literary sentences containing metaphors and similes, and the refinement traces are synthesized into a 22K-sample long-thought translation dataset. The resulting DRT-14B model outperforms larger models such as QwQ-32B-Preview and DeepSeek-R1-Distill-Qwen-32B on literary translation.

Background & Motivation¶

Background: While o1-like models have achieved breakthroughs in math and coding reasoning through long CoT, the value of long-thought reasoning has not been systematically explored in the field of translation. Existing translation models mostly rely on single-pass generation, lacking an iterative refinement process.

Limitations of Prior Work: (a) Literary sentences containing metaphors and similes cannot be translated well literally; they require understanding the rhetorical intent first and then translating freely; (b) General-purpose DeepSeek-R1 distilled models transfer poorly to translation tasks (e.g., DeepSeek-R1-Distill-Qwen-7B achieves a GEA score of only 43.66); (c) Large-scale long-thought translation training data is lacking.

Key Challenge: Literary translation requires deep semantic understanding and multi-step deliberation, but the one-step generation paradigm of existing models cannot capture this iterative refinement process.

Goal: (a) To construct a long-thought training dataset for literary translation scenarios; (b) To train a specialized DRT model, enabling small models to achieve high-quality literary translation.

Key Insight: Translating metaphors and similes in literature naturally requires multi-step reasoning: "comprehend \(\to\) attempt \(\to\) receive criticism \(\to\) improve", which aligns perfectly with the essence of long-thought reasoning.

Core Idea: Generate long-thought training data for literary translation using multi-agent iterative refinement, and then fine-tune LLMs to internalize this iterative deliberation process.

Method¶

Overall Architecture¶

A three-step pipeline: (1) Literary Mining: From 400 English books in Project Gutenberg, select 63K sentences that contain metaphors or similes and translate poorly when rendered literally; (2) Multi-Agent Iterative Refinement: A Translator, an Advisor, and an Evaluator collaborate to progressively improve each translation until it meets the quality standard; (3) Long-thought Reorganization: Invalid iterations are removed and GPT-4o restructures the multi-agent dialogues into coherent self-reflection narratives, ultimately yielding 22,264 samples.

Key Designs¶

Literary Sentence Filtering (Two-Stage Filtering):
- Function: Filtering sentences requiring deep translation from a corpus of 577.6K sentences.
- Mechanism: Qwen2.5-72B-Instruct answers two questions per sentence. Q1: Does it contain metaphors/similes? (keep if "yes"). Q2: Can a literal translation satisfy a native speaker? (keep if "no").
- Design Motivation: Only sentences that truly require deliberation are worth applying long-thought reasoning to, whereas ordinary sentences can be translated literally.
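The two-question filter can be sketched as below. This is a minimal illustration, not the authors' code: the `judge` callable stands in for a call to Qwen2.5-72B-Instruct, and the question wording, the `filter_sentences` name, and the keyword-based `toy_judge` are all assumptions made for the example.

```python
from typing import Callable, Iterable, List

# Hypothetical judge signature: (question, sentence) -> "yes" or "no".
# In the described pipeline this role is played by an LLM; here it is injected.
Judge = Callable[[str, str], str]

Q1 = "Does the sentence contain a metaphor or simile?"
Q2 = "Would a literal translation satisfy a native speaker?"


def filter_sentences(sentences: Iterable[str], judge: Judge) -> List[str]:
    """Keep sentences with Q1 == "yes" and Q2 == "no"."""
    kept = []
    for sentence in sentences:
        if judge(Q1, sentence).strip().lower() != "yes":
            continue  # stage 1: no figurative language, drop
        if judge(Q2, sentence).strip().lower() != "no":
            continue  # stage 2: a literal translation is good enough, drop
        kept.append(sentence)
    return kept


def toy_judge(question: str, sentence: str) -> str:
    """Keyword-based stand-in for the LLM judge (illustrative only)."""
    if question == Q1:
        return "yes" if " like " in sentence or " as " in sentence else "no"
    return "no" if "heart" in sentence else "yes"


candidates = [
    "Her heart sank like a stone.",
    "He ran like the wind.",
    "The door was open.",
]
selected = filter_sentences(candidates, toy_judge)
print(selected)
```

Running the two questions in sequence means the second, more subjective question is only asked about sentences that already passed the first.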
Multi-Agent Iterative Refinement:
- Function: Three agents collaborate to iteratively improve translations.
- Mechanism: Phase 1 – The Translator performs keyword-level translation (decomposing sub-problems); Phase 2 – Generating the initial full translation \(t^0\); Phase 3 – Iterative loop: the Advisor evaluates \(t^{k-1}\) and provides feedback \(f^{k-1}\), the Evaluator assigns a score \(s^{k-1}\), and the Translator generates a refined translation \(t^k\) based on the feedback and score. This loop stops when the score threshold or maximum iteration count is reached.
- Design Motivation: Simulating how human translators deliberate: initial translation \(\to\) review \(\to\) revision \(\to\) re-review.
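The three phases can be written as a short control loop. The sketch below is an assumption-laden illustration: the agent callables (`translate_keywords`, `draft`, `advise`, `evaluate`, `revise`), the `Round` record, and the toy lambdas are hypothetical stand-ins for LLM calls, and the exact threshold and iteration cap are left as parameters because the summary does not state them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Round:
    translation: str  # t^{k-1}
    feedback: str     # f^{k-1} from the Advisor
    score: float      # s^{k-1} from the Evaluator


def refine(
    source: str,
    translate_keywords: Callable[[str], Dict[str, str]],
    draft: Callable[[str, Dict[str, str]], str],
    advise: Callable[[str, str], str],
    evaluate: Callable[[str, str], float],
    revise: Callable[[str, str, str, float], str],
    threshold: float,
    max_iters: int,
) -> List[Round]:
    keywords = translate_keywords(source)   # Phase 1: keyword-level translation
    translation = draft(source, keywords)   # Phase 2: initial translation t^0
    history: List[Round] = []
    for k in range(max_iters):              # Phase 3: iterative loop
        feedback = advise(source, translation)
        score = evaluate(source, translation)
        history.append(Round(translation, feedback, score))
        # Stop on reaching the score threshold or the iteration cap.
        if score >= threshold or k == max_iters - 1:
            break
        translation = revise(source, translation, feedback, score)
    return history


# Toy stand-ins for the agents: each revision raises the score by one point.
rounds = refine(
    "Her heart sank like a stone.",
    translate_keywords=lambda src: {"sank like a stone": "placeholder"},
    draft=lambda src, kw: "v0",
    advise=lambda src, t: "make the imagery more natural",
    evaluate=lambda src, t: 6.0 + int(t[1:]),
    revise=lambda src, t, f, s: f"v{int(t[1:]) + 1}",
    threshold=8.0,
    max_iters=5,
)
print([(r.translation, r.score) for r in rounds])
```

Keeping every round in `history`, rather than only the last translation, is what makes the later reorganization step possible.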
Long-thought Reorganization:
- Function: Converting multi-agent dialogues into a coherent, long-thought format that a single model can learn.
- Mechanism: Eliminating invalid iteration rounds where the score does not improve, filtering out samples with fewer than 3 valid refinement steps, and using GPT-4o to rewrite multi-agent dialogues into coherent self-reflection narratives. The final output is chosen as the translation with the highest score (not necessarily the last round).
- Design Motivation: The multi-agent dialogue format is unsuitable for SFT of a single model, requiring standardization into a "thought process \(\to\) final translation" format.
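The rule-based part of this step (before the GPT-4o rewrite) might look like the sketch below. Two things are assumptions rather than details from the summary: a round counts as valid only if its score strictly exceeds the best score so far, and "refinement steps" are counted as the kept rounds after the initial draft. The function names and the `min_steps` parameter are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

ScoredRound = Tuple[str, float]  # (translation, evaluator score)


def keep_improving(rounds: List[ScoredRound]) -> List[ScoredRound]:
    """Drop rounds whose score does not improve on the best score so far."""
    kept: List[ScoredRound] = []
    best = float("-inf")
    for translation, score in rounds:
        if score > best:
            kept.append((translation, score))
            best = score
    return kept


def build_sample(rounds: List[ScoredRound], min_steps: int = 3) -> Optional[Dict]:
    """Return the cleaned trace and final translation, or None if filtered out."""
    kept = keep_improving(rounds)
    steps = len(kept) - 1  # assumption: the initial draft is not a refinement step
    if steps < min_steps:
        return None
    final_translation, _ = max(kept, key=lambda r: r[1])
    return {"trace": kept, "final": final_translation}


# The last round (t4) scores lower than t3, so t3 becomes the final output.
sample = build_sample([("t0", 5), ("t1", 6), ("t2", 7), ("t3", 9), ("t4", 8)])
print(sample)

# Only one improving step after the draft: the sample is discarded.
print(build_sample([("a", 5), ("b", 7), ("c", 6)]))
```

The cleaned trace would then be handed to the rewriting model to produce a single first-person thought process ending in the final translation.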

Dataset Statistics¶

22,264 samples (19,264 training / 1,000 validation / 2,000 testing)
Average thought tokens: 527.64, average refinement steps: 4–5 steps
73.22% of samples contain at least 3 refinement steps

Loss & Training¶

Full-parameter SFT based on Qwen2.5-7B/14B and LLaMA-3.1-8B.
Utilizing the LLaMA-Factory framework, with vLLM accelerating inference.

Key Experimental Results¶

Main Results (English \(\to\) Chinese Literary Translation)¶

Model	GEA ↑	GRF ↑	CometKiwi ↑	BLEU ↑
Qwen2.5-14B-Instruct	70.86	84.74	72.01	30.23
QwQ-32B-Preview	75.50	86.31	71.48	27.46
DeepSeek-R1-Distill-Qwen-32B	71.88	84.78	71.93	29.36
Qwen2.5-14B-SFT (w/o CoT)	74.53	85.66	72.08	37.63
DRT-14B	77.41	87.19	72.11	36.46

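DRT-14B outperforms both 32B-scale models on the reference-free metrics GEA (77.41 vs. 75.50 and 71.88) and GRF (87.19 vs. 86.31 and 84.78), while CometKiwi scores differ by well under one point across all listed systems.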

Human Evaluation (200 samples, Best-Worst Scaling)¶

Model	Fluency	Semantic Accuracy	Literariness
Qwen2.5-14B-Instruct	-0.353	-0.363	-0.442
QwQ-32B-Preview	-0.063	0.022	-0.007
Qwen2.5-14B-SFT	0.103	0.108	0.087
DRT-14B	0.313	0.233	0.362

DRT-14B leads substantially in literariness (0.362 vs. 0.087 for Qwen2.5-14B-SFT and -0.007 for QwQ-32B-Preview).
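For readers unfamiliar with Best-Worst Scaling, the sketch below uses the standard counting formula, score = (times chosen best - times chosen worst) / times shown, which ranges from -1 to 1. Whether the paper uses exactly this variant is an assumption, although each column of the table above sums to zero, as this formula implies when every judgment ranks the same set of systems. The judgments in the example are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

# One judgment: (systems shown to the annotator, system picked best, system picked worst)
Judgment = Tuple[Sequence[str], str, str]


def bws_scores(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Best-Worst Scaling: (#best - #worst) / #appearances for each system."""
    best, worst, shown = Counter(), Counter(), Counter()
    for systems, picked_best, picked_worst in judgments:
        shown.update(systems)
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


systems = ("A", "B", "C")
toy_judgments = [
    (systems, "A", "C"),
    (systems, "A", "B"),
    (systems, "B", "C"),
    (systems, "A", "C"),
]
scores = bws_scores(toy_judgments)
print(scores)
```

A positive score means a system was picked as best more often than as worst; a score near zero means the two roughly cancel.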

Ablation Study¶

Configuration	GEA	Description
DRT-7B	75.05	Full model
Qwen2.5-7B-SFT	72.29	w/o long-thought, drop 2.76
DRT-14B	77.41	Full model
Qwen2.5-14B-SFT	74.53	w/o long-thought, drop 2.88
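The "drop" figures are plain differences between each DRT model and its SFT counterpart trained without long thought. The snippet below recomputes them from the GEA values in this table and, for the 14B pair, the BLEU values from the main results table; the dictionary layout and variable names are illustrative only.

```python
# GEA scores copied from the ablation table.
gea = {
    "DRT-7B": 75.05,
    "Qwen2.5-7B-SFT": 72.29,
    "DRT-14B": 77.41,
    "Qwen2.5-14B-SFT": 74.53,
}
# BLEU scores for the 14B pair, copied from the main results table.
bleu = {"DRT-14B": 36.46, "Qwen2.5-14B-SFT": 37.63}

gea_gain_7b = round(gea["DRT-7B"] - gea["Qwen2.5-7B-SFT"], 2)
gea_gain_14b = round(gea["DRT-14B"] - gea["Qwen2.5-14B-SFT"], 2)
bleu_change_14b = round(bleu["DRT-14B"] - bleu["Qwen2.5-14B-SFT"], 2)

print(f"GEA gain from long thought, 7B:   {gea_gain_7b:+.2f}")
print(f"GEA gain from long thought, 14B:  {gea_gain_14b:+.2f}")
print(f"BLEU change from long thought, 14B: {bleu_change_14b:+.2f}")
```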

Key Findings¶

Long-thought reasoning significantly improves reference-free metrics but may lower BLEU: relative to Qwen2.5-14B-SFT (w/o CoT), DRT-14B gains 2.88 GEA but loses 1.17 BLEU, because reasoning leads to freer translations that deviate from the reference translation.
Evaluator accuracy of 92.5%: Far exceeding CometKiwi (56%), demonstrating the effectiveness of LLM-as-evaluator in literary translation evaluation.
Inference cost (about 12× slower): Long-thought translation is 11.9–13.9 times slower than standard translation, making it suitable only for scenarios that demand high quality.
Diminishing returns in refinement steps: The largest modification occurs from Step 0 \(\to\) 1 (21.44 characters), with subsequent steps showing progressively smaller changes.

Highlights & Insights¶

The o1 Paradigm in Translation: This work is the first to systematically introduce long CoT reasoning into translation, demonstrating that reasoning is indeed beneficial in literary translation scenarios that require deep semantic understanding.
Multi-agent Dialogue to Single-model Long-thought Data Conversion: Generating high-quality refinement processes using multi-agents first, and then reorganizing them into formats trainable for a single model. This data-synthesis paradigm can be transferred to other tasks requiring iterative refinement (such as academic writing and code review).
14B Model Outperforming 32B: DRT-14B exceeds QwQ-32B-Preview and DeepSeek-R1-Distill-Qwen-32B on all four automatic metrics in the main results table, suggesting that for literary translation, domain-specific long-thought data matters more than general reasoning capability.

Limitations & Future Work¶

English-to-Chinese only: Other language pairs have not been evaluated.
Applicable only to literary translation: Ordinary translation does not require long-thought reasoning, making the 12× inference cost impractical.
Unreliable automatic evaluation: BLEU and COMET correlate poorly with translation quality in literary translation, leading to a heavy reliance on human evaluation.
High data-synthesis cost: Involves multi-agent iteration combined with GPT-4o restructuring.
Potential improvements: (a) Expanding to other translation scenarios requiring deliberation (e.g., legal or medical); (b) Training a lightweight evaluator to replace the 72B model.

Comparison with Related Work¶

vs. Marco-O1: Marco-O1 is a general o1-style reasoning model and performs moderately on translation (GEA 64.24); in contrast, DRT employs specialized data and training tailored for translation.
vs. DeepSeek-R1 Distillation: General-purpose reasoning-distilled models underperform even the baselines on translation tasks (e.g., DeepSeek-R1-Distill-Qwen-7B achieves a GEA of only 43.66), indicating that reasoning capabilities do not transfer directly to translation.
vs. GPT-4o: GPT-4o (GEA 71.88) < DRT-14B (77.41), demonstrating that domain-specialized smaller models can outperform general large models.

Rating¶

Novelty: ⭐⭐⭐⭐ First work to systematically apply long CoT reasoning to translation, featuring a cleverly designed data-synthesis pipeline.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional validation combining automatic metrics, human evaluation, ablation studies, and comparisons with commercial models.
Writing Quality: ⭐⭐⭐⭐ Clear data-synthesis process and comprehensive evaluation.
Value: ⭐⭐⭐⭐ Practical contributions to both literary translation and the long-thought paradigm, with both data and models open-sourced.