SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation

Conference: ACL 2025
arXiv: 2505.20622
Code: None
Area: Multilingual Translation
Keywords: Simultaneous Machine Translation, Policy Optimization, GRPO, Multi-step Decision Making, Delay-Quality Trade-off

TL;DR

Modeling simultaneous machine translation (SiMT) as a multi-step sequential decision-making problem, this paper proposes the SeqPO-SiMT policy optimization framework. By fusing reward signals of translation quality and latency, it achieves performance on a 7B LLM that is comparable to strong offline translation models.

Background & Motivation

Simultaneous machine translation (SiMT) generates translations in real time as the source text stream arrives, which is widely applied in scenarios such as simultaneous interpretation. LLM-based SiMT approaches typically employ Supervised Fine-Tuning (SFT) on partial translation data, but they suffer from the following issues:

Poor SFT Data Quality: Partial translation data is often generated by heuristics or attention alignment tools, introducing significant noise.

Incompatible RLHF Methods: Existing methods like PPO and DPO are primarily designed for single-step tasks. In contrast, SiMT is a multi-step sequential decision-making process—at each step, the model receives new source text chunks, decides whether to translate or wait, and previous translations affect subsequent outcomes.

Complex Multi-step Dependencies: Since the source text arrives incrementally, ambiguity is common (e.g., "bark" might require the subsequent word "tree" to disambiguate). Mis-translations in previous steps can cascade and degrade the overall output quality.

Core Motivation: There is a need for a policy optimization method that can model multi-step dependencies while concurrently optimizing both quality and latency.

Method

Overall Architecture

SeqPO-SiMT defines SiMT as a sequential decision-making process:

Environment: The source sentence $\mathbf{x}$ is divided into $T$ chunks (each containing $m$ words) and released incrementally.
Policy: The LLM serves as the policy model $\pi_\theta$. At each step $t$, it generates translation segments based on the available source text and previous translation history.
Sampling: $B$ complete translation trajectories are sampled per source sentence through multi-step generation.
Reward: Rewards are computed at the final step based on both quality and latency.
Optimization: Optimization is performed using Group Relative Policy Optimization (GRPO) policy gradients.

Generation process at each step: $$\hat{y}_t \sim \pi_\theta(\hat{y}_t | x_1, \cdots, x_t, \hat{y}_1, \cdots, \hat{y}_{t-1})$$

Key characteristic: The model autonomously determines whether to translate and how much content to translate (the translation length is flexible, and the model can choose to wait for subsequent context).
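The sequential process above can be illustrated with a minimal sketch. It assumes a hypothetical `Policy` callable standing in for the LLM $\pi_\theta$, and it represents "wait" as an empty segment; neither the function names nor the toy policy come from the paper.

```python
from typing import Callable, List

# Hypothetical signature: (source words seen so far, segments emitted so far) -> new segment.
# An empty string means the policy chose to wait for more context.
Policy = Callable[[List[str], List[str]], str]


def chunk_source(words: List[str], m: int) -> List[List[str]]:
    """Split the source into T chunks of at most m words each."""
    return [words[i:i + m] for i in range(0, len(words), m)]


def rollout(words: List[str], m: int, policy: Policy) -> List[str]:
    """Run one trajectory: reveal one chunk per step and collect the emitted segments."""
    seen: List[str] = []
    segments: List[str] = []
    for chunk in chunk_source(words, m):
        seen.extend(chunk)
        # The policy conditions on x_1..x_t and on y_1..y_{t-1}.
        segments.append(policy(list(seen), list(segments)))
    return segments


def toy_policy(seen: List[str], history: List[str]) -> str:
    """Stand-in for the LLM: wait on the first step, then uppercase every unhandled word."""
    if len(history) == 0:
        return ""
    done = sum(len(h.split()) for h in history)
    return " ".join(w.upper() for w in seen[done:])


segments = rollout("the bark of the tree is rough".split(), m=2, policy=toy_policy)
```

In this toy run the policy waits on the first chunk, so "bark" is only emitted once more context has arrived, which mirrors the flexible translate-or-wait behavior described above.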

Key Designs

Key Design 1: Blended Reward

Translation quality and latency have different scales and exhibit a trade-off. The authors design a normalized and truncated fused reward:

Quality Normalization: $$q^i = \frac{\hat{q}^i - \text{mean}(\{\hat{q}^1, \cdots, \hat{q}^B\})}{\text{std}(\{\hat{q}^1, \cdots, \hat{q}^B\})}$$

Latency Normalization and Truncation: $$L^i = \max\left(m, \frac{\hat{L}^i - \text{mean}(\{\hat{L}^1, \cdots, \hat{L}^B\})}{\text{std}(\{\hat{L}^1, \cdots, \hat{L}^B\})}\right)$$

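The truncation threshold $m$ (described as the chunk size) acts as a floor: a trajectory whose normalized latency falls below $m$ is assigned $m$, so driving latency even lower earns no additional reward. This keeps the model from over-optimizing the latency metric.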

Final Reward: $r_T^i = \lambda q^i - L^i$, where $\lambda=2$ balances quality and latency.
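The three formulas above can be combined in a short sketch. It assumes population standard deviation, a small `eps` to avoid division by zero, and an illustrative `floor` value for the truncation threshold; all three are assumptions made for the example, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def zscore(values: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize values within one group of B trajectories (eps is an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def blended_rewards(
    quality: List[float],
    latency: List[float],
    lam: float = 2.0,     # lambda = 2 as stated in the text
    floor: float = -1.0,  # illustrative truncation threshold, not the paper's value
) -> List[float]:
    """Return r_T^i = lam * q^i - L^i for every trajectory in the group."""
    q = zscore(quality)
    # Truncation: normalized latency below the floor is clamped to the floor.
    lat = [max(floor, z) for z in zscore(latency)]
    return [lam * qi - li for qi, li in zip(q, lat)]


# Toy group of B = 5 trajectories: raw quality scores and raw latencies.
rewards = blended_rewards(
    quality=[0.80, 0.82, 0.84, 0.86, 0.88],
    latency=[2.0, 4.0, 6.0, 8.0, 10.0],
)
```

In the toy group the fastest trajectory has a normalized latency below the floor, so its latency term is clamped and it gains nothing further from being even faster.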

Key Design 2: GRPO Optimization

Reasons for choosing GRPO over PPO:
1. Resource Efficiency: GRPO uses the within-group average as the baseline, whereas PPO requires an additional critic model. SiMT would require two critics, one for quality and one for latency, leading to unaffordable memory costs.
2. Accuracy: Latency is a rule-based metric, so a neural reward model in PPO would introduce unnecessary noise.

Loss & Training

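The objective function combines the final-step reward $r_T$ with a KL constraint against the reference policy $\pi_{\text{ref}}$ (here $x_{1:t}$ abbreviates $x_1, \cdots, x_t$, and $\hat{y}_{1:t-1}$ abbreviates $\hat{y}_1, \cdots, \hat{y}_{t-1}$):

$$J(\pi_\theta) = \mathbb{E}\sum_{t=1}^{T}\left[r_T - \beta \log \frac{\pi_\theta(\hat{y}_t | x_{1:t}; \hat{y}_{1:t-1})}{\pi_{\text{ref}}(\hat{y}_t | x_{1:t}; \hat{y}_{1:t-1})}\right]$$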
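As a worked illustration of the bracketed term, the sketch below evaluates the sum for a single sampled trajectory. It assumes that each step's log-probability is the total log-probability of the segment $\hat{y}_t$ under the policy and under the reference model; the function name and the numbers are hypothetical, and the sketch leaves out everything else a GRPO implementation involves (group baselines, importance ratios, clipping).

```python
from typing import List


def trajectory_objective(
    final_reward: float,
    logp_policy: List[float],
    logp_ref: List[float],
    beta: float,
) -> float:
    """Sum over steps t of [r_T - beta * (log pi_theta - log pi_ref)] for one trajectory."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("policy and reference must cover the same steps")
    return sum(
        final_reward - beta * (lp - lr)
        for lp, lr in zip(logp_policy, logp_ref)
    )


# Toy trajectory with T = 3 steps and a final reward of 1.0.
value = trajectory_objective(
    final_reward=1.0,
    logp_policy=[-2.0, -1.0, -3.0],
    logp_ref=[-2.5, -1.0, -2.0],
    beta=0.1,
)
```

The same final reward $r_T$ is credited to every step, while the log-ratio term penalizes steps where the policy assigns more probability than the reference model and rewards steps where it assigns less.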

Training pipeline:
1. Warm up with SFT on 40K samples (constructed partial translation data).
2. Optimize using SeqPO.
3. Use COMET for the quality reward and Average Lagging (AL) for the latency reward.
4. The backbone model is Qwen-2.5-7B, with $B=5$, $\beta=0.02$ for En→Zh, and $\beta=0.1$ for Zh→En.
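For readers unfamiliar with the latency reward, the sketch below computes Average Lagging following the standard definition from Ma et al. (2019): $\text{AL} = \frac{1}{\tau}\sum_{t=1}^{\tau}\left[g(t) - \frac{t-1}{\gamma}\right]$, where $g(t)$ is the number of source words read before target word $t$ is emitted, $\gamma$ is the target-to-source length ratio, and $\tau$ is the first target position emitted after the full source has been read. Whether the paper counts words or tokens, and how it derives $g(t)$ from emitted segments, is not stated here, so the word-level interface is an assumption.

```python
from typing import List


def average_lagging(g: List[int], src_len: int) -> float:
    """Average Lagging; g[t-1] is the number of source words read before target word t."""
    if not g or src_len <= 0:
        raise ValueError("need a non-empty target and a positive source length")
    gamma = len(g) / src_len  # target-to-source length ratio
    # tau: 1-based index of the first target word emitted after the full source was read
    # (falls back to the last target word if the source is never fully read).
    tau = next((t for t, read in enumerate(g, start=1) if read >= src_len), len(g))
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau


# A wait-2 schedule on a 4-word source with a 4-word target lags by 2 words.
al_wait2 = average_lagging([2, 3, 4, 4], src_len=4)
# Offline translation reads the whole source before emitting anything.
al_offline = average_lagging([4, 4, 4, 4], src_len=4)
```

Lower AL means the translation trails the source less, which is why the latency term enters the reward with a negative sign.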

Key Experimental Results

Main Results

Detailed Zh→En SiMT results (extracted from Table 2, low-latency setting):

Dataset	Method	BLEURT↑	COMET↑	GPT-4↑	AL↓
REALSI	SFT	64.14	83.49	83.24	15.10
REALSI	SFT+wait-k	59.37	79.60	78.90	16.75
REALSI	SeqPO-SiMT	65.93	84.23	85.49	14.14
NEWS	SFT	65.01	84.34	86.02	10.18
NEWS	SeqPO-SiMT	66.67	85.17	87.67	9.29

En→Zh SiMT results (extracted from Table 3, low-latency setting):

Dataset	Method	BLEURT↑	COMET↑	GPT-4↑	AL↓
MUSTC	SFT	65.84	86.75	91.84	5.71
MUSTC	SeqPO-SiMT	66.76	87.55	92.10	5.00
NEWS	SFT	61.12	85.54	90.99	5.02
NEWS	SeqPO-SiMT	63.37	87.41	91.63	4.43

Average COMET improvement: +1.3 for low latency, +1.25 for high latency (a COMET improvement of 1 point is generally considered significant). On NEWS En→Zh, the paper reports that COMET increases by 1.13 while AL decreases by 6.17; these two figures do not correspond to the low-latency rows excerpted above.
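The per-dataset gains in the rows excerpted above can be checked with simple arithmetic. The sketch below only re-uses the SFT and SeqPO-SiMT numbers from the two tables; the dictionary layout and key names are chosen for this example.

```python
# (direction, dataset) -> ((SFT COMET, SFT AL), (SeqPO-SiMT COMET, SeqPO-SiMT AL))
rows = {
    ("zh-en", "REALSI"): ((83.49, 15.10), (84.23, 14.14)),
    ("zh-en", "NEWS"): ((84.34, 10.18), (85.17, 9.29)),
    ("en-zh", "MUSTC"): ((86.75, 5.71), (87.55, 5.00)),
    ("en-zh", "NEWS"): ((85.54, 5.02), (87.41, 4.43)),
}

# Positive COMET delta = better quality; negative AL delta = lower latency.
deltas = {
    key: (round(seqpo[0] - sft[0], 2), round(seqpo[1] - sft[1], 2))
    for key, (sft, seqpo) in rows.items()
}

for (direction, dataset), (d_comet, d_al) in deltas.items():
    print(f"{direction} {dataset}: COMET {d_comet:+.2f}, AL {d_al:+.2f}")
```

In every excerpted row SeqPO-SiMT improves COMET and lowers AL at the same time. The excerpt covers only four low-latency rows, so its mean differs from the averages quoted above, which are taken over the paper's full set of datasets.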

Comparison with Offline Translation

Comparison of SeqPO-SiMT (SiMT) vs. strong models (offline translation) (extracted from Table 4):

Model	Mode	COMET
Qwen2.5-7B-Instruct	Offline	86.49
LLaMA3-8B-Instruct	Offline	83.78
SFT	Offline	86.94
SeqPO-SiMT	SiMT	87.55

SeqPO-SiMT under the SiMT setting (viewing only partial source text) outperforms the offline translation results of multiple models (viewing the complete source text).

Ablation Study

Training Dynamics Analysis (Figure 4a): During training, AL drops rapidly first (the model quickly learns to reduce latency), followed by a gradual increase and stabilization of COMET. The model fits the simpler latency objective first before optimizing the translation quality.

Quality-Only Optimization Comparison (Figure 4b): Compared with optimizing the quality reward alone, SeqPO-SiMT achieves lower latency at the same translation quality. This indicates that the fused reward improves quality and latency together rather than trading one for the other.

Key Findings

Traditional RLHF (PPO/DPO) cannot effectively model the multi-step dependencies in SiMT.
The normalized and truncated reward fusion strategy successfully balances quality and latency.
GRPO is more suitable for the SiMT scenario than PPO (resource efficiency + metric accuracy).
SiMT can reach the level of offline translation given sufficient policy optimization.

Highlights & Insights

Multi-step RL Modeling for SiMT: Reformulates SiMT from an SFT data quality problem to a policy optimization problem, bypassing the limitations of alignment tools.
Careful Reward Design: Normalization unifies the metric scales, truncation prevents overfitting to latency, and $\lambda$ controls the trade-off.
Sensible Optimizer Choice: Selecting GRPO avoids the memory cost of the dual critics that PPO would need, and it exploits the fact that latency is a rule-based metric.
Impressive Experimental Results: SiMT (with partial context) manages to perform on par with or even exceed offline translation (with full context).

Limitations & Future Work

Only verified on Chinese-English bilingual translation, without evaluating performance on more diverse language pairs.
The chunk size $m$ is a fixed hyperparameter; adaptive chunking may yield better results.
Training requires reference translations to calculate the COMET reward, limiting its applicability to unsupervised scenarios.
Inference latency and throughput are not reported; multi-step sampling might impact training efficiency.
The wait-k baseline is relatively weak, lacking comparisons with more advanced SiMT methods (e.g., adaptive wait).

Inspiration & Association

GRPO / DeepSeek-R1 (Shao et al., 2024; DeepSeek-AI, 2025): The foundation of the optimization method in SeqPO.
SiMT with LLMs (Cheng et al., 2024; Koshkin et al., 2024): Defines the SFT paradigm of LLM-based SiMT, on top of which SeqPO adds policy optimization.
Wait-k (Ma et al., 2019): A classic SiMT strategy and one of the baselines used in this paper.

Rating

Novelty: 4/5 — Introduces multi-step RL into LLM optimization for SiMT, presenting a fresh perspective.
Technical Depth: 4/5 — Well-thought-out reward design, optimization selection, and training strategies.
Experimental Thoroughness: 4/5 — Evaluates on 6 datasets with multiple metrics, offline comparisons, and training dynamics analyses.
Practical Value: 3/5 — Effective method, but training carries high costs, and the code is not open-sourced.