Context Augmented Token-Level Post-Editing for Human Interpreting

Conference: ACL 2025
Area: Multilingual Translation
Keywords: Interpreting Post-Editing, Context Augmentation, Token-Level Editing, Simultaneous Interpreting, Automatic Error Correction
Code: None

TL;DR

This paper proposes a context-augmented token-level post-editing method that uses dialogue context to make fine-grained corrections to automatic speech recognition (ASR) transcripts of human interpreting. On EPIC-Interpreting (en-it), it lowers TER from 38.7 to 28.3 and raises BLEU from 42.1 to 52.4 relative to unedited transcripts, while preserving the fluency of the interpreter's output.

Background & Motivation

Background: Human interpreting is a core component of cross-lingual communication. Interpreters inevitably introduce grammatical errors, omissions, and disfluencies during real-time translation. Although current automatic speech recognition (ASR) systems can convert interpreted speech to text, the cascading errors of ASR outputs combined with the interpreter's own disfluency make the final transcription quality insufficient for downstream applications.

Limitations of Prior Work: Traditional automatic post-editing (APE) methods mostly operate at the sentence level, rewriting or editing the entire sentence. This coarse-grained approach faces two challenges: (1) high computational overhead, making it unsuitable for real-time or near-real-time scenarios; and (2) a tendency to over-edit, altering the interpreter's original expressive style and intent. Existing token-level methods, although more efficient, do not fully utilize context information, failing to distinguish among ASR errors, interpreting errors, and intentional simplifications.

Key Challenge: Token-level editing aims for precise local corrections, but identifying interpreting errors often requires broader context—the same token might be erroneous or correct depending on the context. Generating sufficient context information while maintaining token-level efficiency remains a critical challenge.

Goal: To design a context-aware token-level post-editing framework capable of: (1) precisely identifying tokens that require editing; (2) leveraging dialogue history and source-language information to make correct editing decisions; and (3) preserving the interpreter's personal expressive style.

Key Insight: The authors observe that error types in interpreting are highly context-dependent—terminology errors require referring to the source language, grammatical errors require local context, and omissions require referring to the dialogue history. Consequently, they propose injecting multi-level context information into the token-level editing decisions.

Core Idea: By utilizing a context-augmented token classifier, the method fuses dialogue history, source-language segments, and local grammatical constraints into token-level editing signals, enabling precise yet conservative corrections of interpreting transcripts.

Method

Overall Architecture

The system takes ASR transcripts and optional source-language text as input and outputs corrected transcripts. The pipeline consists of three stages: (1) context encoding, which separately encodes the current sentence, the dialogue history, and the source-language text; (2) token-level editing decision, which predicts one editing operation (KEEP/REPLACE/DELETE/INSERT) for each token; and (3) editing execution, which generates the final output from the predicted operations.
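The editing execution stage can be illustrated with a minimal sketch. It assumes one predicted operation per token and follows the summary's convention that INSERT adds the target token after the current position; the function name `apply_edits` and the example sentence are hypothetical, not taken from the paper.

```python
from typing import List, Optional, Tuple

# Each edit is (operation, target_token); the target is used only by REPLACE and INSERT.
Edit = Tuple[str, Optional[str]]


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Execute one predicted operation per input token and return the edited tokens."""
    if len(tokens) != len(edits):
        raise ValueError("expected exactly one edit per token")
    output: List[str] = []
    for token, (op, target) in zip(tokens, edits):
        if op == "KEEP":
            output.append(token)
        elif op == "DELETE":
            continue  # drop the token
        elif op in ("REPLACE", "INSERT"):
            if target is None:
                raise ValueError(f"{op} needs a target token")
            if op == "REPLACE":
                output.append(target)
            else:
                # INSERT keeps the current token and adds the target after it.
                output.extend([token, target])
        else:
            raise ValueError(f"unknown operation: {op}")
    return output


# Hypothetical transcript with a repetition, a misspelling, and an omission.
tokens = ["the", "the", "comission", "approved", "proposal"]
edits = [
    ("KEEP", None),
    ("DELETE", None),
    ("REPLACE", "commission"),
    ("INSERT", "the"),
    ("KEEP", None),
]
corrected = apply_edits(tokens, edits)
print(" ".join(corrected))  # the commission approved the proposal
```

Because every operation is local to one token, unedited tokens pass through verbatim, which is what lets this style of editing leave the interpreter's wording intact.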

Key Designs

Multi-level Context Encoder:
- Function: Encodes context information at different granularities into a unified representation.
- Mechanism: Uses pre-trained language models to separately encode the current sentence, the previous \(k\) turns of dialogue history, and corresponding source-language segments. Cross-attention is employed to fuse these three types of information. The representation of each token in the current sentence dynamically aggregates relevant information from the history and the source language based on attention weights.
- Design Motivation: Identifying interpreting errors requires diverse contextual clues—terminology errors rely on source-language alignment, grammatical errors rely on local context, and omissions rely on dialogue history. The multi-level design allows the model to selectively leverage different contexts depending on the error type.
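The fusion step can be sketched in plain Python as single-head scaled dot-product attention, where each current-sentence token vector queries the history and source vectors. This is only an illustration: the toy 2-dimensional vectors, the absence of learned projections, and the residual sum in `fuse` are assumptions, since the summary does not specify how the three streams are combined.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention; context vectors act as both keys and values."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return attended


def fuse(current: List[Vector], history: List[Vector], source: List[Vector]) -> List[Vector]:
    """Add history- and source-derived context to each current-token vector (assumed residual sum)."""
    from_history = cross_attend(current, history)
    from_source = cross_attend(current, source)
    return [
        [c + h + s for c, h, s in zip(cur, hist, src)]
        for cur, hist, src in zip(current, from_history, from_source)
    ]


# Toy encodings: two current tokens, three history tokens, one source token.
current = [[1.0, 0.0], [0.0, 1.0]]
history = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
source = [[0.0, 2.0]]
fused = fuse(current, history, source)
print(fused)
```

Each fused vector keeps the shape of the original token vector, so a per-token classification head can be placed directly on top of it.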
Token-level Editing Classifier:
- Function: Predicts the most appropriate editing operation for each token.
- Mechanism: A classification head on top of the fused context representation assigns each token one of four operations: KEEP, REPLACE (substitute a target token), DELETE, and INSERT (add a target token after this position). For REPLACE and INSERT, an additional generation head predicts the target token. The classification loss is a weighted cross-entropy that places higher weight on the minority classes (DELETE, INSERT).
- Design Motivation: Compared to sentence-level rewriting, token-level operations are more precise and preserve the interpreter's original expressions to the maximum extent. The four classes of operations cover the most common error patterns in interpreting.
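The class-weighted cross-entropy used by the classifier can be sketched as follows. The weight values are purely illustrative (the summary only says that DELETE and INSERT receive higher weight), and normalizing by the sum of weights is one common convention, assumed here.

```python
import math
from typing import Dict, List


def weighted_cross_entropy(
    probs: List[Dict[str, float]],
    gold: List[str],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of -log p(gold label), normalized by the sum of the weights."""
    total = 0.0
    norm = 0.0
    for dist, label in zip(probs, gold):
        w = weights[label]
        total += -w * math.log(dist[label])
        norm += w
    return total / norm


# Hypothetical predictions for two tokens: a confident KEEP and a missed DELETE.
probs = [
    {"KEEP": 0.90, "REPLACE": 0.05, "DELETE": 0.03, "INSERT": 0.02},
    {"KEEP": 0.70, "REPLACE": 0.05, "DELETE": 0.20, "INSERT": 0.05},
]
gold = ["KEEP", "DELETE"]

uniform = {"KEEP": 1.0, "REPLACE": 1.0, "DELETE": 1.0, "INSERT": 1.0}
upweighted = {"KEEP": 1.0, "REPLACE": 2.0, "DELETE": 4.0, "INSERT": 4.0}  # illustrative values

loss_uniform = weighted_cross_entropy(probs, gold, uniform)
loss_upweighted = weighted_cross_entropy(probs, gold, upweighted)
print(round(loss_uniform, 4), round(loss_upweighted, 4))
```

With the up-weighted classes, the missed DELETE dominates the loss, so the model is pushed harder to learn the rare operations instead of defaulting to KEEP.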
Conservative Editing Strategy:
- Function: Controls the aggressiveness of edits to avoid over-correction.
- Mechanism: Introduces an editing confidence threshold \(\tau\), so edits are executed only when the model's confidence for an editing operation exceeds \(\tau\). Additionally, an editing consistency constraint is designed to require semantic consistency in editing actions for consecutive tokens (e.g., avoiding deleting only half of a dependent clause). Adjusting \(\tau\) enables trade-offs between editing adequacy and conservatism.
- Design Motivation: The core question in post-editing interpreting is whether to edit at all: being too aggressive may distort the original meaning, whereas being too conservative leaves errors uncorrected. The confidence threshold makes this trade-off controllable.
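The confidence threshold can be sketched as a fallback to KEEP whenever the highest-scoring edit does not exceed \(\tau\). The probability values below are made up for illustration, and the editing consistency constraint over consecutive tokens is not modelled here because the summary does not describe how it is enforced.

```python
from typing import Dict, List


def conservative_decisions(probs: List[Dict[str, float]], tau: float) -> List[str]:
    """Pick the most likely operation, but fall back to KEEP unless an edit's confidence exceeds tau."""
    decisions = []
    for dist in probs:
        best = max(dist, key=dist.get)
        if best != "KEEP" and dist[best] <= tau:
            best = "KEEP"  # not confident enough to change the interpreter's wording
        decisions.append(best)
    return decisions


# Hypothetical per-token distributions over the four operations.
probs = [
    {"KEEP": 0.90, "REPLACE": 0.05, "DELETE": 0.03, "INSERT": 0.02},
    {"KEEP": 0.30, "REPLACE": 0.10, "DELETE": 0.55, "INSERT": 0.05},
    {"KEEP": 0.05, "REPLACE": 0.90, "DELETE": 0.03, "INSERT": 0.02},
]

lenient = conservative_decisions(probs, tau=0.5)
strict = conservative_decisions(probs, tau=0.8)
print(lenient)  # ['KEEP', 'DELETE', 'REPLACE']
print(strict)   # ['KEEP', 'KEEP', 'REPLACE']
```

Raising \(\tau\) from 0.5 to 0.8 suppresses the uncertain DELETE while keeping the confident REPLACE, which is the adequacy versus conservatism trade-off described above.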

Loss & Training

The training employs a multi-task learning framework: the main task is token-level editing operation classification (weighted cross-entropy loss), and auxiliary tasks include target token generation (language modeling loss for REPLACE and INSERT operations) and edit location detection (binary classification loss). The total loss is a weighted sum of the three: \(L = L_{edit} + \lambda_1 L_{gen} + \lambda_2 L_{detect}\). Training data labels are automatically generated by aligning interpreting transcripts with reference translations.
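One way to derive token-level labels by alignment is sketched below with Python's standard `difflib.SequenceMatcher`. This is an assumption for illustration: the summary does not say which aligner the authors use, and the helper `derive_labels` and the example sentences are hypothetical. The sketch is deliberately simplified and drops reference tokens that a single operation per token cannot express.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Label = Tuple[str, Optional[str]]


def derive_labels(transcript: List[str], reference: List[str]) -> List[Label]:
    """Align transcript tokens to reference tokens and emit one edit label per transcript token."""
    labels: List[Label] = [("KEEP", None)] * len(transcript)
    matcher = SequenceMatcher(a=transcript, b=reference, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            for i in range(i1, i2):
                labels[i] = ("DELETE", None)
        elif tag == "replace":
            # Pair tokens left to right; surplus transcript tokens are deleted.
            # Surplus reference tokens are dropped in this simplified sketch.
            for offset, i in enumerate(range(i1, i2)):
                j = j1 + offset
                labels[i] = ("REPLACE", reference[j]) if j < j2 else ("DELETE", None)
        elif tag == "insert" and i1 > 0 and labels[i1 - 1][0] == "KEEP":
            # Attach the missing reference tokens to the preceding kept token.
            labels[i1 - 1] = ("INSERT", " ".join(reference[j1:j2]))
    return labels


# Hypothetical interpreting transcript and reference translation.
transcript = "we uh welcome this proposal today".split()
reference = "we welcome the proposal made today".split()
labels = derive_labels(transcript, reference)
for token, label in zip(transcript, labels):
    print(token, label)
```

Here the filler "uh" is labelled DELETE, "this" becomes REPLACE with target "the", and "proposal" becomes INSERT with target "made"; every other token is KEEP. A binary label for the edit location detection task follows directly by marking every non-KEEP position.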

Key Experimental Results

Main Results

Dataset	Metric	Ours	Sentence-level APE	No Edit Baseline	Relative improvement vs. No Edit Baseline
EPIC-Interpreting (en-it)	TER↓	28.3	30.1	38.7	26.9%
EPIC-Interpreting (en-it)	BLEU↑	52.4	49.8	42.1	24.5%
Europarl-Interpreting (en-de)	TER↓	31.5	33.2	41.3	23.7%
Europarl-Interpreting (en-de)	BLEU↑	48.7	46.3	39.5	23.3%
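The relative improvements over the No Edit Baseline in the table can be reproduced from the reported scores. The sketch below assumes the percentage is the absolute change divided by the baseline score, counted as a reduction for TER and an increase for BLEU; the helper name is hypothetical.

```python
def relative_improvement(ours: float, baseline: float, lower_is_better: bool) -> float:
    """Improvement over the baseline as a percentage of the baseline score."""
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / baseline


# (dataset, metric, ours, no-edit baseline, lower_is_better), copied from the table above.
rows = [
    ("EPIC-Interpreting (en-it)", "TER", 28.3, 38.7, True),
    ("EPIC-Interpreting (en-it)", "BLEU", 52.4, 42.1, False),
    ("Europarl-Interpreting (en-de)", "TER", 31.5, 41.3, True),
    ("Europarl-Interpreting (en-de)", "BLEU", 48.7, 39.5, False),
]

gains = [round(relative_improvement(ours, base, lower), 1) for _, _, ours, base, lower in rows]
for (dataset, metric, *_), gain in zip(rows, gains):
    print(f"{dataset} {metric}: {gain}%")
```

The computed values are 26.9, 24.5, 23.7, and 23.3 percent, matching the table.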

Ablation Study

Configuration	TER↓	BLEU↑	Description
Full model	28.3	52.4	All components enabled
w/o Dialogue history	30.8	50.1	No dialogue history; TER worsens by 2.5 points
w/o Source language	29.6	51.2	No source-language information; TER worsens by 1.3 points
w/o Conservative strategy	27.9	51.8	TER is slightly better but BLEU drops, indicating over-editing
Sentence-level substitution	30.1	49.8	Degenerates to sentence-level APE

Key Findings

Dialogue history context contributes the most; removing it degrades TER by 2.5 points, indicating that evaluating interpreting errors is heavily dependent on context.
Source-language information benefits the correction of terminology-related errors most, showing more significant results in specialized interpreting domains (e.g., medical, legal).
Although the conservative editing strategy costs 0.4 TER points (28.3 vs. 27.9 without it), it achieves higher "intent preservation" scores in human evaluations.

Highlights & Insights

Merging context augmentation with token-level editing is highly elegant—it preserves the efficiency and precision of token-level approaches while gaining the global perspective of sentence-level approaches through the attention mechanism.
The design of the conservative editing strategy reflects a profound understanding of interpreting scenarios: unlike generic APE, post-editing for interpreting needs to respect the professional judgment of the interpreters.
The four-class editing operation framework can be migrated to other text error-correction tasks, such as grammatical error correction and OCR post-processing.

Limitations & Future Work

The proposed method depends on high-quality ASR outputs and has limited capability in handling errors directly introduced by ASR.
Utilizing source-language information assumes that aligned source-language texts are available, which may not hold true in certain interpreting scenarios.
The dialogue history window size \(k\) needs tuning for each scenario, and the paper does not thoroughly discuss how to choose an optimal window.
Future work could explore end-to-end speech-to-edited-text solutions to bypass the intermediary ASR phase.

vs. Traditional APE (Chatterjee et al.): Traditional APE rewrites at the sentence level, whereas this work performs precise corrections at the token level, showing advantages in efficiency and intent preservation.
vs. LaserTagger: LaserTagger also performs token-level editing but lacks context awareness. This work significantly improves editing accuracy via multi-level context encoding.
vs. GECToR: GECToR is designed for grammatical error correction, whereas this work targets post-editing for human interpreting, employing a more conservative editing strategy that focuses more on preserving original expressions.
vs. Translatotron: End-to-end speech translation bypasses the ASR stage, yet computer-aided interpreting scenarios still require post-editing to correct the outputs of human interpreters.

Rating

Novelty: ⭐⭐⭐⭐ Integrating context augmentation with token-level editing for interpreting post-editing is a novel perspective.
Experimental Thoroughness: ⭐⭐⭐ Core experiments are sufficient, but the variety of datasets is limited.
Writing Quality: ⭐⭐⭐⭐ Clear motivation, and the description of the methodology is comprehensive.
Value: ⭐⭐⭐⭐ Offers practical value to the fields of human interpreting and speech translation.