InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model

Conference: ACL 2025
Authors: Siqi Ouyang, Xi Xu, Lei Li (CMU)
arXiv: 2503.02969
Code: GitHub
Area: LLM/NLP, Speech Translation
Keywords: simultaneous translation, streaming speech, KV cache, multi-turn dialogue, unbounded input

TL;DR

This paper proposes InfiniSST, which casts simultaneous translation of unbounded streaming speech as an LLM multi-turn dialogue task. It combines robust-segment training data construction, multi-delay augmentation, and Λ-shaped KV cache management, and it reduces computation-aware latency by 0.5-1 second on the MuST-C En-Es, En-De, and En-Zh directions without loss of translation quality.

Background & Motivation

Core Problem: Simultaneous translation (SST) must balance quality and latency, but existing methods mostly assume pre-segmented speech (SST-S) and cannot handle continuous, unbounded streaming speech input in real-world scenarios (SST-U).
Limitations of Prior Work: Conventional LLM-based SST-S methods recalculate historical speech and generated translation features whenever a new speech chunk arrives, incurring massive computational overhead. While some works model SST as multi-turn dialogue to leverage KV cache (e.g., Yu et al., 2025; Wang et al., 2024), these approaches target segmented speech and cannot extend seamlessly to unbounded inputs.
Design Motivation: LLMs possess powerful long-context modeling capabilities (RoPE positional encoding + attention window), making them an ideal foundation for SST-U. The key challenges are: (1) how to construct training data suitable for unbounded speech; (2) how to manage the KV cache during inference to achieve constant memory usage.

Method

Overall Architecture

InfiniSST consists of three core components:
1. Streaming Speech Encoder (modified wav2vec2): Incrementally computes speech representations to avoid redundant computation.
2. Speech-to-Token Embedding Adapter: A two-layer 1-D convolution (kernel=2, stride=2) + linear projection, compressing 48 frames into 12 LLM embedding vectors.
3. Multi-turn LLM Decoder (Llama-3.1-8B-Instruct): Alternately reads speech inputs and generates translations, controlling read-write switching via EOT tokens.
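
The 48-to-12 compression of the adapter follows from the convolution length formula. The sketch below is only an arithmetic check, assuming unpadded convolutions with dilation 1 (the summary does not state padding); the function names are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded 1-D convolution (assumption: no padding, dilation 1)."""
    return (length - kernel) // stride + 1


def adapter_out_len(frames: int, layers: int = 2, kernel: int = 2, stride: int = 2) -> int:
    """Sequence length after stacking `layers` identical convolutions."""
    for _ in range(layers):
        frames = conv1d_out_len(frames, kernel, stride)
    return frames


# One chunk of 48 encoder frames: 48 -> 24 -> 12 embeddings.
embeddings_per_chunk = adapter_out_len(48)
```

Under these assumptions, and with a 960 ms chunk, each LLM embedding spans 960 / 12 = 80 ms of audio.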

Inference workflow: System Instruction → Loop{USER token + 12 speech embeddings + EOT → ASSISTANT generates translation → switches back to reading upon encountering EOT}.
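
The read-write loop above can be sketched with a stubbed model. This is an illustration only: `generate_token` stands in for the LLM decoding step, the role and EOT strings are placeholders rather than the real tokenizer's tokens, and the per-turn token cap is an added safeguard that the summary does not mention.

```python
from typing import Callable, Iterable, List

EOT = "<EOT>"  # placeholder; the real end-of-turn token is tokenizer-specific


def simultaneous_translate(
    speech_chunks: Iterable[List[float]],
    generate_token: Callable[[List[object]], str],
    max_tokens_per_turn: int = 64,  # hypothetical safety cap
) -> List[str]:
    """Alternate between reading one speech chunk and writing translation tokens."""
    context: List[object] = ["<SYSTEM_INSTRUCTION>"]
    output: List[str] = []
    for chunk in speech_chunks:
        # READ: one user turn = role marker + speech embeddings + EOT.
        context += ["<USER>", chunk, EOT, "<ASSISTANT>"]
        # WRITE: generate until the model emits EOT, then go back to reading.
        for _ in range(max_tokens_per_turn):
            token = generate_token(context)
            context.append(token)
            if token == EOT:
                break
            output.append(token)
        else:
            # The cap was hit without an EOT: close the turn explicitly.
            context.append(EOT)
    return output
```

A turn in which the model emits EOT immediately produces no output, which corresponds to the model choosing to wait for more speech.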

Key Designs

Streaming Speech Encoder Modification: Replaces the bidirectional attention of wav2vec2 with chunk-wise causal attention (bidirectional within a chunk, causal across chunks), replaces convolutional positional encoding with RoPE, and adds a sliding window of $w^s=10$ chunks (approx. 9.6 seconds of context). Each chunk consists of 48 frames, lasting 960ms.
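
The chunk-wise causal pattern with a sliding window can be written as a boolean mask. The sketch below is not the authors' implementation: `chunkwise_causal_mask` is a hypothetical name, and it assumes that the window of $w^s$ chunks counts the current chunk plus the $w^s - 1$ chunks before it.

```python
from typing import List


def chunkwise_causal_mask(
    num_frames: int, chunk_size: int = 48, window_chunks: int = 10
) -> List[List[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k."""
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            # Visible keys: the query's own chunk (bidirectional inside it)
            # and earlier chunks that still fall inside the sliding window.
            row.append(q_chunk - window_chunks < k_chunk <= q_chunk)
        mask.append(row)
    return mask
```

With `chunk_size=2` and `window_chunks=2`, a frame in the third chunk sees the second and third chunks but not the first, while frames in the same chunk always see each other.
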
Training Data Construction: (a) Uses MFA forced alignment + SimAlign to establish a monotonic mapping of speech → transcription → translation; (b) segments talks into 30-chunk "robust segments" that include non-speech sounds for enhanced robustness; (c) applies multi-delay augmentation, which randomly selects a delay multiplier $m \in [1, M]$ and merges every $m$ successive chunks and their translations, where $M=12$.
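
Multi-delay augmentation can be illustrated as follows. This is a sketch under stated assumptions, not the paper's pipeline: the function names are hypothetical, $m$ is sampled once per segment, and merged translations are joined with a space (which would not suit Chinese targets).

```python
import random
from typing import List, Optional, Tuple

Turn = Tuple[List[float], str]  # (speech frames, translation text)


def merge_with_delay(chunks: List[List[float]], translations: List[str], m: int) -> List[Turn]:
    """Merge every m successive chunks, and their translations, into one turn."""
    if len(chunks) != len(translations):
        raise ValueError("each chunk needs exactly one (possibly empty) translation")
    merged = []
    for start in range(0, len(chunks), m):
        speech = [frame for c in chunks[start:start + m] for frame in c]
        text = " ".join(t for t in translations[start:start + m] if t)
        merged.append((speech, text))
    return merged


def multi_delay_augment(
    chunks: List[List[float]],
    translations: List[str],
    max_multiplier: int = 12,
    rng: Optional[random.Random] = None,
) -> Tuple[int, List[Turn]]:
    """Sample a delay multiplier m in [1, max_multiplier] and merge accordingly."""
    rng = rng or random.Random()
    m = rng.randint(1, max_multiplier)  # inclusive on both ends
    return m, merge_with_delay(chunks, translations, m)
```

A larger sampled $m$ yields longer user turns and therefore a higher-latency training example from the same aligned segment.
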
KV Cache Management Strategy: During inference, the LLM keeps the KV cache of the system instruction plus the KV cache of the most recent $w^t=1000$ tokens. The stored KV entries carry no positional information (they are kept without RoPE), and RoPE is reapplied after concatenation. The resulting Λ-shaped attention window allows extrapolation to unbounded input length.
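
The cache policy can be sketched on a plain list of cache entries. This is a simplified illustration with hypothetical names. It assumes the $w^t$ window is counted separately from the instruction tokens, and that positions are reassigned contiguously inside the trimmed cache when RoPE is reapplied, as in other Λ-shaped window methods; the summary does not spell out either detail.

```python
from typing import List, TypeVar

T = TypeVar("T")


def trim_kv_cache(entries: List[T], num_instruction: int, window: int = 1000) -> List[T]:
    """Keep the instruction entries plus the most recent `window` entries."""
    instruction = entries[:num_instruction]
    rest = entries[num_instruction:]
    recent = rest[-window:] if window > 0 else []
    return instruction + recent


def cache_positions(cache_len: int) -> List[int]:
    """Positions used when RoPE is reapplied: indices within the trimmed cache."""
    return list(range(cache_len))
```

Because the trimmed cache never holds more than `num_instruction + window` entries, both memory use and the largest position index stay bounded however long the stream runs.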

Loss & Training

Standard cross-entropy loss (applied only to translation tokens and EOT). Two-stage training: (1) Freeze the LLM, train the encoder + adapter for 6 epochs (lr=2e-4); (2) Freeze the encoder + adapter, train the LLM for 1 epoch (lr=7e-6). Single-node training on 8×L40S GPUs.
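
The loss masking can be sketched as follows. The role labels and function names are hypothetical, and the sketch assumes that the supervised EOT is the one closing each assistant turn; it takes per-token log-probabilities as given rather than computing them from a model.

```python
from typing import List, Sequence

# Hypothetical role labels for positions that contribute to the loss.
SUPERVISED_ROLES = ("translation", "assistant_eot")


def build_loss_mask(roles: Sequence[str]) -> List[bool]:
    """True for positions whose target token is a translation token or the closing EOT."""
    return [role in SUPERVISED_ROLES for role in roles]


def masked_cross_entropy(token_log_probs: Sequence[float], loss_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the positions selected by loss_mask."""
    selected = [-lp for lp, keep in zip(token_log_probs, loss_mask) if keep]
    if not selected:
        raise ValueError("loss mask selects no positions")
    return sum(selected) / len(selected)
```

Instruction tokens, speech embeddings, and user-turn markers are excluded, so the model is only trained on what to write and when to stop writing.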

Experiments

Main Results: MuST-C Full TED Talks (27 talks, 3-23 minutes)

Comparison	Computation-Aware Latency (StreamLAAL_CA)	Non-Computation-Aware Latency (StreamLAAL)	Translation Quality
InfiniSST vs StreamAtt+	Reduced by 0.5-1 seconds	Comparable or slightly better	BLEU comparable or 0.5-1.0 higher
InfiniSST RTF vs StreamAtt+ RTF	Less than half	—	—

When StreamLAAL $\le$ 1.5s, InfiniSST achieves slightly higher BLEU (0.5-1.0) in all three language directions, with comparable COMET.

Ablation Study

Ablation Component	Results
Robust Segments vs Original Segments	Without robust segments, the model gets stuck in repetitions or stops translating when encountering non-speech sounds like laughter, with COMET degrading from 69.2 to 50.5.
Multi-delay Augmentation M=1 vs 12	A larger $M$ yields a better quality-latency trade-off, but inference $m$ should not exceed $M$.
Speech Window $w^s$ = 5/10/20/40	Quality degrades when the inference window does not match training (66.1 vs 69.2). It is recommended to use the maximum allowable window.
LLM Window $w^t$ = 500 → 4000	COMET only increases from 69.0 to 69.4, indicating LLM robustness to window size.
Instruction KV Cache Removal	The LLM stops translating after the window slides, indicating that the instruction cache is indispensable.
Llama-3 (8K ctx) vs Llama-3.1 (128K)	The 8K model still generalizes to >10 min talks (COMET 67.1 vs 68.0).

Key Findings

Robust segment training is crucial for adapting the model to unbounded speech—without it, the model completely fails to handle non-speech sounds.
KV cache management reduces RTF to less than half of StreamAtt+, which is the core of computational efficiency improvements.
Even if the base LLM context is only 8K, it can still handle unbounded speech far exceeding 8K tokens through KV cache management.

Highlights & Insights

Modeling unbounded streaming SST as a multi-turn dialogue is an elegant abstraction that maps directly onto the LLM's KV cache mechanism: each new speech chunk is appended as a turn, so earlier turns need not be recomputed.
The first successful application of the Λ-shaped attention window in speech translation.
The meticulously designed training data construction (robust segments + multi-delay augmentation) is essential for model generalization.

Limitations & Future Work

Still lags behind AlignAtt and StreamAtt at high theoretical latency levels (chunk-wise causal attention restricts bidirectional information).
Evaluated only in the En-X direction, without testing X-En and X-X.
Did not explore other speech encoders and non-Llama LLMs due to computational budget constraints.
StreamLAAL metrics are not entirely reliable due to mWERSegmenter alignment errors.
Did not perform human evaluation of translation quality.

Related Work

Cascaded SST: Pipeline approach of ASR segmentation + MT translation (Fugen et al., 2006), suffering from error propagation.
End-to-End SST-U: AlignAtt/StreamAtt extended to unbounded speech (Papi et al., 2024a), but requiring full history storage.
LLM Length Extrapolation: RoPE (Su et al., 2021), Λ-shaped attention window (Han et al., 2024; Xiao et al., 2024).

Rating

Dimension	Score
Novelty	⭐⭐⭐⭐
Technical Depth	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐⭐