Latent Speech-Text Transformer¶
Conference: ICLR 2026 Oral arXiv: 2510.06195 Code: GitHub Area: Audio & Speech Keywords: speech-text modeling, latent patches, autoregressive, ASR, TTS, cross-modal alignment, BLT
TL;DR¶
This paper proposes the Latent Speech-Text Transformer (LST), which aggregates discrete speech tokens into higher-level "latent speech patches" as autoregressive units (analogous to BLT's treatment of bytes), aligning the sequence-modeling granularity of speech and text (reducing the speech-to-text length ratio from roughly 20:1 to about 1:1). LST achieves a +6.5% absolute improvement on Speech HellaSwag, with gains that continue to grow from 420M to 7B parameters, while reducing ASR/TTS inference computation.
Background & Motivation¶
Background: Discrete speech tokens (e.g., HuBERT at 25 Hz with a 501-entry codebook) have made autoregressive speech language modeling feasible. However, speech token sequences are far longer than their textual counterparts (10–20×), resulting in training and inference efficiency well below that of text LLMs — it is estimated that approximately three orders of magnitude more data are required to achieve comparable capability.
Limitations of Prior Work:
- Information density mismatch: The severe asymmetry in sequence length between speech tokens and text tokens impedes cross-modal knowledge transfer.
- Unbalanced compute allocation: Most computation during pretraining and inference is spent on long speech sequences rather than meaningful semantic modeling.
- Insufficient alignment attempts: Warm initialization (from text LLMs) and interleaved training help, but a significant performance gap between speech→speech and text→text tasks remains.
- BPE fails on speech tokens (Cuervo & Marxer 2024): simple subword segmentation is not applicable to speech.
Key Challenge: Speech modeling requires fine-grained tokens (25 Hz), yet autoregressive modeling is inefficient over long sequences and yields poor cross-modal alignment.
Core Idea: Drawing inspiration from the Byte Latent Transformer (BLT), speech tokens are aggregated into "latent patches" (higher-level autoregressive units). A global Transformer performs modeling at the patch level, while a lightweight decoder expands patches back into speech tokens. Patch granularity is aligned with that of text tokens.
Method¶
Overall Architecture¶
Speech token sequence \(\{s_0, \ldots, s_T\}\) → Patch Encoder (sliding-window self-attention + cross-attention, aggregating into patch representations \(\{z_0, \ldots, z_{T'}\}\), \(T' \ll T\)) → Global Transformer (autoregressive modeling over patches + text tokens) → Patch Decoder (lightweight Transformer + cross-attention, reconstructing speech tokens from patches) → standard NTP loss.
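A minimal PyTorch sketch of this pipeline is given below. It is an illustrative assumption, not the authors' implementation: the patch encoder is stood in for by a simple group-and-project step over static patches of size 3, and the patch decoder by a linear expansion back to token logits, whereas the paper's encoder/decoder use sliding-window self-attention plus cross-attention; text tokens (which bypass the patch encoder) are omitted. All names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class LSTSketch(nn.Module):
    def __init__(self, vocab=501, d=512, patch=3, layers=6):
        super().__init__()
        self.patch, self.vocab = patch, vocab
        self.embed = nn.Embedding(vocab, d)
        # Patch-encoder stand-in: fold `patch` token embeddings into one latent patch.
        self.to_patch = nn.Linear(patch * d, d)
        # Global transformer: autoregressive modeling at the patch level.
        block = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.global_tf = nn.TransformerEncoder(block, num_layers=layers)
        # Patch-decoder stand-in: expand each patch back into `patch` token logits.
        self.to_tokens = nn.Linear(d, patch * vocab)

    def forward(self, speech_tokens):                       # (B, T), T divisible by patch
        B, T = speech_tokens.shape
        x = self.embed(speech_tokens)                        # (B, T, d)
        z = self.to_patch(x.view(B, T // self.patch, -1))    # (B, T', d), T' = T / patch
        mask = nn.Transformer.generate_square_subsequent_mask(z.size(1))
        h = self.global_tf(z, mask=mask, is_causal=True)     # causal, patch-level
        logits = self.to_tokens(h).view(B, T, self.vocab)    # back to token granularity
        return logits                                         # fed to the token-level NTP loss
```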
Key Designs¶
- Three Patching Strategies
- Static Patching: Non-overlapping segmentation of fixed size \(p\) (e.g., \(p=3\), one patch per 3 speech tokens). Simple and efficient; no auxiliary model required at inference.
- Alignment Patching: Wav2Vec2+CTC forced alignment is used to obtain speech–text timestamps; each text unit (word/BPE) corresponds to one patch, with silence segments forming separate patches. This precisely aligns speech and text granularities.
- Curriculum Patching (final approach): Training gradually transitions from alignment patching to static patching: the probability \(P(u)\) of using alignment patching decays linearly from 1 to 0 over training steps \([\tau_1, \tau_2]\). Early training benefits from the semantic correspondence provided by alignment; later training switches to static patching to eliminate dependence on the alignment model at inference (see the sketch after this list).
- Design Motivation: Alignment patching provides the best cross-modal alignment but requires an auxiliary model; curriculum patching retains the benefits while eliminating the inference-time dependency.
- Patch Encoder and Patch Decoder
- Encoder: Sliding-window self-attention + cross-attention layers aggregate token embeddings into patch embeddings.
- Decoder: Lightweight Transformer with cross-attention inserted at each layer to receive patch-level information; self-attention window of 512 tokens.
- Compute allocation: The global Transformer dominates FLOPs; the Encoder and Decoder are lightweight. By performing global modeling at the patch level rather than the token level, computation is substantially reduced.
- Cross-Modal Alignment Mechanism
- Patch-level modeling causes speech and text to appear at comparable granularities within the same sequence.
- Interleaved data training: text and speech from the same corpus alternate, with some speech segments replaced by their textual counterparts.
- Effect: Patches automatically learn correspondences to syllables/words, facilitating S↔T knowledge transfer.
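As a concrete illustration of the curriculum schedule referenced above, the sketch below (an assumption about the exact form, not the authors' code) draws per example whether to use alignment patching, with the probability decaying linearly from 1 to 0 between steps \(\tau_1\) and \(\tau_2\), and falls back to fixed-size static patches otherwise.

```python
import random

def use_alignment_patching(step, tau1, tau2):
    """Decide, for a given training step, whether to use alignment patching."""
    if step <= tau1:
        p_align = 1.0                                   # early training: always align
    elif step >= tau2:
        p_align = 0.0                                   # late training: static only
    else:
        p_align = 1.0 - (step - tau1) / (tau2 - tau1)   # linear decay in between
    return random.random() < p_align

def static_patches(tokens, p=3):
    """Non-overlapping fixed-size patches (trailing remainder dropped for brevity)."""
    return [tokens[i:i + p] for i in range(0, len(tokens) - len(tokens) % p, p)]
```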
Loss & Training¶
- Standard NTP loss (at the token level), applied to the output of the patch decoder.
- End-to-end training (encoder + global Transformer + decoder).
- Speech tokenizer: HuBERT at 25 Hz, 501-entry codebook.
- Text tokenizer: Llama 2 tokenizer.
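For completeness, a minimal sketch of the objective: standard next-token prediction over speech tokens, computed on the patch decoder's outputs. The shift-by-one and flattening are the usual conventions, and the tensor shapes follow the forward-pass sketch above (this is an illustrative assumption, not the authors' code).

```python
import torch.nn.functional as F

def ntp_loss(logits, tokens):
    # logits: (B, T, vocab) from the patch decoder; tokens: (B, T) target speech tokens.
    # Position t predicts token t+1; cross-entropy is averaged over the shifted positions.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```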
Key Experimental Results¶
Main Results (Speech HellaSwag, story completion)¶
| Setting | Scale / Token Budget | LST Gain |
|---|---|---|
| Compute-controlled (same training steps) | 420M | +6.5% absolute |
| Data-controlled (same data volume) | 420M | +5.3% absolute |
| Compute-optimal scaling | 420M → 1.8B | Gains increase with scale |
| Fixed-token budget | 7B, 70B tokens | Gains persist |
Key finding: rather than saturating, gains continue to grow as model size increases, indicating that LST improves compute-optimal scaling.
Downstream Tasks¶
| Task | Result | Notes |
|---|---|---|
| ASR adaptation | More stable | Patch-level modeling reduces long-range dependency issues |
| TTS inference | Shorter sequences, lower compute | Direct benefit of compressed sequence length |
| Reconstruction quality | No degradation | Demonstrates lossless patch compression |
| Text→Text | Also improves | Cross-modal training indirectly boosts text capability |
Ablation Study¶
| Configuration | Result | Notes |
|---|---|---|
| No patching (baseline) | Baseline | Standard interleaved training |
| BPE on speech tokens | No improvement, or even degradation | Confirms BPE inapplicability to speech tokens |
| Static patching (\(p=3\)) | Significant improvement | Even simple segmentation is effective |
| Alignment patching | Best, but requires auxiliary model | Value of semantic alignment |
| Curriculum patching | Best trade-off | Retains alignment benefits without inference auxiliary model |
Key Findings¶
- Information density alignment is central: Bringing speech and text to comparable sequence lengths substantially improves cross-modal knowledge transfer, supporting the hypothesis that granularity mismatch is the primary bottleneck.
- Even the simplest static patching is effective — indicating that the issue lies not in precise semantic alignment but in reducing redundancy in speech sequences.
- Patches automatically learn semantic correspondences: Curriculum patching begins with alignment but ultimately transitions to static patching; the model retains the learned correspondences, suggesting that alignment signals can be "distilled" into patch representations.
- Gains grow with model scale — an important implication for scaling laws: LST may shift the compute-optimal point for speech LMs.
- Text performance also improves — speech patching not only does not harm text modeling but indirectly enhances it through better cross-modal training.
Highlights & Insights¶
- Successful transfer of the BLT paradigm from text to speech: The core idea of the Byte Latent Transformer — aggregating fine-grained tokens into patches for global modeling — proves equally effective in the speech domain and may be even more impactful given that speech exhibits greater redundancy than bytes.
- Efficiency and quality simultaneously improved: Reducing sequence length while improving quality is not a trade-off but a win-win. The reason: shorter sequences make it easier for the global Transformer to capture long-range dependencies.
- Elegant curriculum design: Alignment patching requires an inference-time auxiliary model (impractical); static patching loses semantic alignment (suboptimal); curriculum patching smoothly transitions from the former to the latter — using alignment during training and static patching at inference.
Limitations & Future Work¶
- The robustness of patch size selection across languages with diverse syllabic structures and varying speaking rates has not been thoroughly validated.
- Only HuBERT semantic tokens are evaluated; codec-based acoustic tokens (e.g., SoundStorm) are not tested.
- No direct comparison with end-to-end speech LLMs such as Moshi or Spirit-LM.
- The curriculum schedule hyperparameters \(\tau_1, \tau_2\) require tuning.
- The 7B experiments are conducted under a suboptimal token budget (70B vs. the estimated optimal ~140B); full compute-optimal experiments are prohibitively expensive.
Related Work & Insights¶
- vs. BLT (Pagnoni 2024): LST transfers the byte→patch idea from BLT directly to speech token→speech patch, serving as the primary inspiration.
- vs. Nguyen 2025 (interleaved training): Interleaved training constitutes the baseline; LST adds patching on top — an orthogonal improvement.
- vs. Moshi (end-to-end speech LLM): Moshi employs multi-stream modeling; LST uses patch compression — distinct approaches to resolving information density mismatch.
Rating¶
- Novelty: ⭐⭐⭐⭐ The latent patch concept is concise and effective; the transfer from BLT to speech is both natural and non-trivially designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-scale evaluation (420M→7B) × two controlled settings × downstream tasks × thorough ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear and accessible, with a coherent logical flow from motivation to design to experiments.
- Value: ⭐⭐⭐⭐⭐ ICLR Oral is well deserved; the work offers important guidance for joint speech-text modeling and improves scaling behavior.