
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling

Conference: AAAI 2026
arXiv: 2508.08961
Code: https://github.com/lavendery/UUG
Area: Audio & Speech / Speech Large Language Models
Keywords: speech large language model, dual token modeling, speech understanding and generation, speech tokenizer, unified framework

TL;DR

This paper proposes DualSpeechLM, a framework that extracts high-level semantic tokens via an understanding-driven speech tokenizer (USTokenizer) as LLM input and uses acoustic tokens as output, jointly optimizing speech understanding and generation within a single end-to-end framework.

Background & Motivation

Background: Speech large language models (Speech LLMs) built upon text LLMs have advanced rapidly in recent years, encompassing understanding-oriented models (QwenAudio, SALMONN) and generation-oriented models (SEED-TTS, UniAudio). Unified understanding-and-generation approaches (SpeechGPT, Moshi, Mini-Omni2) are also being actively explored.

Limitations of Prior Work:

  • Data dependency: Adapting text LLMs into unified speech LLMs requires large amounts of paired data due to the substantial modality gap between speech and text (SpeechGPT requires 70K hours; SpiritLM requires 570K hours).
  • Task conflict: Generation tasks require rich acoustic details (prosody, emotion, speaker characteristics), whereas understanding tasks require high-level semantic features. A single token type cannot serve both objectives well: acoustic tokens degrade understanding performance, while semantic tokens degrade generation quality.

Key Challenge: A single token type cannot simultaneously satisfy the distinct information requirements of understanding (semantics-oriented) and generation (acoustics-oriented); improving one objective typically harms the other.

Goal: Achieve mutual benefit between speech understanding and generation under low-resource data conditions, rather than a zero-sum trade-off.

Key Insight: Innovations are proposed along two dimensions—speech tokenization and language modeling—by designing an understanding-driven tokenizer and a dual-token modeling framework.

Core Idea: High-level semantic tokens (USTokens) are used as input to reduce the modality alignment difficulty and enhance understanding, while acoustic tokens are used as output to preserve acoustic details for high-quality generation. Both are jointly trained within a unified end-to-end framework.

Method

Overall Architecture

DualSpeechLM consists of two core modules:

  1. USTokenizer: Extracts understanding-driven tokens from speech that are aligned with the semantic space of a text LLM.
  2. DualSpeechLM main framework: A dual-token LLM that takes USTokens as input and produces acoustic tokens as output.

Key Designs

  1. Understanding-Driven Speech Tokenizer (USTokenizer):

    • Architecture: pretrained Whisper encoder → downsampling encoder → vector quantization (VQ, single codebook) → upsampling decoder (a minimal sketch of this pipeline follows the list below).
    • Key innovation: An Adapter module projects VQ-quantized vectors into the input space of a frozen text LLM; the semantic content of the tokens is optimized through backpropagation from understanding tasks.
    • Training loss: \(\mathcal{L}_{\text{USTokenizer}} = \alpha \cdot \mathcal{L}_{\text{commit}} + \beta \cdot \mathcal{L}_{\text{Under}} + \gamma \cdot \mathcal{L}_{\text{reconstruction}}\)
    • The understanding loss \(\mathcal{L}_{\text{Under}}\) is the autoregressive generation likelihood of the text LLM given speech input, so token optimization is directly guided by the semantic space of the text LLM.
    • Unlike prior semantic tokenizers based on SSL quantization (HuBERT) or ASR intermediate-layer quantization (CosyVoice), USTokenizer is explicitly aligned with the semantic capability of the text LLM, substantially reducing the modality alignment burden.
  2. Dual-Token Modeling Architecture:

    • Input side: USTokens provide high-level semantic information and are fed directly into the text LLM.
    • Output side: USTokens alone lack the acoustic detail needed for synthesis, so an AcousticGPT module converts LLM hidden states into acoustic tokens as the final output.
    • AcousticGPT is integrated within the text LLM and trained jointly, forming an end-to-end pipeline.
    • Understanding path: speech → USTokens → LLM → text output.
    • Generation path: (prompt + USTokens) → LLM predicts target USTokens → AcousticGPT produces acoustic tokens → waveform.
  3. Semantic Supervision Loss:

    • An auxiliary supervision signal on intermediate USToken prediction is added to the generation path to prevent the LLM from losing semantic information.
    • This serves as a regularization mechanism to stabilize joint dual-token training.
  4. Chain-of-Condition (CoC) Strategy:

    • Rather than generating acoustic tokens directly from input USTokens in a single step, the LLM first autoregressively generates target USTokens, which are then used to produce acoustic tokens.
    • Conceptually analogous to Chain-of-Thought reasoning but applied to speech generation, providing more stable intermediate conditioning (both decoding stages are sketched in code below).
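
To make the USTokenizer design concrete, the following is a minimal PyTorch-style sketch of its forward pass. Module names, dimensions, and the codebook size are illustrative assumptions rather than the released implementation; the Whisper encoder is assumed frozen and represented only by its output features, and \(\mathcal{L}_{\text{Under}}\) would be computed by feeding the adapter outputs into the frozen text LLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class USTokenizerSketch(nn.Module):
    """Minimal sketch of the understanding-driven tokenizer (names/dims are assumptions)."""

    def __init__(self, feat_dim=1280, code_dim=512, codebook_size=4096, llm_dim=3072):
        super().__init__()
        # Downsampling encoder over frozen Whisper-encoder features (stride 2 halves the frame rate).
        self.down = nn.Conv1d(feat_dim, code_dim, kernel_size=4, stride=2, padding=1)
        # Single-codebook vector quantizer.
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Upsampling decoder used only for the reconstruction loss.
        self.up = nn.ConvTranspose1d(code_dim, feat_dim, kernel_size=4, stride=2, padding=1)
        # Adapter projecting quantized vectors into the frozen text LLM's input space.
        self.adapter = nn.Linear(code_dim, llm_dim)

    def quantize(self, z):
        # z: (B, T, code_dim) -> nearest codebook entry per frame.
        flat = z.reshape(-1, z.size(-1))
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=-1).view(z.shape[:-1])
        q = self.codebook(ids)                      # (B, T, code_dim)
        commit_loss = F.mse_loss(z, q.detach())     # commitment loss
        q = z + (q - z).detach()                    # straight-through estimator
        return ids, q, commit_loss

    def forward(self, whisper_feats):
        # whisper_feats: (B, T, feat_dim) from a frozen Whisper encoder
        # (assumes an even number of frames so the reconstruction length matches).
        z = self.down(whisper_feats.transpose(1, 2)).transpose(1, 2)
        ids, q, commit_loss = self.quantize(z)
        recon = self.up(q.transpose(1, 2)).transpose(1, 2)
        recon_loss = F.mse_loss(recon, whisper_feats)
        llm_inputs = self.adapter(q)    # fed to the frozen text LLM to compute L_Under
        return ids, llm_inputs, commit_loss, recon_loss
```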
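
The chain-of-condition generation path can likewise be sketched as a two-stage decoding loop. The interfaces below (an `llm` callable returning logits and hidden states, an `acoustic_gpt.generate` method, and a codec `vocoder`) are assumed placeholders for illustration, not the paper's released API.

```python
import torch

@torch.no_grad()
def generate_speech(llm, acoustic_gpt, vocoder, prompt_ids, us_token_ids,
                    max_us_tokens=400, max_acoustic_tokens=800, eos_id=0):
    """Chain-of-Condition decoding sketch (all interfaces are assumptions).

    Stage 1: the LLM autoregressively predicts target USTokens conditioned on the
             text prompt and the input USTokens.
    Stage 2: AcousticGPT turns the hidden states of those USTokens into acoustic
             tokens, which a codec vocoder converts to a waveform.
    """
    ids = torch.cat([prompt_ids, us_token_ids], dim=-1)       # (1, T0)
    start_len = ids.size(1)

    # Stage 1: greedy decoding of target USTokens (sampling would also work).
    for _ in range(max_us_tokens):
        logits, _ = llm(ids)                                   # (1, T, V_us), (1, T, D)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == eos_id:
            break

    # Hidden states of the generated target USTokens condition the acoustic stage.
    _, hidden = llm(ids)
    us_hidden = hidden[:, start_len:]                          # (1, T_gen, D)

    # Stage 2: AcousticGPT decodes acoustic tokens from the conditioning states,
    # and the codec decoder (e.g. a WavTokenizer-style vocoder) reconstructs audio.
    acoustic_ids = acoustic_gpt.generate(condition=us_hidden, max_len=max_acoustic_tokens)
    return vocoder(acoustic_ids)
```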

Loss & Training

  • USTokenizer: commitment loss + understanding loss + reconstruction loss.
  • DualSpeechLM: cross-entropy for the understanding branch; acoustic token prediction loss + semantic supervision loss for the generation branch (see the sketch after this list).
  • Only 4.5K hours of training data are used (compared to 570K hours for SpiritLM).
  • Built upon Phi3.5-3B with LoRA fine-tuning rather than full-parameter training.
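
As a rough sketch of how these objectives could be combined in code (the weights, signatures, and simple sum below are assumptions; the paper's actual coefficients are not reproduced here):

```python
import torch.nn.functional as F

def ustokenizer_loss(commit_loss, under_nll, recon_loss, alpha=1.0, beta=1.0, gamma=1.0):
    # L_USTokenizer = alpha * L_commit + beta * L_Under + gamma * L_reconstruction
    return alpha * commit_loss + beta * under_nll + gamma * recon_loss

def dualspeechlm_loss(text_logits, text_targets,
                      acoustic_logits, acoustic_targets,
                      us_logits, us_targets, lambda_sem=1.0):
    # Understanding branch: next-token cross-entropy on the text outputs.
    l_under = F.cross_entropy(text_logits.transpose(1, 2), text_targets, ignore_index=-100)
    # Generation branch: acoustic-token prediction loss from AcousticGPT.
    l_acoustic = F.cross_entropy(acoustic_logits.transpose(1, 2), acoustic_targets, ignore_index=-100)
    # Semantic supervision: auxiliary loss on the intermediate target-USToken prediction.
    l_semantic = F.cross_entropy(us_logits.transpose(1, 2), us_targets, ignore_index=-100)
    return l_under + l_acoustic + lambda_sem * l_semantic
```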

Key Experimental Results

Main Results

Understanding performance (ASR: WER↓; SQA: b4↑/gs↑):

| Model | LLM | Training Data | ASR-Clean (WER↓) | ASR-Other (WER↓) | SQA (b4↑/gs↑) |
| --- | --- | --- | --- | --- | --- |
| SpeechGPT | LLaMA-7B | 70K hrs | 42.73 | 78.54 | 3.58/40 |
| SpiritLM | LLaMA-7B | 570K hrs | 6.0 | 11.0 | – |
| Baseline-Acoustic | Phi3.5-3B | 4.5K hrs | 36.52 | 80.06 | 17.68/76 |
| Baseline-Semantic | Phi3.5-3B | 4.5K hrs | 5.70 | 14.32 | 42.01/85 |
| DualSpeechLM (USToken) | Phi3.5-3B | 4.5K hrs | 4.22 | 9.71 | 44.38/88 |

Generation performance (TTS, SIM↑/WER↓/DNSMOS↑):

| Model | Clean (SIM↑/WER↓/DNSMOS↑) | Other (SIM↑/WER↓/DNSMOS↑) |
| --- | --- | --- |
| Baseline-Acoustic | 0.88/22.11/3.76 | 0.87/26.38/3.69 |
| Baseline-Semantic | 0.80/21.72/3.29 | 0.81/22.32/3.26 |
| DualSpeechLM (USToken) | 0.90/9.25/3.86 | 0.88/9.88/3.82 |

Ablation Study

Data proportion experiments (key finding):

  • Baseline models: increasing generation data degrades understanding performance, and increasing understanding data degrades generation performance (task conflict).
  • DualSpeechLM: increasing data in either direction simultaneously improves performance on both dimensions (mutual benefit).

Token type comparison:

  • DualSpeechLM + HuBERT tokens: improvements in both understanding and generation, but limited.
  • DualSpeechLM + USTokens: substantial gains in both understanding and generation, validating the core contribution of USTokens.

Key Findings

  • Using only 4.5K hours of data, the model surpasses SpiritLM trained on 570K hours, demonstrating that USTokens substantially reduce data requirements for modality alignment.
  • The dual-token design successfully breaks the zero-sum trade-off between understanding and generation, enabling positive mutual reinforcement.
  • USTokens significantly outperform HuBERT tokens on both understanding and generation tasks.

Highlights & Insights

  • Decoupling input tokens from output tokens is a concise yet profound design insight: understanding and generation have fundamentally different information granularity requirements, and constraining both to a single token type is an unnecessary restriction.
  • USTokenizer leverages the understanding capability of the text LLM to guide speech token learning through backpropagation, constituting an elegant form of cross-modal knowledge distillation.
  • Exceeding prior methods with less than 1% of the training data (4.5K vs. 570K hours) represents a remarkable improvement in data efficiency.

Limitations & Future Work

  • Built upon Phi3.5-3B, a relatively small LLM; scalability to larger models has not been verified.
  • USTokenizer remains dependent on the output quality of the Whisper encoder.
  • Acoustic tokens are produced by WavTokenizer (single codebook); multi-codebook schemes may further improve generation quality.
  • Evaluation is limited to English data; multilingual generalization remains unexplored.
  • The CoC strategy introduces additional inference latency due to the sequential generation of USTokens followed by acoustic tokens.

Comparison with Related Work

  • SpeechGPT / SpiritLM: unified models using HuBERT tokens, but requiring an additional Mel-to-waveform stage.
  • Moshi: a real-time dialogue model employing multi-codebook acoustic tokens.
  • Qwen2.5-Omni: uses continuous Whisper features rather than discrete tokens.
  • Inspiration: the dual-token paradigm is potentially generalizable to vision-language models, using high-level visual tokens for understanding and pixel-level tokens for generation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Both the dual-token decoupling design and the understanding-driven tokenizer represent clear and compelling innovations.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Bidirectional evaluation across understanding and generation, with convincing data-proportion ablations.
  • Writing Quality: ⭐⭐⭐⭐ Intuitive figures and well-structured progressive argumentation.
  • Value: ⭐⭐⭐⭐⭐ Provides an elegant and data-efficient paradigm for unified speech large language models.