Recent Advances in Speech Language Models: A Survey¶
Conference: ACL 2025
arXiv: 2410.03751
Code: GitHub
Area: Speech/LLM
Keywords: speech language model, end-to-end speech, speech tokenizer, vocoder, survey
TL;DR¶
The first comprehensive survey on Speech Language Models (SpeechLMs), systematically tracing the evolution from cascaded "ASR+LLM+TTS" architectures to end-to-end speech language models. It proposes a taxonomy organized around three key components (speech tokenizer / language model / vocoder) and the training strategies used to build them, and covers downstream capabilities, evaluation metrics, challenges, and future directions.
Background & Motivation¶
Background: LLMs perform exceptionally in text interaction, but natural human-computer interaction relies on speech. The traditional "ASR+LLM+TTS" three-stage cascade, though intuitive, suffers from three primary issues: (a) information loss (paralinguistic information such as tone/emotion is lost in text); (b) high latency (due to the three stages running serially); (c) cascaded error accumulation (ASR errors propagate to the LLM and subsequently to the TTS).
Limitations of Prior Work: The SpeechLM field lacks a systematic survey. Existing surveys either focus on traditional speech technologies (SLU/SSL) or treat speech as one modality among several in multimodal LLMs, so none offers a holistic overview centered on "end-to-end speech language models".
Key Challenge: Despite the rapid development of SpeechLMs (such as GPT-4o voice, Moshi, etc.), the research community lacks a systematic understanding of their architectural choices, training strategies, and capability boundaries.
Goal: To provide the first comprehensive survey of the SpeechLM domain, covering architectural components, training schemes, capability taxonomies, and evaluation systems.
Method¶
Formal Definition of SpeechLM¶
A SpeechLM is an autoregressive foundation model that directly processes and generates speech sequences: \(\mathbf{M}^{\text{out}} = \text{SpeechLM}(\mathbf{M}^{\text{in}}; \theta)\), where \(\theta\) denotes the model parameters and each of \(\mathbf{M}^{\text{in}}\) and \(\mathbf{M}^{\text{out}}\) can be speech, text, or an interleaved sequence of both.
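As a reading aid, "autoregressive" in the definition above is commonly spelled out as a token-by-token factorization. The form below is a sketch under that standard assumption; the per-token symbol \(m_t\) and the output length \(T\) are notation introduced here for illustration, not taken from the survey.

\[
p\left(\mathbf{M}^{\text{out}} \mid \mathbf{M}^{\text{in}}; \theta\right) = \prod_{t=1}^{T} p\left(m_t \mid m_{<t}, \mathbf{M}^{\text{in}}; \theta\right), \qquad \mathbf{M}^{\text{out}} = (m_1, \dots, m_T)
\]

Each \(m_t\) is one output token (speech or text), generated conditioned on the input and on all previously generated tokens.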
Three Core Components¶
- Speech Tokenizer
    - Function: Converts continuous audio waveforms into discrete tokens for autoregressive language modeling.
    - Three types:
        - Semantic tokenizer: e.g., HuBERT/wav2vec 2.0 + k-means quantization; extracts semantic features but loses paralinguistic details.
        - Acoustic tokenizer: e.g., EnCodec/SoundStream; uses RVQ (Residual Vector Quantization) to preserve acoustic details (timbre/pitch), though semantics may be diluted.
        - Hybrid tokenizer: combines both approaches (e.g., SpeechTokenizer decoupling semantic and acoustic layers), capturing both semantics and paralinguistics.
    - Key trade-off: Semantic tokens represent high-level abstraction beneficial for understanding, while acoustic tokens represent low-level details beneficial for generation.
- Language Model (LM Backbone)
    - Function: Performs next-token prediction on speech tokens, acting as the core "brain".
    - Integration Approaches:
        - Direct modeling: Pre-training a decoder-only Transformer on speech tokens (e.g., GSLM, AudioPaLM).
        - Adapting existing TextLMs: Freezing the LLM and attaching a speech adapter (e.g., Qwen-Audio, SALMONN).
        - Joint training: Mixed training on both text and speech tokens (e.g., Spirit-LM interleaving text/speech tokens).
    - Multi-stream generation: Single-stream autoregressive vs. multi-stream parallel decoding (e.g., VALL-E uses a 2-stage approach: AR to generate coarse tokens \(\rightarrow\) NAR to complete fine tokens).
- Vocoder
    - Function: Converts tokens or representations output by the LM into audio waveforms.
    - Main Methods:
        - HiFi-GAN family: Direct conversion from mel-spectrogram/tokens to waveform (fast).
        - Diffusion models: e.g., DiffWave, high-quality but slow.
        - Token decoder: e.g., EnCodec decoder directly converts RVQ tokens into waveforms.
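The data flow through the three components above can be illustrated with a deliberately tiny, pure-Python sketch. Everything in it is a stand-in chosen for illustration, not any system's actual implementation: `CODEBOOK`, `tokenize`, `continue_tokens`, `vocode`, and `speech_lm` are hypothetical names, the tokenizer is a nearest-entry lookup on single samples (real tokenizers quantize learned frame-level features), the "LM" is a bigram counter rather than a Transformer, and the "vocoder" is a table lookup.

```python
from typing import Dict, List, Sequence

# Hypothetical 1-D "codebook": each token id maps to one amplitude value.
CODEBOOK: List[float] = [-0.75, -0.25, 0.25, 0.75]


def tokenize(waveform: Sequence[float]) -> List[int]:
    """Speech tokenizer stand-in: map each sample to its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda k: abs(x - CODEBOOK[k]))
        for x in waveform
    ]


def continue_tokens(prompt: List[int], n_new: int) -> List[int]:
    """LM stand-in: greedy next-token prediction from bigram counts of the prompt."""
    counts: Dict[int, Dict[int, int]] = {}
    for prev, nxt in zip(prompt, prompt[1:]):
        followers = counts.setdefault(prev, {})
        followers[nxt] = followers.get(nxt, 0) + 1
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        # Fall back to repeating the last token if it never had a successor.
        nxt = max(followers, key=followers.get) if followers else out[-1]
        out.append(nxt)
    return out[len(prompt):]


def vocode(tokens: Sequence[int]) -> List[float]:
    """Vocoder stand-in: look each token up in the codebook to get samples back."""
    return [CODEBOOK[t] for t in tokens]


def speech_lm(waveform: Sequence[float], n_new: int) -> List[float]:
    """Tokenizer -> LM -> vocoder, mirroring the three components above."""
    return vocode(continue_tokens(tokenize(waveform), n_new))
```

For example, `speech_lm([-0.8, 0.7, -0.7, 0.8, -0.9], 3)` tokenizes the input to `[0, 3, 0, 3, 0]`, continues the alternating pattern with `[3, 0, 3]`, and decodes that to `[0.75, -0.75, 0.75]`. The point is only that the LM never sees raw audio: it operates on discrete tokens, and audio quality is bounded by what the tokenizer keeps and the vocoder can reconstruct.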
Taxonomy of Training Schemes¶
| Stage | Method | Representative Works |
|---|---|---|
| Pre-training | Speech continuation (next-token prediction on speech) | GSLM, AudioLM |
| Pre-training | Joint speech-text pre-training | Spirit-LM, SpeechGPT |
| Alignment | Multi-task training for ASR/TTS | Whisper, Qwen-Audio |
| Alignment | Interleaved speech-text token training | Spectron, LauraGPT |
| Fine-tuning | Instruction tuning + RLHF alignment | GPT-4o, some closed-source models |
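The interleaved speech-text training listed in the table can be pictured as building a single token stream that switches modality at word boundaries. The sketch below is a minimal illustration under stated assumptions: word-level alignment between text tokens and speech units is taken as given, and the function name `interleave`, the marker tokens `[TEXT]` / `[SPEECH]`, and the `<u…>` unit format are hypothetical choices, not the format of any specific model.

```python
from typing import List, Optional, Tuple

# One aligned word: (text tokens, speech unit ids). The alignment is assumed given.
AlignedWord = Tuple[List[str], List[int]]


def interleave(words: List[AlignedWord], use_speech: List[bool]) -> List[str]:
    """Build one token stream, switching modality at word boundaries.

    use_speech[i] says whether word i is emitted as speech units or as text.
    A hypothetical marker token is inserted whenever the modality changes.
    """
    if len(words) != len(use_speech):
        raise ValueError("need one modality flag per word")
    stream: List[str] = []
    current: Optional[bool] = None  # modality of the previous word, None at start
    for (text_tokens, units), speech in zip(words, use_speech):
        if speech != current:
            stream.append("[SPEECH]" if speech else "[TEXT]")
            current = speech
        if speech:
            stream.extend(f"<u{u}>" for u in units)
        else:
            stream.extend(text_tokens)
    return stream
```

With `words = [(["the"], [12, 12, 7]), (["cat"], [40, 3]), (["sat"], [9])]` and `use_speech = [False, True, True]`, the result is `["[TEXT]", "the", "[SPEECH]", "<u40>", "<u3>", "<u9>"]`. Training a next-token predictor on such streams exposes it to transitions in both directions between text and speech.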
Key Experimental Results¶
Comparison of Representative SpeechLMs¶
| Model | Speech Tokenizer | LM | Vocoder | Capabilities |
|---|---|---|---|---|
| GSLM | HuBERT+kmeans | Transformer | code-HiFiGAN | Speech continuation |
| AudioLM | w2v-BERT+SoundStream | Transformer | SoundStream | Speech generation |
| VALL-E | EnCodec | AR+NAR Transformer | EnCodec dec. | Zero-shot TTS |
| SpeechGPT | HuBERT+kmeans | LLaMA | code-HiFiGAN | Dialogue |
| Spirit-LM | HuBERT+pitch/style | LLaMA | HiFi-GAN | Interleaved Text + Speech |
| Qwen-Audio | Whisper encoder | Qwen-7B | - | Understanding (no generation) |
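Several entries in the table (EnCodec, SoundStream) rely on residual vector quantization (RVQ), where each layer quantizes whatever the previous layers failed to capture. The following is a minimal pure-Python sketch of that idea, assuming fixed, hand-written codebooks; real codecs learn the codebooks jointly with an encoder and decoder network, and the names `nearest`, `rvq_encode`, and `rvq_decode` are illustrative only.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry with the smallest squared distance to x."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x layer by layer; each layer encodes the remaining residual."""
    residual = list(x)
    codes: List[int] = []
    for cb in codebooks:
        k = nearest(cb, residual)
        codes.append(k)
        # Subtract the chosen entry so the next layer sees only what is left.
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out
```

With `codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]]`, encoding `[1.25, 0.75]` gives `[1, 1]` and decoding those codes returns `[1.25, 0.75]`, while decoding with only the first layer (`rvq_decode([1], codebooks[:1])`) returns the coarser `[1.0, 1.0]`. This coarse-to-fine structure is what lets a model predict the first layer's tokens first and fill in later layers afterwards.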
SpeechLM Capability Classification¶
| Capability Category | Specific Tasks | Description |
|---|---|---|
| Speech understanding | ASR, SLU, emotion recognition | Foundational capabilities |
| Speech generation | TTS, voice cloning, speech editing | Core generation |
| Conversational interaction | Speech dialogue, real-time interruption | GPT-4o-level capability |
| Paralinguistics | Emotional expression, speaking style control | Distinguishing SpeechLM from ASR+LLM+TTS |
| Multilingual | Cross-lingual speech translation | Extension capabilities |
Key Findings¶
- Semantic vs. acoustic tokenizer is a core design choice: Understanding tasks favor semantic tokens, whereas generation tasks require acoustic tokens. Combining the two in a hybrid tokenizer is the current trend.
- Adapting existing TextLMs is more practical than training from scratch: Freezing the LLM and training only an adapter offers a favorable balance between resource cost and performance.
- Real-time interaction remains an open challenge: The latency of current SpeechLMs (especially AR decoding) falls short of meeting real-time conversational requirements.
- A unified evaluation framework is lacking: Different works report different metrics (e.g., WER, MOS, PESQ, speaker similarity), and there is no shared benchmark on which to compare them.
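Of the metrics named above, word error rate (WER) is the simplest to state exactly: the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal implementation assuming whitespace tokenization and no text normalization (casing, punctuation); the function name `word_error_rate` is illustrative, and published results typically apply normalization that this sketch omits.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i ref words with an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6. Because WER can exceed 1.0 when the hypothesis contains many insertions, it is not directly comparable to bounded scores such as MOS, which is part of why a shared benchmark matters.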
Highlights & Insights¶
- First comprehensive survey in the SpeechLM domain: Provides a timely and systematic review following the surge of interest sparked by GPT-4o voice.
- Clear three-component taxonomy: The decomposition framework of tokenizer \(\rightarrow\) LM \(\rightarrow\) vocoder makes complex architectures easy to understand.
- Accurate summarization of the "three major flaws of ASR+LLM+TTS": Information loss, high latency, and cascaded error propagation, which together give a clear rationale for the existence of SpeechLMs.
- The hybrid tokenizer direction: Decoupling semantic and acoustic layers (e.g., SpeechTokenizer) offers an elegant way to ease the tension between understanding and generation.
Limitations & Future Work¶
- Extremely rapid domain advancement: Implementation details of closed-source systems like GPT-4o and Moshi remain publicly unknown, so the survey may omit relevant design choices.
- Lack of quantitative comparison: A systematic comparison of different SpeechLMs on a unified benchmark is missing (each work utilizes distinct datasets and evaluation metrics).
- Insufficient safety and ethical discussion: Risks such as voice cloning abuse and deepfake speech detection are not deeply investigated.
- Exclusion of wider multimodality: The survey focuses solely on speech and text and does not cover broader multimodal LLMs that combine speech with vision.
Related Work & Insights¶
- vs. Whisper (Radford et al., 2023): Whisper is an encoder-decoder model for speech understanding (recognition and translation) that outputs only text, whereas this SpeechLM survey covers the full paradigm of both understanding and generation.
- vs. AudioLM (Borsos et al., 2023): AudioLM is an early SpeechLM, while this survey covers the rapid evolution that followed it.
- vs. Multimodal LM surveys (Zhang et al., 2024): Multimodal surveys span across vision, audio, and text, whereas this work focuses specifically on an in-depth analysis of the speech modality.
Rating¶
- Novelty: ⭐⭐⭐⭐ First survey of SpeechLMs; taxonomic system is highly valuable.
- Experimental Thoroughness: ⭐⭐ Pure survey with no original experiments.
- Writing Quality: ⭐⭐⭐⭐ Clear categorization system and rich diagrams (especially the taxonomy tree in Figure 4).
- Value: ⭐⭐⭐⭐⭐ Provides a highly demanded systematic reference for the rapidly evolving SpeechLM field.