Recent Advances in Speech Language Models: A Survey

Conference: ACL 2025
arXiv: 2410.03751
Code: GitHub
Area: Speech/LLM
Keywords: speech language model, end-to-end speech, speech tokenizer, vocoder, survey

TL;DR

The first comprehensive survey on Speech Language Models (SpeechLMs), systematically tracing the evolution from cascaded "ASR+LLM+TTS" architectures to end-to-end speech language models. It proposes a taxonomy organized around three key components (speech tokenizer / language model / vocoder) and the training strategies used with them, and covers downstream capabilities, evaluation metrics, challenges, and future directions.

Background & Motivation

Background: LLMs excel at text interaction, but natural human-computer interaction relies on speech. The traditional three-stage "ASR+LLM+TTS" cascade is intuitive but suffers from three primary issues: (a) information loss (paralinguistic information such as tone and emotion is discarded when speech is transcribed to text); (b) high latency (the three stages run serially); (c) cascaded error accumulation (ASR errors propagate to the LLM and then to the TTS).

Limitations of Prior Work: The SpeechLM field lacks a systematic survey. Existing surveys either focus on traditional speech technologies (SLU/SSL) or treat speech as one modality among many in multimodal LLMs; none offers a holistic overview centered on end-to-end speech language models.

Key Challenge: Despite the rapid development of SpeechLMs (such as GPT-4o voice, Moshi, etc.), the research community lacks a systematic understanding of their architectural choices, training strategies, and capability boundaries.

Goal: To provide the first comprehensive survey of the SpeechLM domain, covering architectural components, training schemes, capability taxonomies, and evaluation systems.

Method

Formal Definition of SpeechLM

A SpeechLM is an autoregressive foundation model that directly processes and generates speech sequences: \(\mathbf{M}^{\text{out}} = \text{SpeechLM}(\mathbf{M}^{\text{in}}; \theta)\), where \(\mathbf{M}^{\text{in}}\) and \(\mathbf{M}^{\text{out}}\) can each be speech, text, or interleaved multimodal sequences, and \(\theta\) denotes the model parameters.
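
Because the model is autoregressive, the mapping above can be read as a token-by-token factorization. The following is a sketch under the assumption that the output is a sequence of \(T\) discrete tokens; the symbols \(m_t\) and \(T\) are introduced here for illustration and are not notation from the survey.

\[
p_\theta\left(\mathbf{M}^{\text{out}} \mid \mathbf{M}^{\text{in}}\right) = \prod_{t=1}^{T} p_\theta\left(m_t \mid m_{<t}, \mathbf{M}^{\text{in}}\right), \qquad \mathbf{M}^{\text{out}} = (m_1, \dots, m_T)
\]

Each factor is one next-token prediction step, which is why a discrete speech tokenizer is needed before the language model.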

Three Core Components

Speech Tokenizer
- Function: Converts continuous audio waveforms into discrete tokens for autoregressive language modeling.
- Three types:
    - Semantic tokenizer: e.g., HuBERT/wav2vec 2.0 + k-means quantization; extracts semantic features but loses paralinguistic details.
    - Acoustic tokenizer: e.g., EnCodec/SoundStream; uses RVQ (Residual Vector Quantization) to preserve acoustic details (timbre/pitch), though semantics may be diluted.
    - Hybrid tokenizer: combines both approaches (e.g., SpeechTokenizer decoupling semantic and acoustic layers), capturing both semantics and paralinguistics.
- Key trade-off: Semantic tokens represent high-level abstraction beneficial for understanding, while acoustic tokens represent low-level details beneficial for generation.
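
The two quantization styles named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of HuBERT, EnCodec, or SpeechTokenizer: the function names, the 2-dimensional toy features, and the hand-written codebooks are all assumptions made for the example. In real systems the features come from a learned encoder and the codebooks are learned (k-means for semantic units, trained codebooks for RVQ).

```python
import numpy as np


def nearest_code(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """For each frame in (T, D), return the index of the closest codeword in (K, D)."""
    # Squared Euclidean distance between every frame and every codeword: shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def rvq_encode(frames: np.ndarray, codebooks: list) -> np.ndarray:
    """Residual vector quantization: each codebook quantizes what the previous ones missed."""
    residual = np.asarray(frames, dtype=float).copy()
    codes = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        codes.append(idx)
        residual = residual - codebook[idx]  # pass the remaining error to the next level
    return np.stack(codes, axis=1)  # shape (T, number of codebooks)


def rvq_decode(codes: np.ndarray, codebooks: list) -> np.ndarray:
    """Reconstruct frames by summing the selected codeword from every level."""
    return sum(cb[codes[:, q]] for q, cb in enumerate(codebooks))


# Toy features and hand-made codebooks (hypothetical values, for illustration only).
frames = np.array([[1.2, 0.9], [0.1, -0.2]])
coarse = np.array([[0.0, 0.0], [1.0, 1.0]])
fine = np.array([[0.0, 0.0], [0.25, 0.0], [0.0, -0.25], [0.25, -0.25]])

# Single-codebook assignment, analogous to k-means units: one token per frame.
single_codebook_units = nearest_code(frames, coarse)

# Multi-codebook RVQ, analogous to an acoustic tokenizer: several tokens per frame.
acoustic_tokens = rvq_encode(frames, [coarse, fine])
reconstruction = rvq_decode(acoustic_tokens, [coarse, fine])
```

In this toy setup, one codebook yields one token per frame and a coarse reconstruction, while the second RVQ level spends an extra token per frame to reduce the reconstruction error. That mirrors the trade-off between compact semantic units and detail-preserving acoustic tokens.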
Language Model (LM Backbone)
- Function: Performs next-token prediction on speech tokens, acting as the core "brain".
- Integration Approaches:
    - Direct modeling: Pre-training a decoder-only Transformer on speech tokens (e.g., GSLM, AudioLM).
    - Adapting existing TextLMs: Freezing the LLM and attaching a speech adapter (e.g., Qwen-Audio, SALMONN).
    - Joint training: Mixed training on both text and speech tokens (e.g., Spirit-LM interleaving text/speech tokens).
- Multi-stream generation: Single-stream autoregressive decoding vs. multi-stream parallel decoding (e.g., VALL-E uses a two-stage approach: an AR model generates the coarse tokens, then a NAR model fills in the fine tokens).
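
To make the joint-training idea concrete, the sketch below shows one way text tokens and speech units could share a single vocabulary and be interleaved into one sequence. It is an assumption-laden illustration rather than Spirit-LM's actual code: the vocabulary sizes, the marker tokens, and the function names are all hypothetical.

```python
from typing import List, Tuple

TEXT_VOCAB_SIZE = 32000   # hypothetical size of the text vocabulary
NUM_SPEECH_UNITS = 500    # hypothetical number of discrete speech units

# Hypothetical modality markers placed after the text and speech ranges.
TEXT_MARKER = TEXT_VOCAB_SIZE + NUM_SPEECH_UNITS
SPEECH_MARKER = TEXT_MARKER + 1


def speech_unit_to_id(unit: int) -> int:
    """Map a speech unit index into the shared vocabulary by offsetting past the text ids."""
    if not 0 <= unit < NUM_SPEECH_UNITS:
        raise ValueError(f"speech unit out of range: {unit}")
    return TEXT_VOCAB_SIZE + unit


def interleave(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Flatten (modality, tokens) segments into one id sequence with modality markers."""
    ids: List[int] = []
    for modality, tokens in segments:
        if modality == "text":
            if any(not 0 <= t < TEXT_VOCAB_SIZE for t in tokens):
                raise ValueError("text token out of range")
            ids.append(TEXT_MARKER)
            ids.extend(tokens)
        elif modality == "speech":
            ids.append(SPEECH_MARKER)
            ids.extend(speech_unit_to_id(u) for u in tokens)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return ids


mixed = interleave([("text", [5, 7]), ("speech", [0, 499])])
```

With a shared id space like this, the same next-token objective applies regardless of which modality a position belongs to.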
Vocoder
- Function: Converts tokens or representations output by the LM into audio waveforms.
- Main Methods:
    - HiFi-GAN family: Direct conversion from mel-spectrogram/tokens to waveform (fast).
    - Diffusion models: e.g., DiffWave, high-quality but slow.
    - Token decoder: e.g., EnCodec decoder directly converts RVQ tokens into waveforms.

Taxonomy of Training Schemes

| Stage | Method | Representative Works |
| --- | --- | --- |
| Pre-training | Speech continuation (next-token prediction on speech) | GSLM, AudioLM |
| Pre-training | Joint speech-text pre-training | Spirit-LM, SpeechGPT |
| Alignment | Multi-task training for ASR/TTS | Whisper, Qwen-Audio |
| Alignment | Interleaved speech-text token training | Spectron, LauraGPT |
| Fine-tuning | Instruction tuning + RLHF alignment | GPT-4o, some closed-source models |
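
The speech continuation objective in the table is ordinary next-token prediction applied to speech tokens. The NumPy sketch below computes that loss for a single sequence. It is a minimal illustration: the function name and the shape convention (row `t` of `logits` scores the token at position `t + 1`) are assumptions chosen for the example, not details taken from any of the listed models.

```python
import numpy as np


def next_token_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Mean negative log-likelihood of tokens[1:] given per-position logits.

    logits: (T - 1, V) array; row t scores the token at position t + 1.
    tokens: (T,) integer array of speech (or text) token ids.
    """
    targets = tokens[1:]
    if logits.shape[0] != targets.shape[0]:
        raise ValueError("need one row of logits per predicted token")
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to each true next token.
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())


# A model that is uniform over a 4-token vocabulary pays log(4) per token.
uniform_loss = next_token_loss(np.zeros((2, 4)), np.array([0, 1, 2]))
```

A uniform predictor gives a loss equal to the log of the vocabulary size, which is a useful sanity check when a training run starts.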

Key Experimental Results

Comparison of Representative SpeechLMs

| Model | Speech Tokenizer | LM | Vocoder | Capabilities |
| --- | --- | --- | --- | --- |
| GSLM | HuBERT+kmeans | Transformer | code-HiFiGAN | Speech continuation |
| AudioLM | w2v-BERT+SoundStream | Transformer | SoundStream | Speech generation |
| VALL-E | EnCodec | AR+NAR Transformer | EnCodec dec. | Zero-shot TTS |
| SpeechGPT | HuBERT+kmeans | LLaMA | code-HiFiGAN | Dialogue |
| Spirit-LM | HuBERT+pitch/style | LLaMA | HiFi-GAN | Interleaved Text + Speech |
| Qwen-Audio | Whisper encoder | Qwen-7B | - | Understanding (no generation) |

SpeechLM Capability Classification

| Capability Category | Specific Tasks | Description |
| --- | --- | --- |
| Speech understanding | ASR, SLU, emotion recognition | Foundational capabilities |
| Speech generation | TTS, voice cloning, speech editing | Core generation |
| Conversational interaction | Speech dialogue, real-time interruption | GPT-4o-level capability |
| Paralinguistics | Emotional expression, speaking style control | Distinguishing SpeechLM from ASR+LLM+TTS |
| Multilingual | Cross-lingual speech translation | Extension capabilities |

Key Findings

Semantic vs. acoustic tokenizer is a core design choice: Understanding tasks favor semantic tokens, whereas generation tasks require acoustic tokens. Combining the two is the current trend.
Adapting existing TextLMs is more practical than training from scratch: Freezing the LLM and adding an adapter strikes a favorable balance between resource cost and performance.
Real-time interaction remains an open challenge: The latency of current SpeechLMs (especially with AR decoding) falls short of real-time conversational requirements.
A unified evaluation framework is lacking: Different works report different metrics (e.g., WER, MOS, PESQ, speaker similarity), and there is no shared benchmark for comparing them.
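
Of the metrics listed above, WER (word error rate) is the easiest to state precisely: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal version that assumes whitespace tokenization and no text normalization. Both are simplifications, and the function name is chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


wer_example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one reference word out of six is dropped, so `wer_example` is 1/6. Differences in tokenization and normalization between papers are one reason reported WER values are hard to compare directly.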

Highlights & Insights

First comprehensive survey in the SpeechLM domain: Provides a timely and systematic review following the surge of interest sparked by GPT-4o voice.
Clear three-component taxonomy: The decomposition framework of tokenizer \(\rightarrow\) LM \(\rightarrow\) vocoder makes complex architectures easy to understand.
Accurate summary of the three major flaws of ASR+LLM+TTS: Information loss, high latency, and cascaded error propagation, which together give a clear rationale for SpeechLMs.
The hybrid tokenizer direction: Decoupling semantic and acoustic layers (e.g., SpeechTokenizer) is an elegant way to ease the tension between understanding and generation.

Limitations & Future Work

Extremely rapid domain advancement: Implementation details of closed-source systems such as GPT-4o remain publicly unknown, and new models appear faster than a survey can track, so the review may have omissions.
Lack of quantitative comparison: A systematic comparison of different SpeechLMs on a unified benchmark is missing (each work utilizes distinct datasets and evaluation metrics).
Insufficient safety and ethical discussion: Risks such as voice-cloning abuse and deepfake speech are not examined in depth.
Exclusion of wider multimodality: The survey focuses solely on speech and text, without considering broader multimodal LLMs that combine speech with vision.

vs. Whisper (Radford et al., 2023): Whisper is an encoder-decoder model for speech recognition and translation, so it covers understanding only, whereas this survey covers the full SpeechLM paradigm of both understanding and generation.
vs. AudioLM (Borsos et al., 2023): AudioLM is an early SpeechLM, while this survey covers the rapid evolution that followed it.
vs. Multimodal LM surveys (Zhang et al., 2024): Multimodal surveys span vision, audio, and text, whereas this work provides an in-depth analysis of the speech modality specifically.

Rating

Novelty: ⭐⭐⭐⭐ First survey of SpeechLMs; taxonomic system is highly valuable.
Experimental Thoroughness: ⭐⭐ Pure survey with no original experiments.
Writing Quality: ⭐⭐⭐⭐ Clear categorization system and rich diagrams (especially the taxonomy tree in Figure 4).
Value: ⭐⭐⭐⭐⭐ Provides a much-needed systematic reference for the rapidly evolving SpeechLM field.