Language-Codec: Bridging Discrete Codec Representations and Speech Language Models

Conference: ACL 2025
arXiv: 2402.12208
Area: LLM/NLP
Keywords: Discrete audio codec, residual vector quantization, speech language models, masked channel mechanism, Fourier transform decoder

TL;DR

The paper proposes Language-Codec, which bridges the gap between discrete codec representations and downstream speech language models via a Masked Channel Residual Vector Quantization (MCRVQ) mechanism and an improved Fourier transform decoder, achieving high-quality audio reconstruction using only 4 codebook channels.

Background & Motivation

Large language models have achieved significant progress in speech synthesis, music generation, and audio generation, where discrete acoustic codecs (e.g., Encodec, SoundStream) serve as core components to convert continuous audio signals into discrete token sequences. However, two key issues persist between existing codecs and downstream speech language models:

Information Overload in the First Codebook: Due to the reconstruction paradigm of codecs and the structural characteristics of Residual Vector Quantization (RVQ), the first layer of the codebook contains excessive audio information. When downstream tasks require directly generating acoustic tokens from weak supervision signals such as text, the highly dense information in the first-layer codebook leads to generation difficulties.

Excessive Number of Codebooks: To generate high-quality audio, a large number of codebook layers are typically required, which significantly increases the modeling burden on downstream speech language models, resulting in excessively long token sequences or exponentially growing codebook space.

The authors argue that in downstream speech language models, the first-layer quantizer essentially acts as an intermediate bridge between the text input and subsequent quantizers; thus, it is necessary to redesign the codec from the perspective of speech language models.

Method

Overall Architecture

Language-Codec follows the traditional three-stage architecture of encoder-quantizer-decoder, introducing targeted improvements to each module:

  • Encoder: Follows the structure of Encodec, comprising a 1D convolution, four downsampling convolutional blocks (with strides of 2, 4, 5, 8), a two-layer LSTM, and a final convolutional layer, yielding 75 latent steps per second at a 24kHz sampling rate.
  • Quantizer: Proposes a novel Masked Channel Residual Vector Quantization (MCRVQ) module.
  • Decoder: Adopts a Vocos-style Fourier transform decoder to replace the traditional transposed convolutional upsampling structure, and introduces an attention module to enhance sequence modeling capabilities.
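
The 75 latent steps per second quoted for the encoder follow directly from the stride values listed above. A minimal sketch of that arithmetic (the function name `latent_frame_rate` is illustrative, not taken from the paper):

```python
from math import prod

SAMPLE_RATE_HZ = 24_000          # sampling rate stated in the encoder description
ENCODER_STRIDES = (2, 4, 5, 8)   # strides of the four downsampling blocks


def latent_frame_rate(sample_rate_hz: int, strides: tuple[int, ...]) -> float:
    """Latent frames per second after all strided downsampling blocks."""
    hop_length = prod(strides)   # total downsampling factor: 2 * 4 * 5 * 8 = 320
    return sample_rate_hz / hop_length


print(latent_frame_rate(SAMPLE_RATE_HZ, ENCODER_STRIDES))  # 75.0
```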

Key Designs

1. Masked Channel Residual Vector Quantization (MCRVQ)

The core idea of MCRVQ is to turn the first \(N_q\) quantizer layers (\(N_q = 3\) in the experiments) into a parallel structure, where each quantizer processes only \(\frac{1}{N_q}\) of the latent space information \(Z\). Specifically, the compressed audio frame is divided equally into \(N_q\) portions, and designated portions are masked out for each quantizer, so that each one retains only \(\frac{1}{N_q}\) of the input information. Quantizers from the \((N_q+1)\)-th layer onward return to the serial RVQ mode, processing the residual left by all preceding quantizers.

This design ensures that:

  • The first-layer codebook no longer carries excessive information, reducing the difficulty of generating first-layer tokens from weak signals such as text.
  • Information is dispersed more evenly across channels, achieving high-quality reconstruction with only 4 codebook channels.
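
The parallel-then-serial flow can be sketched for a single latent frame as follows. This is a rough illustration under several assumptions, not the authors' implementation: the split is taken along the channel dimension, masked channels are set to zero, codebooks are random rather than learned, and training details (straight-through gradients, commitment loss) are omitted. The names `nearest_code` and `mcrvq_quantize` are hypothetical.

```python
import random


def nearest_code(vec, codebook):
    """Return (index, code vector) of the codebook entry closest to vec (squared L2)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]


def mcrvq_quantize(z, codebooks, n_parallel=3):
    """Quantize one latent frame z; returns (tokens, reconstruction).

    The first n_parallel quantizers each see only their own 1/n_parallel
    slice of the channels (assumption: all other channels are zeroed).
    The remaining quantizers run serially on the residual, as in plain RVQ.
    """
    dim = len(z)
    assert dim % n_parallel == 0, "channels must split evenly"
    chunk = dim // n_parallel
    tokens, z_hat = [], [0.0] * dim

    # Parallel stage: quantizer k only receives channels [k*chunk, (k+1)*chunk).
    for k in range(n_parallel):
        lo, hi = k * chunk, (k + 1) * chunk
        masked = [z[i] if lo <= i < hi else 0.0 for i in range(dim)]
        idx, code = nearest_code(masked, codebooks[k])
        tokens.append(idx)
        z_hat = [a + b for a, b in zip(z_hat, code)]

    # Serial stage: each later quantizer encodes what is still missing.
    for codebook in codebooks[n_parallel:]:
        residual = [a - b for a, b in zip(z, z_hat)]
        idx, code = nearest_code(residual, codebook)
        tokens.append(idx)
        z_hat = [a + b for a, b in zip(z_hat, code)]

    return tokens, z_hat


# Toy setup: 6 channels, 4 codebooks of 8 random entries each (illustrative sizes).
random.seed(0)
DIM, CODEBOOK_SIZE, N_CODEBOOKS = 6, 8, 4
codebooks = [[[random.gauss(0, 1) for _ in range(DIM)]
              for _ in range(CODEBOOK_SIZE)]
             for _ in range(N_CODEBOOKS)]
z = [random.gauss(0, 1) for _ in range(DIM)]
tokens, z_hat = mcrvq_quantize(z, codebooks)
print(tokens)  # one token index per codebook channel
```

In this sketch the first token depends only on the first third of the channels, which is the property the masking is meant to provide: no single early codebook has to summarize the whole frame.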

2. Fourier Transform Decoder

Instead of using traditional transposed convolutional upsampling (which is prone to aliasing artifacts), the decoder maintains consistent resolution across all deep features and performs waveform reconstruction via the inverse Fourier transform. The core workflow is:

  • The quantized intermediate signal \(Z_q\) is processed through a Conv1D, an attention module, and ConvNeXt blocks.
  • The output is split into a magnitude component \(q\) (in the log domain, given the exponential in the next step) and a phase component \(p\).
  • The complex spectrum is reconstructed via \(STFT = \exp(q) \cdot (\cos p + j\sin p)\).
  • The final audio waveform is obtained through the inverse Fourier transform.
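
The spectrum reconstruction step can be illustrated per frequency bin with the standard library alone. This is a small sketch, assuming the exponentiated component is a log-magnitude and the other is a phase in radians; the input values are made up, and the inverse STFT that would turn frames of such coefficients into a waveform is left out.

```python
import cmath
import math


def complex_bin(q: float, p: float) -> complex:
    """Combine a log-magnitude q and a phase p into one complex STFT coefficient."""
    magnitude = math.exp(q)  # exponentiation keeps the magnitude positive
    return magnitude * complex(math.cos(p), math.sin(p))


# Hypothetical decoder outputs for three frequency bins of one frame.
log_magnitudes = [0.0, -1.2, 0.5]
phases = [0.0, math.pi / 2, -math.pi / 3]

spectrum = [complex_bin(q, p) for q, p in zip(log_magnitudes, phases)]

for q, p, s in zip(log_magnitudes, phases, spectrum):
    # The same coefficient can be written compactly as exp(q + j*p).
    assert cmath.isclose(s, cmath.exp(complex(q, p)), rel_tol=1e-12, abs_tol=1e-12)
```

Because the network predicts spectral coefficients at the latent frame rate, no learned upsampling layers are needed; the change of time resolution is handled entirely by the inverse transform.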

3. Discriminators

During training, a combination of four discriminators is employed: the Multi-Period Discriminator (MPD), Multi-Resolution Discriminator (MRD), Multi-Scale Discriminator (MSD), and Complex STFT Discriminator, using the hinge loss as the adversarial loss.
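
For reference, a common form of the hinge adversarial loss is sketched below on plain lists of discriminator scores. The exact formulation and weighting used in the paper are not given in this summary, so the generator term `max(0, 1 - D(x_hat))` is an assumption (it is the form used by several neural codecs; other GANs use `-D(x_hat)` instead). With several discriminators, such terms would typically be summed or averaged over all of them.

```python
def hinge_discriminator_loss(real_scores, fake_scores):
    """Push scores on real audio above +1 and scores on generated audio below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def hinge_generator_loss(fake_scores):
    """Push the discriminator's scores on generated audio above +1 (assumed form)."""
    return sum(max(0.0, 1.0 - s) for s in fake_scores) / len(fake_scores)


# Made-up scores from a single discriminator on two real and two generated clips.
real_scores = [1.5, 0.5]
fake_scores = [-2.0, 0.5]

print(hinge_discriminator_loss(real_scores, fake_scores))  # 1.0
print(hinge_generator_loss(fake_scores))                   # 1.75
```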

Key Experimental Results

Main Results

On the LibriTTS Test-Clean dataset (3.0kbps / 4 codebooks):

| Model | UTMOS ↑ | PESQ ↑ | STOI ↑ | V/UV F1 ↑ | SPK ↑ |
| --- | --- | --- | --- | --- | --- |
| Encodec | 2.3070 | 2.0517 | 0.9007 | 0.9198 | 0.7860 |
| Vocos | 3.5390 | 2.4026 | 0.9231 | 0.9358 | 0.7892 |
| SpeechTokenizer | 3.5632 | 1.9311 | 0.8778 | 0.9273 | 0.6587 |
| DAC | 2.9902 | 2.4091 | 0.9118 | 0.9531 | 0.8129 |
| Language-Codec | Superior | Superior | Superior | Superior | Superior |

Using only 4 codebook channels, Language-Codec comprehensively outperforms competing methods across all metrics.
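
The 3.0kbps figure is consistent with 4 codebooks at 75 frames per second if each codebook holds 1024 entries (10 bits per token). The codebook size is not stated in this summary, so treat 1024 as an assumption; the helper name is illustrative.

```python
from math import log2


def bitrate_kbps(frames_per_second: float, n_codebooks: int, codebook_size: int) -> float:
    """Bitrate of an RVQ token stream in kilobits per second."""
    bits_per_token = log2(codebook_size)  # 1024 entries -> 10 bits
    return frames_per_second * n_codebooks * bits_per_token / 1000


FRAMES_PER_SECOND = 75   # latent frame rate of the encoder
N_CODEBOOKS = 4          # codebook channels used by Language-Codec
CODEBOOK_SIZE = 1024     # assumed, not stated in this summary

print(FRAMES_PER_SECOND * N_CODEBOOKS)                               # 300 tokens per second
print(bitrate_kbps(FRAMES_PER_SECOND, N_CODEBOOKS, CODEBOOK_SIZE))   # 3.0
```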

Key Findings

  • Using only 4 codebook channels (3.0kbps), Language-Codec reaches the reconstruction quality that other methods need 8 codebook channels to achieve.
  • The MCRVQ mechanism effectively distributes information evenly across codebook layers, significantly reducing the information density of the first layer.
  • Its effectiveness is also validated in downstream zero-shot TTS tasks, indicating that the improved codec representations indeed benefit the modeling of speech language models.

Highlights & Insights

  1. Unique Perspective: For the first time, a codec is designed starting from the requirements of downstream speech language models, rather than solely pursuing reconstruction quality.
  2. Elegant MCRVQ Design: A parallel masking mechanism distributes information evenly across all codebook channels while keeping the model end-to-end trainable.
  3. Modern Decoder: It adopts the Fourier transform instead of transposed convolutional upsampling, avoiding aliasing artifacts while achieving high-quality reconstruction.
  4. High Practical Value: Having only 4 codebook channels means the modeling complexity of downstream language models is significantly reduced.

Limitations & Future Work

  • The paper primarily evaluates on speech data; its generalization capability to music and general audio has not been fully explored.
  • The configuration of \(N_q=3\) in MCRVQ lacks sufficient ablation analysis, and the impact of different values remains unclear.
  • The training data scale reaches 50,000 hours, and its effectiveness on smaller scales of data is yet to be verified.
  • Fair comparisons with recent methods like SemantiCodec are limited (due to large differences in inference speed).

Related Work

  • SoundStream / Encodec: Classic RVQ-based codecs, serving as key baselines for this work.
  • Vocos: A Fourier transform-based neural vocoder, from which the decoder in this work borrows its structure.
  • DAC: Introduces factorized codes and quantizer dropout, standing as one of the state-of-the-art codecs.
  • SpeechTokenizer: Introduces the concept of semantic tokens in the first-layer channel.
  • VALL-E: A zero-shot TTS system based on discrete codec tokens, serving as the primary downstream validation scenario.

Rating

  • Novelty: ⭐⭐⭐⭐ — The MCRVQ mechanism is novel, and designing codecs from the perspective of downstream models is noteworthy.
  • Practicality: ⭐⭐⭐⭐⭐ — Reducing the number of codebooks brings direct practical value to downstream speech generation tasks.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation across multiple datasets and metrics is relatively comprehensive, but the ablation study could be deeper.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with sufficient technical details.