Adapting Speech Language Model to Singing Voice Synthesis¶
Conference: NeurIPS 2025 (Workshop) arXiv: 2512.14657 Code: https://tsukasane.github.io/SLMSVS/ Area: Speech Generation Keywords: Speech Language Model, SVS, Flow Matching, Codec, Singing Voice Synthesis
TL;DR¶
This paper adapts a 1.7B-parameter TTS-pretrained Speech Language Model to the Singing Voice Synthesis (SVS) task via score tokenization, multi-stream LM prediction, conditional flow matching refinement, and a vocoder. Using only 135 hours of synthesized singing data, the system achieves performance comparable to dedicated SVS systems.
Background & Motivation¶
Background: Speech Language Models (SLMs) have emerged as a unified paradigm for speech tasks such as TTS, ASR, and speech enhancement, yet their generalization capability to singing voice synthesis remains unexplored.
Limitations of Prior Work:
- Public SVS datasets are extremely scarce due to copyright restrictions and high annotation costs, making it infeasible to train large models from scratch.
- SVS inputs are structured musical scores (phonemes + pitch + duration), which are considerably more complex than the plain text inputs used in TTS.
- Codec decoders pretrained on speech cannot faithfully resynthesize singing voice, imposing a hard performance ceiling.
Key Challenge: Reconciling the generalization potential of large-scale SLMs with the scarcity of SVS data.
Goal: To investigate whether a TTS-pretrained SLM can be adapted to SVS at low cost.
Key Insight: Tokenize the score-based conditions and incorporate them into the SLM vocabulary, then fine-tune the model and apply flow matching for acoustic refinement.
Core Idea: Leverage a TTS-pretrained SLM combined with flow matching refinement to address the low-resource challenge in singing voice synthesis.
Method¶
Overall Architecture¶
Input: Musical score (phonemes + MIDI pitch + duration) and a speaker prompt. The pipeline proceeds as follows: (1) the score is tokenized into 50 FPS discrete tokens; (2) audio is encoded into multi-stream tokens via a codec encoder and an SSL model; (3) the LM predicts the target token sequence; (4) flow matching maps the LM-predicted codec tokens to a mel spectrogram; (5) a HiFi-GAN vocoder synthesizes the final waveform.
Key Designs¶
- **Score Tokenization (`svs_lb`):**
    - Function: Encodes phonemes, MIDI pitch, and duration into frame-level discrete tokens.
    - Mechanism: Each frame is represented as a (phoneme_token, pitch_token) tuple; duration is implicitly encoded via the repetition count \(\text{repeat} = (\text{end} - \text{start}) \times \text{fps}\). A new `svs_lb` modality is introduced to extend the TTS vocabulary.
    - Design Motivation: Maintains consistency with the SLM's token prediction paradigm and reuses the TTS-pretrained encoder.
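The frame-level expansion above can be sketched as follows. This is a minimal illustration of the repeat-count encoding, not the paper's actual implementation; the function name and note format are assumptions.

```python
FPS = 50  # frame rate of the paper's tokenizer (50 FPS)

def tokenize_score(notes, fps=FPS):
    """Expand (phoneme, midi_pitch, start_sec, end_sec) note events into
    frame-level (phoneme, pitch) token tuples; duration is encoded purely
    by how many times each tuple is repeated."""
    frames = []
    for phoneme, pitch, start, end in notes:
        repeat = round((end - start) * fps)  # repeat = (end - start) * fps
        frames.extend([(phoneme, pitch)] * repeat)
    return frames

# One note: phoneme "a" at MIDI pitch 60 lasting 0.1 s -> 5 frames at 50 FPS
print(len(tokenize_score([("a", 60, 0.0, 0.1)])))  # 5
```

In practice the phoneme and pitch symbols would be mapped to integer IDs in the extended `svs_lb` vocabulary; the tuple form is kept here for readability.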
- **Multi-stream LM Token Prediction:**
    - Function: Uses the 1.7B SLM to predict concatenated SSL and 8-layer codec tokens.
    - Mechanism: Built on ESPNet-SpeechLM; the model takes score conditions and a speaker prompt as input and is trained with cross-entropy loss over frame-level SSL+codec tokens.
    - Design Motivation: SSL tokens capture high-level semantics while codec tokens encode acoustic details; concatenating both combines their complementary strengths.
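A rough sketch of the per-frame target layout: each frame pairs one SSL token with the 8 codec-layer tokens, giving the LM a 9-stream prediction target. The layout below is illustrative; ESPNet-SpeechLM's actual stream packing and vocabulary sizes may differ.

```python
import numpy as np

T, CODEC_LAYERS = 4, 8  # 4 example frames, 8 codec quantizer layers
rng = np.random.default_rng(0)

# Stand-in token IDs (real systems produce these from an SSL model and
# a neural codec encoder, respectively).
ssl_tokens = rng.integers(0, 1024, size=(T, 1))               # (T, 1)
codec_tokens = rng.integers(0, 1024, size=(T, CODEC_LAYERS))  # (T, 8)

# Per-frame concatenation: one SSL stream + eight codec streams.
target = np.concatenate([ssl_tokens, codec_tokens], axis=1)   # (T, 9)
print(target.shape)  # (4, 9)
```

Training then applies a cross-entropy loss per stream over these frame-level targets, conditioned on the score tokens and speaker prompt.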
- **Flow Matching Refinement:**
    - Function: Refines the noisy codec tokens predicted by the LM into clean mel spectrograms.
    - Mechanism: Conditional Flow Matching (CFM) starts from Gaussian noise and learns a velocity field conditioned on codec tokens and pitch signals to transport samples toward the target mel distribution, using the linear interpolation path \(\psi_t(x|x_1) = (1-t)x + tx_1\).
    - Design Motivation: Tokens directly predicted by the LM are noisy, causing temporal discontinuities and perceptual artifacts; furthermore, the codec decoder pretrained on speech cannot faithfully resynthesize singing. Flow matching circumvents both bottlenecks.
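The CFM training target for the linear path \(\psi_t(x|x_1) = (1-t)x + tx_1\) can be written out in a few lines. This is a generic CFM sketch (conditioning on codec tokens and pitch is omitted), with an 80-dim vector standing in for a mel frame.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((80,))  # Gaussian noise sample (path start)
x1 = rng.standard_normal((80,))  # stand-in target mel frame (path end)
t = rng.uniform()                # random time in [0, 1)

# Point on the linear interpolation path psi_t(x0 | x1) = (1-t)*x0 + t*x1
x_t = (1 - t) * x0 + t * x1

# For this path the target velocity is constant along the trajectory:
v_target = x1 - x0

# The network v_theta(x_t, t, cond) would regress v_target with MSE:
# loss = np.mean((v_pred - v_target) ** 2)
print(x_t.shape)
```

At inference, integrating the learned velocity field from noise at \(t=0\) to \(t=1\) produces the refined mel spectrogram, which the vocoder then converts to a waveform.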
Loss & Training¶
- LM fine-tuning: Cross-entropy loss, maximizing \(P(s \mid m, p)\), the likelihood of the target token sequence \(s\) given the score \(m\) and speaker prompt \(p\).
- Flow matching: MSE loss on the conditional velocity field.
- A HiFi-GAN vocoder with parameters aligned to the codec STFT is additionally trained.
Key Experimental Results¶
Main Results¶
Evaluated on the ACE-Opencpop dataset (135 hours of synthesized singing voice).
| Method | F0_RMSE↓ | F0_CORR↑ | MCD↓ | PER↓ | SingMOS↑ |
|---|---|---|---|---|---|
| XiaoiceSing | 71.67 | 0.62 | 11.47 | 0.09 | 3.88 |
| TokSing | 55.83 | 0.67 | 6.77 | 0.19 | 4.08 |
| LM + Flow + Voc (ours) | 62.79 | 0.60 | 7.86 | 0.36 | 4.09 |
The proposed system achieves SingMOS (perceptual quality) on par with the best dedicated SVS system, TokSing.
Ablation Study¶
| Configuration | MCD↓ | PER↓ | SingMOS↑ | Notes |
|---|---|---|---|---|
| LM + CD (codec decoder) | 8.26 | 0.56 | 3.65 | Direct codec decoding; poor quality |
| LM + Flow1 + CD | 8.44 | 0.45 | 3.64 | Flow refinement but still uses codec decoder |
| LM + Flow1 + Voc | 7.86 | 0.36 | 4.09 | Flow + dedicated vocoder; best overall |
| CD Resynthesis (upper bound) | 5.84 | 0.19 | 3.95 | Upper bound of the codec decoder |
Key Findings¶
- The codec decoder is the dominant bottleneck, as a decoder pretrained on speech is ill-suited for singing voice.
- Flow matching refinement combined with a dedicated vocoder yields substantial quality improvements (SingMOS: 3.65 → 4.09).
- LM + Flow + Voc even surpasses the codec resynthesis SingMOS upper bound (4.09 vs. 3.95), indicating that flow matching can compensate for codec decoder deficiencies.
- PER remains higher than dedicated systems, suggesting room for improvement in lyric intelligibility.
Highlights & Insights¶
- Cross-task generalization of SLMs: Adapting a TTS SLM to SVS with only 135 hours of data validates the generalization potential of large pretrained models.
- Flow matching as a decoding bridge: Elegantly resolves the domain mismatch of the pretrained codec decoder, serving as a general-purpose domain gap bridging strategy.
- Complementary two-stage design: The LM handles sequence modeling (temporal structure) while flow matching handles acoustic quality (spectral detail), with clearly delineated responsibilities.
Limitations & Future Work¶
- PER remains high (0.36 vs. TokSing's 0.19), indicating insufficient lyric pronunciation clarity.
- F0 correlation is lower than TokSing's (0.60 vs. 0.67), leaving room for improvement in pitch accuracy.
- Evaluation is limited to Mandarin singing (Opencpop); multilingual generalization has not been verified.
- The 135 hours of training data are synthesized; performance on real singing recordings remains unknown.
Related Work & Insights¶
- vs. TokSing: A dedicated SVS system that achieves better objective metrics but comparable perceptual quality. The proposed approach benefits from reusing TTS pretraining.
- vs. XiaoiceSing: A conventional end-to-end SVS system with the lowest PER but highest MCD and weaker timbre modeling.
- vs. ESPNet-SpeechLM: The proposed method builds on its framework and demonstrates its multi-task extensibility.
Rating¶
- Novelty: ⭐⭐⭐ The idea of adapting SLMs to SVS is interesting, but the technical contribution is incremental.
- Experimental Thoroughness: ⭐⭐⭐ Ablations are thorough, but evaluation is limited to a single dataset.
- Writing Quality: ⭐⭐⭐⭐ Concise and clear (workshop paper).
- Value: ⭐⭐⭐ Validates cross-task generalization of SLMs and offers insights for low-resource speech generation.