LeVo: High-Quality Song Generation with Multi-Preference Alignment

Conference: NeurIPS 2025 · arXiv: 2506.07520 · Code: GitHub · Area: Audio & Speech Generation · Keywords: song generation, language model, multi-preference alignment, DPO, music codec

TL;DR

This paper proposes LeVo, a song generation framework that employs a language model to jointly model mixed tokens and dual-track tokens, thereby reconciling vocal-accompaniment harmony with audio quality. It further introduces a DPO-based multi-preference alignment method to enhance musicality and instruction-following capability.

Background & Motivation

Background: Advances in LLMs and audio language models have accelerated progress in lyrics-to-song generation. Jukebox pioneered the paradigm of predicting mixed tokens with language models; YuE introduced dual-track (vocal + accompaniment) token prediction; SongGen explored interleaved prediction patterns. Industrial systems (Suno, Mureka, Udio) have demonstrated strong results but remain technically closed.

Limitations of Prior Work:

  • Mixed-token methods have limited vocabularies and cannot fully capture complex vocal-accompaniment combinations, constraining audio quality.
  • Dual-track token methods achieve higher audio quality but predict each track independently, leading to vocal-accompaniment disharmony.
  • Interleaved prediction patterns drastically increase sequence length, limiting scalability and long-form song generation.
  • Inconsistent data quality and unreliable music annotations constrain training effectiveness.

Key Challenge: Mixed tokens ensure harmony but limit audio quality; dual-track tokens offer better audio quality but suffer from poor harmony. Scarcity of high-quality data further restricts musicality and instruction-following capability.

Goal: Simultaneously optimize song generation in terms of audio quality, musicality, instruction-following capability, and vocal-accompaniment harmony.

Key Insight: Parallel modeling of mixed tokens and dual-track tokens, combined with modular progressive training and multi-preference DPO alignment.

Core Idea: Mixed tokens govern global harmony while dual-track tokens refine audio quality; the two operate in parallel without interfering with each other.

Method

Overall Architecture

LeVo = LeLM (language model) + Music Codec

LeLM consists of a large language model (predicting mixed tokens) and an AR decoder (predicting dual-track tokens). The Music Codec extends MuCodec: an encoder extracts tokens, and a decoder (diffusion Transformer + VAE decoder) reconstructs high-fidelity audio from tokens.

Key Designs

  1. LeLM Two-Level Architecture:

    • Language Model (decoder-only Transformer): focuses on next-token prediction of mixed tokens, capturing high-level structural information such as melody, rhythm, and beat, ensuring vocal-accompaniment harmony: \(p(\mathbf{S}_m | \mathbf{C}; \boldsymbol{\theta}) = \prod_{t=0}^T p(\mathbf{S}_{m,t} | \mathbf{S}_{m,<t}, \mathbf{C}; \boldsymbol{\theta})\)
    • AR Decoder (a shallower decoder-only Transformer): conditioned on the language model's hidden states, it predicts vocal and accompaniment tokens in parallel. A delay pattern is introduced so that when predicting step \(t\), the dual-track tokens can attend to the language model's future \(k\) steps: \(p(\mathbf{S}_v, \mathbf{S}_a | \mathbf{C}; \boldsymbol{\theta}) = \prod_{t=0}^{T-k} p(\mathbf{S}_{v,t}, \mathbf{S}_{a,t} | \mathbf{S}_{v,<t}, \mathbf{S}_{a,<t}, \mathbf{S}_{m,<t+k}, \mathbf{C}; \boldsymbol{\theta})\)
  2. Music Codec Design:

    • Encoder = MuEncoder (extracts music-relevant representations) + RVQ (quantizes into tokens)
    • Decoder = diffusion Transformer (reconstructs VAE features from token embeddings) + VAE decoder (directly generates audio)
    • Two token strategies: mixed tokens (processing the full song) and dual-track tokens (separating vocals and accompaniment before encoding each independently)
  3. DPO-Based Multi-Preference Alignment:

    • Data Construction: An LLM generates 20,000 lyrics; each is paired with a random audio prompt and text description, and multiple samples are generated per entry.
    • Strategy 1 — Lyrics Alignment Preference: Phoneme error rate (PER) is computed via ASR; pairs with a PER gap exceeding 40 are used as preference pairs.
    • Strategy 2 — Prompt Consistency Preference: The MuQ-MuLan model computes similarity scores; threshold-based filtering selects preference pairs.
    • Strategy 3 — Musicality Preference: A three-stage pipeline — crowdsourced ranking → reward model training → large-scale filtering — yields approximately 60,000 preference pairs.
    • Fusion Method — Deep Network Interpolation (DNI): Three sets of DPO-fine-tuned parameters are obtained separately on each preference dataset and linearly interpolated into the final model, supporting controllable coefficient adjustment.
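The delay pattern in the AR decoder can be viewed as a boolean attention mask over the language model's mixed-token states: dual-track step \(t\) may attend to mixed-token positions \(s < t + k\), matching the conditioning on \(\mathbf{S}_{m,<t+k}\) in the factorization above. A minimal NumPy sketch (the helper name is hypothetical; the paper does not publish this code):

```python
import numpy as np

def delayed_context_mask(T: int, k: int) -> np.ndarray:
    """Mask over the LM's mixed-token states for the AR decoder.

    Entry (t, s) is True when dual-track step t may attend to mixed-token
    step s. With delay k, step t sees mixed tokens s < t + k, so the
    decoder looks k steps ahead of its own position. Illustrative only.
    """
    steps = np.arange(T - k)   # dual-track steps 0 .. T-k-1
    mixed = np.arange(T)       # mixed-token steps 0 .. T-1
    return mixed[None, :] < (steps[:, None] + k)

mask = delayed_context_mask(T=6, k=2)
# Already at step 0, the dual-track decoder sees mixed-token steps 0 and 1.
```

The mask makes the trade-off concrete: the dual-track tokens always lag the mixed tokens by \(k\) steps, so high-level structure is decided first and the track-level refinement follows.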

Loss & Training

Three-Stage Training Paradigm:

  • Stage 1 — Pre-training: The language model is trained to align conditioning inputs with mixed tokens; the AR decoder is frozen; audio prompts and text descriptions are each randomly dropped with 50% probability. This stage establishes generation diversity and vocal-accompaniment harmony.
  • Stage 2 — Modular Progressive Training: The AR decoder is trained to model dual-track tokens; all Stage 1 modules are frozen. This improves audio quality and musicality without disturbing pre-trained knowledge.
  • Stage 3 — Multi-Preference Alignment: The full LeLM is fine-tuned using DPO loss on multi-dimensional preference data, substantially enhancing musicality and instruction-following capability.
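The Stage 3 objective for a single preference pair follows the standard DPO form. A minimal sketch, assuming summed token log-probabilities of the chosen and rejected songs under the policy and a frozen reference model (the paper's exact \(\beta\) is not assumed here):

```python
import math

def dpo_loss(logp_w_policy: float, logp_l_policy: float,
             logp_w_ref: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin).

    logp_* are summed token log-probs of the chosen (w) and rejected (l)
    samples under the policy and the frozen reference model. Computed as
    log1p(exp(-margin)) for numerical stability. Illustrative sketch.
    """
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    return math.log1p(math.exp(-margin))

# When the policy matches the reference, the margin is 0 and the loss is log 2;
# raising the chosen sample's relative likelihood drives the loss below that.
```

The same loss is applied to pairs from all three preference datasets; only the data source differs between the three fine-tuning runs.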

Model scale: LeLM ~2B parameters, MuEncoder 300M parameters, diffusion model ~700M parameters, VAE 150M parameters. Training data: 2 million songs (~110,000 hours).

Key Experimental Results

Main Results

Objective metric comparison (open-source and closed-source systems):

| Model | FAD ↓ | MuQ-T ↑ | MuQ-A ↑ | PER ↓ | CE ↑ | CU ↑ | PQ ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Suno-V4.5 | 2.59 | 0.34 | 0.84 | 21.6 | 7.65 | 7.86 | 8.35 |
| Mureka-O1 | 2.50 | 0.33 | 0.87 | 7.2 | 7.71 | 7.83 | 8.44 |
| YuE | 2.65 | 0.27 | 0.74 | 36.4 | 7.13 | 7.39 | 7.77 |
| SongGen* | 2.68 | 0.25 | 0.80 | 27.5 | 7.63 | 7.79 | 8.37 |
| LeVo | 2.68 | 0.34 | 0.83 | 7.2 | 7.78 | 7.90 | 8.46 |

Subjective MOS (1–5 scale):

| Model | OVL | MEL | HAM | SSC | AQ | LYC |
| --- | --- | --- | --- | --- | --- | --- |
| Suno-V4.5 | 3.59 | 4.10 | 3.93 | 4.19 | 4.00 | 3.17 |
| Mureka-O1 | 3.42 | 3.88 | 3.89 | 4.14 | 3.87 | 3.32 |
| LeVo | 3.42 | 3.93 | 3.90 | 4.09 | 3.96 | 3.38 |
| SongGen* | 2.91 | 3.43 | 3.44 | 3.66 | 3.69 | 2.84 |

Ablation Study

Comparison of DPO multi-preference alignment strategies:

| Method | FAD ↓ | MuQ-T ↑ | PER ↓ | CE ↑ | PQ ↑ |
| --- | --- | --- | --- | --- | --- |
| w/o DPO | 2.60 | 0.31 | 10.6 | 7.70 | 8.39 |
| Strategy 1 only (lyrics alignment) | 2.85 | 0.30 | 6.5 | 7.72 | 8.42 |
| Strategy 2 only (prompt consistency) | 2.89 | 0.34 | 10.3 | 7.75 | 8.43 |
| Strategy 3 only (musicality) | 2.63 | 0.32 | 11.2 | 7.78 | 8.45 |
| Joint training | 2.75 | 0.33 | 7.5 | 7.76 | 8.43 |
| LeVo (interpolation) | 2.68 | 0.34 | 7.2 | 7.78 | 8.46 |

Key Findings

  • Removing either Stage 2 (modular progressive training) or the AR decoder leads to performance degradation, validating the necessity of preventing interference between mixed and dual-track tokens.
  • The interpolation fusion method outperforms naive joint training; each strategy contributes distinctly, and the approach supports smooth preference transitions.
  • LeVo surpasses Suno-V4.5 on lyrics alignment (LYC) by +0.21 MOS.
  • LeVo achieves comprehensive superiority among open-source models and competitive performance compared to closed-source industrial systems.
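The interpolation fusion behind these findings reduces to a parameter-wise convex combination of the three per-preference DPO checkpoints. A minimal sketch; the function name and flat parameter layout are illustrative, not LeVo's released code:

```python
def interpolate_checkpoints(checkpoints: list[dict], coeffs: list[float]) -> dict:
    """Deep Network Interpolation over per-preference DPO checkpoints.

    checkpoints: one parameter dict (name -> weight) per DPO fine-tuning run.
    coeffs: interpolation coefficients, one per checkpoint, summing to 1;
    adjusting them trades off the preference dimensions smoothly.
    """
    assert abs(sum(coeffs) - 1.0) < 1e-8, "coefficients must sum to 1"
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(c * ckpt[name] for c, ckpt in zip(coeffs, checkpoints))
    return merged

# Toy example: blend the lyrics-, prompt-, and musicality-tuned variants equally.
lyrics_ckpt, prompt_ckpt, music_ckpt = {"w": 0.0}, {"w": 3.0}, {"w": 6.0}
final = interpolate_checkpoints([lyrics_ckpt, prompt_ckpt, music_ckpt],
                                [1 / 3, 1 / 3, 1 / 3])
```

Because interpolation happens purely in parameter space, no extra training is needed to re-weight the preferences after the three DPO runs finish.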

Highlights & Insights

  • Mixed + dual-track parallel modeling paradigm: Elegantly resolves the tension between harmony and audio quality without increasing sequence length.
  • Modular progressive training strategy: Freezing pre-trained modules before training new ones effectively prevents catastrophic forgetting and token interference.
  • First application of multi-preference DPO to song generation: Each of the three preference data construction strategies is tailored to a distinct objective (lyrics alignment, prompt consistency, musicality).
  • DNI parameter interpolation: Not only combines multiple preferences but also provides controllable preference weight adjustment.
  • Industrial-grade benchmarking: The first academic open-source method to achieve performance comparable to industrial systems such as Suno.

Limitations & Future Work

  • Audio quality remains constrained by inconsistent training data quality and the information bottleneck of discrete tokens.
  • Song structure modeling (verse/chorus, etc.) remains inferior to Suno and Mureka.
  • The precision of pseudo-label annotations (text descriptions generated by Qwen2-Audio) is limited, capping the upper bound of instruction-following performance.
  • Training data are not released due to copyright restrictions, limiting reproducibility.
  • Style transfer and end-to-end generation capabilities may be exploited for deepfake applications.

Related Work

  • Jukebox (Dhariwal et al., 2020): Pioneered the mixed-token LM paradigm for song generation.
  • YuE (Yuan et al., 2025): Introduced the dual-track token strategy.
  • SongGen (Liu et al., 2025): Explored interleaved prediction patterns.
  • MusicRL (Cideron et al., 2024): Applied RLHF to music generation.
  • Tango2 (Majumder et al., 2024): Pioneering work on DPO for audio generation.
  • The adaptation of Deep Network Interpolation (DNI) to the music domain represents a technical direction worthy of further attention.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The mixed + dual-track parallel modeling and multi-preference DPO design demonstrate significant originality.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 12 comparison systems, complete subjective and objective evaluation, and detailed ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with thorough technical descriptions.
  • Value: ⭐⭐⭐⭐⭐ A milestone contribution to open-source song generation that substantially narrows the gap between academic research and industrial systems.