SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition¶

Conference: ACL 2025
arXiv: 2402.17645
Code: Project Page
Area: LLM/NLP
Keywords: Song composition, Large Language Model, Melody generation, Lyric generation, Symbolic music representation

TL;DR¶

SongComposer is the first music-specific Large Language Model capable of generating lyrics and melodies simultaneously. Using a word-level aligned tuple format, a music-knowledge-based scalar pitch initialization, and progressive structure-aware training (motif -> independent song -> phrase-level pairing), it outperforms GPT-4 on lyric-to-melody, melody-to-lyric, song continuation, and text-to-song generation.

Background & Motivation¶

Background: Symbolic song composition aims to generate vocal tracks, consisting of lyrics and melodies, as symbolic sequences, and is a core task of song generation. Subtasks such as lyric generation, melody generation, lyric-to-melody, and melody-to-lyric have each progressed recently, but no unified framework handles lyrics and melodies together.

Limitations of Prior Work: (1) Traditional models like SongMASS and TeleMelody can only handle a single subtask, failing to cover all composition needs within a single model; (2) Directly using LLMs for song composition faces three major challenges: unexplored alignment methods for lyrics and melodies, difficulty in modeling hierarchical structures (motifs and phrases) of songs, and scarcity of high-quality paired datasets.

Design Motivation: The symbolic representation of songs shares structural similarities with natural language, and the instruction-following capabilities of LLMs allow the integration of multiple subtasks into a single model. However, three key technical challenges must be addressed: symbolic representation design, pitch understanding, and structural modeling.

Method¶

Overall Architecture¶

SongComposer is built upon InternLM2-7B and introduces three core innovations: a word-level tuple representation format, scalar pitch initialization, and three-stage progressive training. The training data is derived from the custom-built SongCompose dataset (280K lyric-only + 20K melody-only + 8K paired data).

Key Designs¶

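1. Word-level Tuple Format: The melody is decomposed into three attributes: pitch $p$, note duration $d$, and rest duration $r$. In the paired data, each note is aligned with its corresponding word in the <p>,d,w format. Durations are quantized in units of 1/16 beat: $$d_k = \phi\left(\frac{\text{bpm}}{60}(\text{note-end}_k - \text{note-start}_k) \times 16\right)$$ where $\text{note-start}_k$ and $\text{note-end}_k$ are the onset and offset times of the $k$-th note, the factor $\frac{\text{bpm}}{60}$ converts seconds to beats, and $\phi$ denotes the quantization function.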
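
The duration formula above can be sketched in a few lines. This is a minimal illustration, assuming note times are given in seconds and that $\phi$ rounds to the nearest integer number of 1/16-beat units; the function name `quantize_duration` is hypothetical, and the quantization actually used in the paper may differ.

```python
def quantize_duration(note_start: float, note_end: float, bpm: float) -> int:
    """Convert a note's length in seconds to an integer count of 1/16-beat units."""
    beats = (bpm / 60.0) * (note_end - note_start)  # seconds -> beats
    return round(beats * 16)  # assumed phi: round to the nearest 1/16-beat unit


# At 120 bpm, a note lasting 0.5 s spans exactly one beat, i.e. 16 units.
print(quantize_duration(0.0, 0.5, 120))  # 16
```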

2. Scalar Pitch Initialization: The embedding of the central pitch <66> is first initialized from a Gaussian distribution, and the remaining pitch embeddings are defined as scalar multiples of this central embedding: $$\text{emb}(\langle p \rangle) = c_p \cdot \text{emb}(\langle 66 \rangle)$$ where the scaling factors $c_p$ take the values $[-\ln(e+17), \cdots, -\ln(e), \ln(e), \cdots, \ln(e+17)]$. This explicitly encodes the numerical relationships among pitches.
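
A minimal sketch of this initialization follows. It assumes the MIDI pitch range 48-83 and assigns the 36 listed factors in ascending pitch order, so that pitches 48-65 receive the negative factors and pitches 66-83 the positive ones (giving $c_{66} = \ln(e) = 1$). That assignment, the standard deviation `std`, and the names `scale_factor` and `init_pitch_embeddings` are assumptions made for illustration, not details confirmed by the source.

```python
import math
import random

CENTER = 66          # central pitch token <66>
LOW, HIGH = 48, 83   # assumed pitch range (36 pitches)


def scale_factor(p: int) -> float:
    """Assumed mapping from pitch to c_p: signed log factors around the center."""
    if p >= CENTER:
        return math.log(math.e + (p - CENTER))       # ln(e) ... ln(e+17)
    return -math.log(math.e + (CENTER - 1 - p))       # -ln(e) ... -ln(e+17)


def init_pitch_embeddings(dim: int, seed: int = 0, std: float = 0.02):
    """Draw one Gaussian vector for <66> and scale it for every other pitch."""
    rng = random.Random(seed)
    center = [rng.gauss(0.0, std) for _ in range(dim)]
    table = {p: [scale_factor(p) * x for x in center] for p in range(LOW, HIGH + 1)}
    return table, center


table, center = init_pitch_embeddings(dim=8)
print(len(table))  # 36
```

Because every row is a multiple of the same vector, the embeddings start out collinear and ordered by pitch; training is then free to move them apart.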

3. Progressive Structure-Aware Training:
- Stage 1 - Motif-level Melody Training: Highly repetitive short note sequences are extracted as motif data to allow the model to learn basic repetition patterns.
- Stage 2 - Independent Full-song Training: The model is trained on lyric-only and melody-only datasets respectively to develop full-song comprehension.
- Stage 3 - Phrase-level Paired Training: Five phrase-specific tokens (intro/verse/chorus/bridge/outro) are introduced into the paired data, allowing the model to learn the structural sections of songs.
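
Stage 1 relies on extracting highly repetitive short note sequences. One simple way to approximate this is to count fixed-length pitch n-grams and keep those that recur at least a threshold number of times. This is an assumption made here for illustration and not the paper's documented procedure. The names `extract_motifs`, `length`, and `threshold` are hypothetical.

```python
from collections import Counter


def extract_motifs(pitches, length=4, threshold=2):
    """Return pitch n-grams of `length` that occur at least `threshold` times."""
    grams = (tuple(pitches[i:i + length]) for i in range(len(pitches) - length + 1))
    counts = Counter(grams)
    return {gram: n for gram, n in counts.items() if n >= threshold}


# A toy melody in which the figure 60-62-64 is stated three times.
melody = [60, 62, 64, 60, 62, 64, 60, 62, 64, 67]
print(extract_motifs(melody, length=3, threshold=3))  # {(60, 62, 64): 3}
```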

Loss & Training¶

The standard autoregressive next-token prediction loss is employed to maximize the log-likelihood of tokens given their context.
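
Written out in the usual notation, this objective takes the following form. The symbols $x_t$ (the $t$-th token), $T$ (sequence length), and $\theta$ (model parameters) are introduced here for illustration and do not come from the source.

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$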

Experiments¶

Main Results: Lyric-to-Melody and Melody-to-Lyric¶

Method	PD(%)↑	DD(%)↑	MD↓	Cosine Dist.↑	ROUGE-2↑	BS↑
SongMASS	30.34	48.98	2.95	0.568	0.204	0.532
TeleMelody	46.81	51.77	2.60	-	-	-
GPT-3.5	31.24	38.52	3.01	0.641	0.142	0.603
GPT-4	36.43	42.94	2.87	0.654	0.158	0.610
SongComposer	50.75	57.71	2.20	0.697	0.234	0.657

Ablation Study¶

Ablation Dimension	Detailed Results
Pitch Initialization	Scalar initialization achieves the best MD (2.33) compared with the Interpolation, Average, and Gaussian alternatives (3.07/3.41/2.90)
Motif Repetition Threshold	Optimal RR-MD balance is achieved at a threshold of 10 (RR=2.03, MD=2.33)
Alignment Granularity	Word-level > Line-level > Song-level (MD: 2.12 vs 2.42 vs 3.71)
Phrase-level Tokens	Continuation performs better with phrase tokens (MD: 2.12 vs 2.58, BS: 0.662 vs 0.612)

Subjective Evaluation¶

Method	Lyric-to-Melody HMY.	Lyric-to-Melody MLC.	Text-to-Song OVL.	Text-to-Song REL.
GPT-3.5	1.68	1.88	2.53	2.95
GPT-4	2.82	2.79	2.43	3.27
SongComposer	3.82	3.76	3.41	3.88

Key Findings¶

Effectiveness of a Unified Framework: A single model comprehensively outperforms GPT-4 and specialized models across all four subtasks.
Critical Role of Scalar Initialization: Explicitly encoding numerical relations among pitches achieves significantly better performance than random initialization, improving the Recall Rate from 1.44 to 2.03.
Necessity of Word-level Alignment: Word-level alignment achieves a Melody Distance of 2.12, compared with 2.42 for line-level and 3.71 for song-level alignment.
Effectiveness of Structure Awareness: Motif training enhances the repetitiveness and structural sense of melodies, while phrase tokens improve subsection organization capability.

Highlights & Insights¶

The first to utilize an LLM to simultaneously generate both lyrics and melodies, achieving a unified song composition framework.
Scalar pitch initialization uses the numerical relations among pitches to encode pitch semantics, which amounts to a form of music knowledge injection.
The progressive training strategy introduces structural information incrementally from motif to phrase levels, aligning with human compositional cognition.
The custom-built SongCompose dataset (containing 8K lyric-melody pairs with precise word-level alignment) fills a major data gap in this domain.

Limitations & Future Work¶

The pitch range is restricted to C3-B5 (MIDI 48-83), which fails to cover composition requirements for wider vocal ranges.
Because the model is built on InternLM2-7B, its scale limits the complexity and diversity of generated compositions.
The paired dataset only contains 8,000 songs, which remains relatively limited in scale.
Evaluation relies heavily on melody similarity metrics, lacking sufficient assessment of musical creativity and musicality.
Only single-voice (vocal track) generation is supported, lacking the ability to generate polyphonic content such as accompaniment.

Related Work¶

Song Subtasks: SongMASS (Sheng et al., 2021) performs bidirectional lyric↔melody generation; TeleMelody (Ju et al., 2022) performs template-based melody generation.
LLM-based Music Generation: ChatMusician (Yuan et al., 2024) generates symbolic music only, whereas SongComposer extends to joint lyric and melody generation.
Symbolic Representations: REMI (Huang & Yang, 2020) is a beat-based musical representation.
Paired Datasets: M4Singer (Zhang et al., 2022b) provides approximately 700 Chinese songs.

Rating¶

Dimension	Score
Novelty	⭐⭐⭐⭐⭐
Technical Depth	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐
Overall Rating	8/10