SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition

Conference: ACL 2025
arXiv: 2402.17645
Code: Project Page
Area: LLM/NLP
Keywords: Song composition, Large Language Model, Melody generation, Lyric generation, Symbolic music representation

TL;DR

SongComposer is the first music-specific Large Language Model capable of simultaneously generating lyrics and melodies. Utilizing a word-level aligned tuple format, a music-knowledge-based scalar pitch initialization, and progressive structure-aware training (motif -> independent song -> phrase-level pairing), it comprehensively outperforms GPT-4 on tasks including lyric-to-melody, melody-to-lyric, song continuation, and text-to-song generation.


Background & Motivation

Background: Symbolic song composition aims to generate vocal tracks consisting of lyrics and melodies as symbolic sequences, serving as a core task of song generation. While subtasks such as lyric generation, melody generation, lyric-to-melody, and melody-to-lyric have recently progressed individually, there is a lack of a unified framework capable of handling both lyrics and melodies simultaneously.

Limitations of Prior Work: (1) Traditional models like SongMASS and TeleMelody can only handle a single subtask, failing to cover all composition needs within a single model; (2) Directly using LLMs for song composition faces three major challenges: unexplored alignment methods for lyrics and melodies, difficulty in modeling hierarchical structures (motifs and phrases) of songs, and scarcity of high-quality paired datasets.

Design Motivation: The symbolic representation of songs shares structural similarities with natural language, and the instruction-following capabilities of LLMs allow the integration of multiple subtasks into a single model. However, three key technical challenges must be addressed: symbolic representation design, pitch understanding, and structural modeling.


Method

Overall Architecture

SongComposer is built upon InternLM2-7B and introduces three core innovations: a word-level tuple representation format, scalar pitch initialization, and three-stage progressive training. The training data is derived from the custom-built SongCompose dataset (280K lyric-only + 20K melody-only + 8K paired data).

Key Designs

1. Word-level Tuple Format: The melody is decomposed into three attributes: pitch \(p\), note duration \(d\), and rest duration \(r\). In the paired data, each note is aligned with its corresponding word in the <p>,d,w format. Durations are quantized in units of 1/16 beat:

$$d_k = \phi\left(\frac{\text{bpm}}{60}\,(\text{note-end}_k - \text{note-start}_k) \times 16\right)$$

2. Scalar Pitch Initialization: The embedding of the central pitch <66> is first initialized from a Gaussian distribution, and the remaining pitch embeddings are defined as scalar multiples of this central embedding:

$$\text{emb}(\langle p \rangle) = c_p \cdot \text{emb}(\langle 66 \rangle)$$

where the scaling factors \(c_p\) take the values \([-\ln(e+17), \cdots, -\ln(e), \ln(e), \cdots, \ln(e+17)]\), explicitly encoding the numerical relationships among pitches.

3. Progressive Structure-Aware Training:

  • Stage 1 (Motif-level Melody Training): Highly repetitive short note sequences are extracted as motif data so that the model learns basic repetition patterns.
  • Stage 2 (Independent Full-song Training): The model is trained on the lyric-only and melody-only datasets separately to develop full-song comprehension.
  • Stage 3 (Phrase-level Paired Training): Five phrase-specific tokens (intro/verse/chorus/bridge/outro) are introduced into the paired data, allowing the model to learn the structural sections of songs.
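The duration quantization and scalar pitch initialization described in Key Designs 1 and 2 can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: \(\phi\) is taken to be round-to-nearest, the scaling factors are assumed to map in order onto pitches 48 to 83 (36 pitches, matching the 36 factors listed), and the function names, the embedding dimension, and the standard deviation `std=0.02` are hypothetical.

```python
import math
import random

CENTER_PITCH = 66               # central pitch token <66> from the text
PITCH_MIN, PITCH_MAX = 48, 83   # pitch range quoted under Limitations


def quantize_duration(note_start, note_end, bpm, steps_per_beat=16):
    """Convert a note length in seconds to a count of 1/16-beat units.

    Assumption: phi is round-to-nearest; the summary does not define it.
    """
    beats = bpm / 60.0 * (note_end - note_start)
    return round(beats * steps_per_beat)


def pitch_scale(pitch):
    """Scaling factor c_p for a pitch.

    Assumed mapping: ln(e + k) for the k-th pitch at or above the centre,
    and -ln(e + k) for the k-th pitch below it, so the factors run from
    -ln(e + 17) at pitch 48 to ln(e + 17) at pitch 83.
    """
    if pitch >= CENTER_PITCH:
        return math.log(math.e + (pitch - CENTER_PITCH))
    return -math.log(math.e + (CENTER_PITCH - 1 - pitch))


def init_pitch_embeddings(dim, std=0.02, seed=0):
    """Draw the centre embedding from a Gaussian, then scale it per pitch."""
    rng = random.Random(seed)
    center = [rng.gauss(0.0, std) for _ in range(dim)]
    return {
        p: [pitch_scale(p) * x for x in center]
        for p in range(PITCH_MIN, PITCH_MAX + 1)
    }


# A half-second note at 120 bpm lasts one beat, i.e. 16 units.
print(quantize_duration(0.0, 0.5, bpm=120))  # 16

embeddings = init_pitch_embeddings(dim=4)
print(len(embeddings))  # 36
```

Under this reading every pitch embedding lies on a single line through the origin, so the ordering of pitches is built into the initialization before any training takes place.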

Loss & Training

The standard autoregressive next-token prediction loss is employed to maximize the log-likelihood of tokens given their context.
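For reference, this objective is usually written as follows. The notation is an assumption introduced here, not taken from the paper: \(x_1, \dots, x_T\) is a tokenized training sequence and \(\theta\) denotes the model parameters.

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)$$

Minimizing \(\mathcal{L}\) is equivalent to maximizing the log-likelihood of each token given the tokens that precede it.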


Experiments

Main Results: Lyric-to-Melody and Melody-to-Lyric

| Method | PD(%)↑ | DD(%)↑ | MD↓ | Cosine Dist.↑ | ROUGE-2↑ | BS↑ |
|---|---|---|---|---|---|---|
| SongMASS | 30.34 | 48.98 | 2.95 | 0.568 | 0.204 | 0.532 |
| TeleMelody | 46.81 | 51.77 | 2.60 | - | - | - |
| GPT-3.5 | 31.24 | 38.52 | 3.01 | 0.641 | 0.142 | 0.603 |
| GPT-4 | 36.43 | 42.94 | 2.87 | 0.654 | 0.158 | 0.610 |
| SongComposer | 50.75 | 57.71 | 2.20 | 0.697 | 0.234 | 0.657 |
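To make the margin over GPT-4 concrete, the short sketch below computes the absolute gaps for a few metrics, using only values copied from the results table. The `gap` helper is a hypothetical name used for illustration.

```python
# Scores copied from the main results table (GPT-4 and SongComposer rows).
scores = {
    "GPT-4": {"PD": 36.43, "DD": 42.94, "MD": 2.87, "BS": 0.610},
    "SongComposer": {"PD": 50.75, "DD": 57.71, "MD": 2.20, "BS": 0.657},
}


def gap(metric, ours="SongComposer", baseline="GPT-4"):
    """Absolute difference (ours minus baseline), rounded to 3 decimals."""
    return round(scores[ours][metric] - scores[baseline][metric], 3)


for metric in ("PD", "DD", "MD", "BS"):
    # For MD, lower is better, so a negative gap is an improvement.
    print(metric, gap(metric))
```

The gaps come to roughly +14.3 points on PD, +14.8 points on DD, -0.67 on MD, and +0.047 on BS.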

Ablation Study

| Ablation Dimension | Detailed Results |
|---|---|
| Pitch Initialization | Scalar > Interpolation > Average > Gaussian (MD: 2.33 vs 3.07/3.41/2.90) |
| Motif Repetition Threshold | Optimal RR-MD balance is achieved at a threshold of 10 (RR=2.03, MD=2.33) |
| Alignment Granularity | Word-level > Line-level > Song-level (MD: 2.12 vs 2.42 vs 3.71) |
| Phrase-level Tokens | Continuation performs better with phrase tokens (MD: 2.12 vs 2.58, BS: 0.662 vs 0.612) |

Subjective Evaluation

| Method | Lyric-to-Melody HMY. | Lyric-to-Melody MLC. | Text-to-Song OVL. | Text-to-Song REL. |
|---|---|---|---|---|
| GPT-3.5 | 1.68 | 1.88 | 2.53 | 2.95 |
| GPT-4 | 2.82 | 2.79 | 2.43 | 3.27 |
| SongComposer | 3.82 | 3.76 | 3.41 | 3.88 |

Key Findings

  • Effectiveness of a Unified Framework: A single model comprehensively outperforms GPT-4 and specialized models across all four subtasks.
  • Critical Role of Scalar Initialization: Explicitly encoding numerical relations among pitches achieves significantly better performance than random initialization, improving the Recall Rate from 1.44 to 2.03.
  • Necessity of Word-level Alignment: Word-level alignment yields the lowest Melody Distance (2.12), compared with 2.42 for line-level and 3.71 for song-level alignment.
  • Effectiveness of Structure Awareness: Motif training enhances the repetitiveness and structural sense of melodies, while phrase tokens improve subsection organization capability.

Highlights & Insights

  • The first to utilize an LLM to simultaneously generate both lyrics and melodies, achieving a unified song composition framework.
  • Scalar pitch initialization uses the numerical relations among pitches to encode pitch semantics, which amounts to injecting music knowledge into the embeddings.
  • The progressive training strategy introduces structural information incrementally from motif to phrase levels, aligning with human compositional cognition.
  • The custom-built SongCompose dataset (containing 8K paired samples with precise word-level alignment) fills a major data gap in this domain.

Limitations & Future Work

  • The pitch range is restricted to C3-B5 (MIDI 48-83), which fails to cover composition requirements for wider vocal ranges.
  • Being based on InternLM2-7B, the model scale limits the complexity and diversity of generated compositions.
  • The paired dataset only contains 8,000 songs, which remains relatively limited in scale.
  • Evaluation relies heavily on melody similarity metrics, lacking sufficient assessment of musical creativity and musicality.
  • Only single-voice (vocal track) generation is supported, lacking the ability to generate polyphonic content such as accompaniment.

Related Work

  • Song Subtasks: SongMASS (Sheng et al., 2021) performs bidirectional lyric↔melody generation; TeleMelody (Ju et al., 2022) performs template-based melody generation.
  • LLM-based Music Generation: ChatMusician (Yuan et al., 2024) generates symbolic music only, whereas SongComposer extends to joint lyric and melody generation.
  • Symbolic Representations: REMI (Huang & Yang, 2020) is a beat-based musical representation.
  • Paired Datasets: M4Singer (Zhang et al., 2022b) provides approximately 700 Chinese songs.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Overall Rating | 8/10 |