# SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation
Conference: ICLR 2026
arXiv: 2602.19976
Code: GitHub
Area: Music Generation
Keywords: Cover song generation, FiLM, element-wise linear modulation, melody control, parameter efficiency
## TL;DR
This paper proposes the SongEcho framework, which achieves cover song generation via Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), generating new vocals and accompaniment while preserving the melodic contour of the original song.
## Background & Motivation
Cover songs are an important part of musical culture, retaining the core melody of the original while infusing new emotional depth and themes. However:
- Cover song generation is underexplored: while melody-guided instrumental generation exists, the simultaneous generation of new vocals and accompaniment for cover songs remains largely unaddressed.
- Limitations of existing conditioning mechanisms:
  - Cross-attention requires additional modeling of temporal alignment, which is indirect and introduces computational redundancy.
  - Element-wise addition exploits temporal correspondence but has limited flexibility (an affine transformation with fixed scaling factors).
- Lack of adaptability in conditional representations: existing methods encode melody conditions independently, ignoring compatibility with the hidden states of the generative model.
## Method
### Task Definition
Cover song generation is reformulated as a conditional generation task: given the melody of an original vocal track and a text prompt, simultaneously generate new vocals and a harmonious accompaniment.
### 1. Element-wise Linear Modulation (EiLM)
Feature-wise Linear Modulation (FiLM) is extended to Element-wise Linear Modulation:

\[
\text{EiLM}(h_i | c) = \gamma_i \odot h_i + \beta_i
\]

where \((\gamma_i, \beta_i) = f_i(c)\), and the modulation parameters exactly match the shape of the hidden states: \(\gamma_i, \beta_i \in \mathbb{R}^{B \times T \times D_i}\).
Difference from FiLM: FiLM operates along the feature dimension, whereas EiLM operates across all dimensions (including time), enabling element-wise modulation that ensures temporally aligned injection of melody.
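To make the contrast concrete, here is a minimal PyTorch sketch of FiLM versus EiLM. The module names and the single linear projection per module are illustrative assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Classic FiLM: one (gamma, beta) pair per feature channel,
    shared across all time steps."""
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, c_global: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D); c_global: (B, cond_dim) pooled/global condition.
        gamma, beta = self.proj(c_global).chunk(2, dim=-1)   # (B, D) each
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)    # broadcast over time

class EiLM(nn.Module):
    """EiLM: gamma/beta are produced per frame and match h's full
    (B, T, D) shape, so the melody is injected time-aligned."""
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D); c: (B, T, cond_dim) frame-level melody features.
        gamma, beta = self.proj(c).chunk(2, dim=-1)          # (B, T, D) each
        return gamma * h + beta                              # element-wise in T and D
```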
### 2. Instance-Adaptive Condition Refinement (IACR)
Core Idea: Conditional features should dynamically adapt based on the hidden states of the generative model.
A gating mechanism (inspired by WaveNet) enables interaction between hidden states and melody conditions, producing instance-adaptive conditional representations.
Theoretical Motivation: A static condition mapping \(f(c)\) assigns the same modulation parameters to every hidden state given a fixed condition, an under-constrained many-to-one mapping. IACR provides the mapping direct access to the hidden state \(h\), turning it into a one-to-one mapping \((h, c) \mapsto (\gamma, \beta)\).
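A minimal sketch of the gated refinement, assuming the hidden state and melody condition share a dimensionality D and interact through concatenation followed by a WaveNet-style tanh/sigmoid gate; the paper's exact wiring may differ:

```python
import torch
import torch.nn as nn

class IACR(nn.Module):
    """Instance-Adaptive Condition Refinement: the melody condition c is
    refined with direct access to the hidden state h, so the conditional
    representation adapts per instance (WaveNet-style gated activation)."""
    def __init__(self, dim: int):
        super().__init__()
        self.filter_proj = nn.Linear(2 * dim, dim)
        self.gate_proj = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h, c: (B, T, D). Concatenation lets the refinement see both streams.
        hc = torch.cat([h, c], dim=-1)
        # tanh "filter" modulated by a sigmoid "gate", as in WaveNet.
        return torch.tanh(self.filter_proj(hc)) * torch.sigmoid(self.gate_proj(hc))
```

Because the refined condition depends on \(h\), two different hidden states paired with the same melody no longer receive identical modulation parameters.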
### 3. SongEcho Framework
- Built upon ACE-Step (a pretrained text-to-song model)
- Pitch extraction: RMVPE (100 Hz frame rate)
- Melody encoder: 1D convolutional layers
- IA-EiLM modules integrated before the FFN layer of each Transformer block
- Zero initialization: \(\text{EiLM-zero}(h_i|c_i) = (\gamma_i + 1) \odot h_i + \beta_i\), ensuring training starts from the original model's behavior (see the sketch after this list)
- Pretrained parameters are frozen; only the IA-EiLM modules and the melody encoder are trained
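A minimal sketch of the zero-initialized modulation head and the parameter-freezing setup; `EiLMZero` and the names in the trailing comments are illustrative assumptions, not ACE-Step's actual API. With \(\gamma = \beta = 0\) at initialization, the module is an exact identity, so the frozen model's behavior is preserved at step 0:

```python
import torch
import torch.nn as nn

class EiLMZero(nn.Module):
    """Zero-initialized EiLM head: computes (gamma + 1) * h + beta, where
    gamma and beta come from zero-initialized projections of the
    IACR-refined condition, so the module starts as an identity map."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(dim, dim)
        self.to_beta = nn.Linear(dim, dim)
        for head in (self.to_gamma, self.to_beta):
            nn.init.zeros_(head.weight)   # gamma = beta = 0 at step 0
            nn.init.zeros_(head.bias)

    def forward(self, h: torch.Tensor, c_adapt: torch.Tensor) -> torch.Tensor:
        # c_adapt: IACR-refined condition, shape (B, T, D), as in the
        # previous sketch; applied before each frozen block's FFN.
        return (self.to_gamma(c_adapt) + 1.0) * h + self.to_beta(c_adapt)

# Hypothetical training setup: freeze the pretrained backbone and train
# only the new modules and the melody encoder.
# for p in base_model.parameters():
#     p.requires_grad_(False)
# trainable = [*melody_encoder.parameters(), *ia_eilm_modules.parameters()]
```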
### 4. Suno70k Dataset
To address the scarcity of full-song datasets, an AI-generated song dataset of 69,469 songs is constructed:
- Filtered from 659K songs on Suno.ai
- Quality assessed across five dimensions using SongEval
- Enriched annotations generated with Qwen2-Audio
## Key Experimental Results
### Baselines
- ACE-Step + SA ControlNet (1.6B trainable parameters)
- ACE-Step + SA ControlNet + LoRA (331M)
- ACE-Step + MuseControlLite (188M)
- SongEcho (49M, approximately 3% of ControlNet parameters)
### Main Results (Suno70k Test Set)
| Method | RPA↑ | RCA↑ | OA↑ | CLAP↑ | FD↓ | KL↓ | PER↓ | Params |
|---|---|---|---|---|---|---|---|---|
| ACE-Step (original) | - | - | - | 0.293 | 73.5 | 0.267 | 0.417 | - |
| +SA ControlNet | 0.621 | 0.644 | 0.686 | 0.288 | 106.0 | 0.202 | 0.371 | 1.6B |
| +MuseControlLite | 0.521 | - | - | - | - | - | - | 188M |
| SongEcho | Best | Best | Best | Best | Best | Best | Best | 49M |
### Ablation Study
| Configuration | RPA↑ | CLAP↑ | FD↓ |
|---|---|---|---|
| EiLM only (w/o IACR) | Lower | Lower | Higher |
| Addition injection only | Lower | Lower | Higher |
| Cross-attention only | Lower | Lower | Higher |
| IA-EiLM (full) | Best | Best | Best |
## Highlights & Insights
- Exceptional parameter efficiency: SongEcho surpasses all baselines with fewer than 3% of ControlNet's parameters.
- Unified conditioning paradigm: EiLM combines the advantages of both additive and attention-based approaches.
- Well-motivated IACR: Clear theoretical analysis framing the problem as a transition from under-constrained to one-to-one mapping.
- Release of Suno70k: A high-quality open-source song dataset.
## Limitations & Future Work
- Training on AI-generated songs leaves generalization to real-world recordings insufficiently evaluated.
- The definition of cover songs is narrow (global style transfer + melody preservation), without support for locally customized adaptations.
- Generation is constrained by the base model ACE-Step's 4-minute limit.
- Melody control relies on pitch sequences, without considering richer musical control dimensions such as rhythmic variation.
## Related Work & Insights
- Text-to-song: Jukebox, Suno, DiffRhythm, ACE-Step
- Singing voice synthesis/conversion: SVS and SVC series
- Controllable music generation: ControlNet, MuseControlLite
- Conditional normalization: FiLM, AdaIN, TFiLM
## Rating
- Novelty: ⭐⭐⭐⭐ — The EiLM+IACR combination is novel, with well-grounded theoretical motivation for IACR.
- Practicality: ⭐⭐⭐⭐ — Parameter-efficient with high output quality; strong practical applicability.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset evaluation with thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with well-articulated motivation.