SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation¶

Conference: ICLR 2026
arXiv: 2602.19976
Code: GitHub
Area: Music Generation
Keywords: Cover song generation, FiLM, element-wise linear modulation, melody control, parameter efficiency

TL;DR¶

This paper proposes the SongEcho framework, which achieves cover song generation via Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), generating new vocals and accompaniment while preserving the melodic contour of the original song.

Background & Motivation¶

Cover songs are an important part of musical culture, retaining the core melody of the original while infusing new emotional depth and themes. However:

Cover song generation is underexplored: While melody-guided instrumental generation exists, simultaneous generation of new vocals and accompaniment for cover songs remains largely unaddressed.

Limitations of existing conditioning mechanisms:
- Cross-attention requires additional modeling of temporal alignment, which is indirect and introduces computational redundancy.
- Element-wise addition exploits temporal correspondence but has limited flexibility (affine transformation with fixed scaling factors).

Lack of adaptability in conditional representations: Existing methods encode melody conditions independently, ignoring compatibility with the hidden states of the generative model.

Method¶

Task Definition¶

Cover song generation is reformulated as a conditional generation task: given the melody of an original vocal track and a text prompt, simultaneously generate new vocals and a harmonious accompaniment.

1. Element-wise Linear Modulation (EiLM)¶

Feature-wise Linear Modulation (FiLM) is extended to Element-wise Linear Modulation:

\[h_i^m = \text{EiLM}(h_i | c) = \gamma_i \odot h_i + \beta_i\]

where $(\gamma_i, \beta_i) = f_i(c)$ are predicted from the condition $c$, and the modulation parameters have exactly the same shape as the hidden states (batch, time, feature): $\gamma_i, \beta_i \in \mathbb{R}^{B \times T \times D_i}$.

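Difference from FiLM: FiLM predicts one scale and one shift per feature channel and shares them across all positions, whereas EiLM predicts a separate scale and shift for every element of the hidden state, including every time step. This element-wise modulation keeps the injected melody temporally aligned with the hidden states.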
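The contrast can be made concrete with a toy example. The sketch below is illustrative only: it drops the batch dimension, uses plain Python lists in place of tensors, and hand-picks the modulation parameters instead of predicting them with $f_i(c)$. The function names `film` and `eilm` are chosen here for illustration and do not come from the paper's code.

```python
# Illustrative sketch only: toy shapes, batch dimension omitted.
# FiLM: one (gamma, beta) pair per feature channel, shared by every time step.
# EiLM: one (gamma, beta) pair per element, so each time step is modulated separately.

def film(h, gamma, beta):
    """h: [T][D] hidden states; gamma, beta: [D] per-feature parameters."""
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in h]


def eilm(h, gamma, beta):
    """h, gamma, beta: all [T][D], matching the hidden-state shape."""
    return [
        [g * x + b for x, g, b in zip(h_row, g_row, b_row)]
        for h_row, g_row, b_row in zip(h, gamma, beta)
    ]


h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # T=3 frames, D=2 features

# FiLM applies the same scale and shift at every frame.
film_out = film(h, gamma=[2.0, 0.5], beta=[0.0, 1.0])

# EiLM can treat each frame differently, e.g. to follow a per-frame melody signal.
eilm_gamma = [[1.0, 1.0], [2.0, 0.5], [0.0, 0.0]]
eilm_beta = [[0.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
eilm_out = eilm(h, eilm_gamma, eilm_beta)
```

In this toy setup the first frame passes through EiLM unchanged while the last frame is overwritten entirely by its shift, which a single per-feature FiLM pair cannot express.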

2. Instance-Adaptive Condition Refinement (IACR)¶

Core Idea: Conditional features should dynamically adapt based on the hidden states of the generative model.

\[h'_i = L_{h_i}(h_i), \quad m'_i = L_{m_i}(m)\]

\[c_i = \tanh(h'_i) \odot \tanh(m'_i)\]

A gating mechanism (inspired by WaveNet) enables interaction between hidden states and melody conditions, producing instance-adaptive conditional representations.

Theoretical Motivation: Static condition mapping suffers from an under-constrained many-to-one mapping problem. IACR transforms this into a one-to-one mapping by providing direct access to the hidden state $h$.
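A minimal sketch of the gated refinement $c_i = \tanh(h'_i) \odot \tanh(m'_i)$ follows, assuming that $L_{h_i}$ and $L_{m_i}$ are per-frame linear projections. The weights, shapes, and names (`linear`, `refine_condition`, `w_h`, `w_m`) are hypothetical and chosen only to show that the same melody features yield different conditions for different hidden states.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one frame; weight is [D_out][D_in]."""
    return [sum(w * v for w, v in zip(w_row, x)) + b for w_row, b in zip(weight, bias)]


def refine_condition(h, m, w_h, b_h, w_m, b_m):
    """Return c[t] = tanh(L_h(h[t])) * tanh(L_m(m[t])) for every frame t."""
    refined = []
    for h_t, m_t in zip(h, m):
        h_proj = linear(h_t, w_h, b_h)
        m_proj = linear(m_t, w_m, b_m)
        refined.append([math.tanh(a) * math.tanh(b) for a, b in zip(h_proj, m_proj)])
    return refined


# Hypothetical toy weights: hidden size 2, melody feature size 1, condition size 2.
w_h, b_h = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_m, b_m = [[1.0], [-1.0]], [0.0, 0.0]

m = [[0.5], [0.5]]                 # the same melody features for both runs
h_a = [[1.0, 1.0], [1.0, 1.0]]     # hidden states of one instance
h_b = [[-1.0, 2.0], [0.0, 0.0]]    # hidden states of another instance

c_a = refine_condition(h_a, m, w_h, b_h, w_m, b_m)
c_b = refine_condition(h_b, m, w_h, b_h, w_m, b_m)
```

Because both factors pass through `tanh`, every entry of the refined condition stays strictly between -1 and 1, and a frame whose hidden projection is zero receives a zero condition regardless of the melody.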

3. SongEcho Framework¶

Built upon ACE-Step (a text-to-song model)
Pitch extraction: RMVPE (100 Hz)
Melody encoder: 1D convolutional layers
IA-EiLM modules integrated before the FFN layer of each Transformer block
Zero initialization: $\text{EiLM-zero}(h_i|c_i) = (\gamma_i + 1) \odot h_i + \beta_i$; when $\gamma_i = \beta_i = 0$ at initialization the module reduces to the identity, ensuring training starts from the original model behavior
Pretrained parameters are frozen; only IA-EiLM and the melody encoder are trained
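The effect of the zero-initialized form can be checked with a small sketch. It assumes that the layer producing $\gamma_i$ and $\beta_i$ outputs zeros at the start of training, which is what the EiLM-zero formula implies but is not spelled out in this summary; `eilm_zero` is an illustrative name, and the batch dimension is omitted.

```python
def eilm_zero(h, gamma, beta):
    """Compute (gamma + 1) * h + beta element-wise over [T][D] lists."""
    return [
        [(g + 1.0) * x + b for x, g, b in zip(h_row, g_row, b_row)]
        for h_row, g_row, b_row in zip(h, gamma, beta)
    ]


h = [[0.3, -1.2], [2.0, 0.7]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With zero modulation parameters the module is an identity map,
# so the frozen base model's behavior is untouched at step 0.
identity_out = eilm_zero(h, zeros, zeros)

# Once training moves the parameters away from zero, the output departs from h.
trained_out = eilm_zero(h, [[0.5, 0.0], [0.0, -1.0]], [[0.0, 1.0], [0.0, 0.0]])
```

Without the `+ 1` offset, zero parameters would erase the hidden states instead of preserving them, which is why the offset matters when the pretrained weights are frozen.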

4. Suno70k Dataset¶

To address the scarcity of full-song datasets, an AI-generated song dataset of 69,469 songs is constructed:
- Filtered from 659K songs on Suno.ai
- Quality assessed across five dimensions using SongEval
- Enriched annotations generated by Qwen2-Audio

Key Experimental Results¶

Baselines¶

ACE-Step + SA ControlNet (1.6B trainable parameters)
ACE-Step + SA ControlNet + LoRA (331M)
ACE-Step + MuseControlLite (188M)
SongEcho (49M, approximately 3% of ControlNet parameters)
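The parameter-efficiency figures can be verified with a quick calculation. The snippet below uses only the trainable-parameter counts listed above, rounded as given (1.6B is taken as 1600M).

```python
# Trainable-parameter counts as listed above, in millions.
trainable_params_m = {
    "SA ControlNet": 1600,
    "SA ControlNet + LoRA": 331,
    "MuseControlLite": 188,
    "SongEcho": 49,
}

songecho = trainable_params_m["SongEcho"]
ratios = {
    name: songecho / count
    for name, count in trainable_params_m.items()
    if name != "SongEcho"
}

for name, ratio in ratios.items():
    print(f"SongEcho / {name}: {ratio:.1%}")
```

With these rounded counts, SongEcho trains about 3.1% of the parameters of SA ControlNet, 14.8% of the LoRA variant, and 26.1% of MuseControlLite.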

Main Results (Suno70k Test Set)¶

Method	RPA↑	RCA↑	OA↑	CLAP↑	FD↓	KL↓	PER↓	Params
ACE-Step (original)	-	-	-	0.293	73.5	0.267	0.417	-
+SA ControlNet	0.621	0.644	0.686	0.288	106.0	0.202	0.371	1.6B
+MuseControlLite	0.521	-	-	-	-	-	-	188M
SongEcho	Best	Best	Best	Best	Best	Best	Best	49M
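The melody metrics in the table are not defined in this summary. Assuming RPA and RCA follow the standard melody-extraction definitions of raw pitch accuracy and raw chroma accuracy with a 50-cent tolerance, they can be sketched as below. This is a simplified illustration, not the paper's evaluation code: the function names, the 0-means-unvoiced convention, and the toy pitch tracks are assumptions.

```python
import math


def cents_between(f_est, f_ref):
    """Signed pitch difference in cents between two frequencies in Hz."""
    return 1200.0 * math.log2(f_est / f_ref)


def raw_pitch_accuracy(ref_f0, est_f0, tolerance_cents=50.0):
    """Fraction of reference-voiced frames whose estimate lies within the tolerance."""
    voiced = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0]
    hits = sum(
        1 for r, e in voiced
        if e > 0 and abs(cents_between(e, r)) < tolerance_cents
    )
    return hits / len(voiced)


def raw_chroma_accuracy(ref_f0, est_f0, tolerance_cents=50.0):
    """Like RPA, but octave errors are forgiven by folding the difference."""
    voiced = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0]
    hits = 0
    for r, e in voiced:
        if e <= 0:
            continue
        diff = cents_between(e, r)
        folded = diff - 1200.0 * math.floor(diff / 1200.0 + 0.5)
        if abs(folded) < tolerance_cents:
            hits += 1
    return hits / len(voiced)


# Toy pitch tracks in Hz; 0.0 marks an unvoiced frame (assumed convention).
ref = [220.0, 220.0, 0.0, 440.0]
est = [221.0, 440.0, 0.0, 300.0]

rpa = raw_pitch_accuracy(ref, est)
rca = raw_chroma_accuracy(ref, est)
```

In this toy example the second frame is an exact octave error, so it counts as a miss for RPA but as a hit for RCA, giving an RPA of 1/3 and an RCA of 2/3.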

Ablation Study¶

Configuration	RPA	CLAP	FD
EiLM only (w/o IACR)	Lower	Lower	Higher
Addition injection only	Lower	Lower	Higher
Cross-attention only	Lower	Lower	Higher
IA-EiLM (full)	Best	Best	Best

Highlights & Insights¶

Exceptional parameter efficiency: SongEcho surpasses all baselines with about 3% of SA ControlNet's trainable parameters (49M vs. 1.6B).
Unified conditioning paradigm: EiLM combines the advantages of both additive and attention-based approaches.
Well-motivated IACR: Clear theoretical analysis framing the problem as a transition from under-constrained to one-to-one mapping.
Release of Suno70k: A high-quality open-source song dataset.

Limitations & Future Work¶

Training on AI-generated songs leaves generalization to real-world recordings insufficiently evaluated.
The definition of cover songs is narrow (global style transfer + melody preservation), without support for locally customized adaptations.
Generation is constrained by the base model ACE-Step's 4-minute limit.
Melody control relies on pitch sequences, without considering richer musical control dimensions such as rhythmic variation.

Related Work¶

Text-to-song: Jukebox, Suno, DiffRhythm, ACE-Step
Singing voice synthesis/conversion: SVS and SVC series
Controllable music generation: ControlNet, MuseControlLite
Conditional normalization: FiLM, AdaIN, TFiLM

Rating¶

Novelty: ⭐⭐⭐⭐ — The EiLM+IACR combination is novel, with well-grounded theoretical motivation for IACR.
Practicality: ⭐⭐⭐⭐ — Parameter-efficient with high output quality; strong practical applicability.
Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset evaluation with thorough ablation studies.
Writing Quality: ⭐⭐⭐⭐ — Clear structure with well-articulated motivation.