
SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

Conference: ICLR 2026
arXiv: 2602.19976
Code: GitHub
Area: Music Generation
Keywords: Cover song generation, FiLM, element-wise linear modulation, melody control, parameter efficiency

TL;DR

This paper proposes the SongEcho framework, which achieves cover song generation via Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), generating new vocals and accompaniment while preserving the melodic contour of the original song.

Background & Motivation

Cover songs are an important part of musical culture, retaining the core melody of the original while infusing new emotional depth and themes. However:

Cover song generation is underexplored: While melody-guided instrumental generation exists, simultaneous generation of new vocals and accompaniment for cover songs remains largely unaddressed.

Limitations of existing conditioning mechanisms:

  • Cross-attention requires additional modeling of temporal alignment, which is indirect and introduces computational redundancy.
  • Element-wise addition exploits temporal correspondence but has limited flexibility (an affine transformation with fixed scaling factors).

Lack of adaptability in conditional representations: Existing methods encode melody conditions independently, ignoring compatibility with the hidden states of the generative model.

Method

Task Definition

Cover song generation is reformulated as a conditional generation task: given the melody of an original vocal track and a text prompt, simultaneously generate new vocals and a harmonious accompaniment.

1. Element-wise Linear Modulation (EiLM)

Feature-wise Linear Modulation (FiLM) is extended to Element-wise Linear Modulation:

\[h_i^m = \text{EiLM}(h_i | c) = \gamma_i \odot h_i + \beta_i\]

where \((\gamma_i, \beta_i) = f_i(c)\), and the modulation parameters precisely match the shape of the hidden states \(\gamma_i, \beta_i \in \mathbb{R}^{B \times T \times D_i}\).

Difference from FiLM: FiLM operates along the feature dimension, whereas EiLM operates across all dimensions (including time), enabling element-wise modulation that ensures temporally aligned injection of melody.
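To make the distinction concrete, here is a minimal PyTorch sketch of EiLM (not the authors' implementation); the linear mapping used for \(f_i\) and the dimension names are assumptions:

```python
import torch
import torch.nn as nn

class EiLM(nn.Module):
    """Element-wise Linear Modulation (sketch): gamma and beta share the full
    (B, T, D) shape of the hidden states, unlike FiLM's per-feature (B, D)."""

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        # f_i(c): maps the time-aligned condition to an element-wise scale and shift.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D) hidden states; c: (B, T, cond_dim) condition, aligned with h in time.
        gamma, beta = self.to_gamma_beta(c).chunk(2, dim=-1)  # each (B, T, D)
        return gamma * h + beta
```

Because c is aligned with h along the time axis, gamma and beta vary per time step as well as per channel, which is what makes the melody injection temporally aligned.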

2. Instance-Adaptive Condition Refinement (IACR)

Core Idea: Conditional features should dynamically adapt based on the hidden states of the generative model.

\[h'_i = L_{h_i}(h_i), \quad m'_i = L_{m_i}(m), \quad c_i = \tanh(h'_i) \odot \tanh(m'_i)\]

A gating mechanism (inspired by WaveNet) enables interaction between hidden states and melody conditions, producing instance-adaptive conditional representations.

Theoretical Motivation: Static condition mapping suffers from an under-constrained many-to-one mapping problem. IACR transforms this into a one-to-one mapping by providing direct access to the hidden state \(h\).
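A corresponding sketch of the gating interaction (the exact layer shapes are assumptions, not the released code); feeding the resulting \(c_i\) into the EiLM module above yields IA-EiLM:

```python
import torch
import torch.nn as nn

class IACR(nn.Module):
    """Instance-Adaptive Condition Refinement (sketch): the melody condition m
    is gated by the current hidden state h, so the injected condition depends
    on both the melody and the instance being generated."""

    def __init__(self, hidden_dim: int, melody_dim: int, cond_dim: int):
        super().__init__()
        self.proj_h = nn.Linear(hidden_dim, cond_dim)  # L_{h_i}
        self.proj_m = nn.Linear(melody_dim, cond_dim)  # L_{m_i}

    def forward(self, h: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # h: (B, T, hidden_dim), m: (B, T, melody_dim), time-aligned.
        return torch.tanh(self.proj_h(h)) * torch.tanh(self.proj_m(m))
```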

3. SongEcho Framework

  • Built upon ACE-Step (a text-to-song model)
  • Pitch extraction: RMVPE (100 Hz)
  • Melody encoder: 1D convolutional layers
  • IA-EiLM modules integrated before the FFN layer of each Transformer block
  • Zero initialization: \(\text{EiLM-zero}(h_i|c_i) = (\gamma_i + 1) \odot h_i + \beta_i\), with \(\gamma_i, \beta_i\) initialized to zero so that training starts from the original model's behavior (see the sketch after this list)
  • Pretrained parameters are frozen; only IA-EiLM and the melody encoder are trained
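Putting the pieces together, a hedged sketch of one IA-EiLM module with zero initialization (layer names and widths are assumptions; the melody encoder and Transformer plumbing are omitted):

```python
import torch
import torch.nn as nn

class IAEiLMZero(nn.Module):
    """IA-EiLM with zero initialization (sketch): at the start of training
    gamma = beta = 0, so (gamma + 1) * h + beta reduces to h and the frozen
    backbone behaves exactly as before fine-tuning."""

    def __init__(self, hidden_dim: int, melody_dim: int):
        super().__init__()
        # IACR gating (instance-adaptive condition refinement).
        self.proj_h = nn.Linear(hidden_dim, hidden_dim)
        self.proj_m = nn.Linear(melody_dim, hidden_dim)
        # EiLM head producing element-wise gamma and beta.
        self.to_gamma_beta = nn.Linear(hidden_dim, 2 * hidden_dim)
        nn.init.zeros_(self.to_gamma_beta.weight)  # zero init -> identity at step 0
        nn.init.zeros_(self.to_gamma_beta.bias)

    def forward(self, h: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        c = torch.tanh(self.proj_h(h)) * torch.tanh(self.proj_m(m))  # IACR
        gamma, beta = self.to_gamma_beta(c).chunk(2, dim=-1)
        return (gamma + 1.0) * h + beta  # EiLM-zero
```

In the full model a module like this would sit before the FFN of each Transformer block and, together with the melody encoder, be the only trainable part.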

4. Suno70k Dataset

To address the scarcity of full-song datasets, an AI-generated song dataset of 69,469 songs is constructed:

  • Filtered from 659K songs on Suno.ai
  • Quality assessed across five dimensions using SongEval
  • Enriched annotations generated by Qwen2-Audio
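As a purely illustrative sketch of the quality-filtering step (the dimension names, score scale, threshold, and field names below are assumptions, not values from the paper):

```python
from typing import Dict, List

# Placeholder names for SongEval's five quality dimensions and a hypothetical
# cutoff; the actual dimensions and threshold used for Suno70k are not reproduced here.
QUALITY_DIMENSIONS = ["dim_1", "dim_2", "dim_3", "dim_4", "dim_5"]
THRESHOLD = 3.5  # hypothetical cutoff on a 1-5 scale

def filter_songs(songs: List[Dict]) -> List[Dict]:
    """Keep a song only if every SongEval dimension meets the threshold."""
    kept = []
    for song in songs:
        scores: Dict[str, float] = song["songeval_scores"]  # assumed field name
        if all(scores.get(dim, 0.0) >= THRESHOLD for dim in QUALITY_DIMENSIONS):
            kept.append(song)
    return kept
```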

Key Experimental Results

Baselines

  • ACE-Step + SA ControlNet (1.6B trainable parameters)
  • ACE-Step + SA ControlNet + LoRA (331M)
  • ACE-Step + MuseControlLite (188M)
  • SongEcho (49M, approximately 3% of ControlNet parameters)

Main Results (Suno70k Test Set)

| Method | RPA↑ | RCA↑ | OA↑ | CLAP↑ | FD↓ | KL↓ | PER↓ | Params |
|---|---|---|---|---|---|---|---|---|
| ACE-Step (original) | - | - | - | 0.293 | 73.5 | 0.267 | 0.417 | - |
| +SA ControlNet | 0.621 | 0.644 | 0.686 | 0.288 | 106.0 | 0.202 | 0.371 | 1.6B |
| +MuseControlLite | 0.521 | - | - | - | - | - | - | 188M |
| SongEcho | Best | Best | Best | Best | Best | Best | Best | 49M |

Ablation Study

| Configuration | RPA | CLAP | FD |
|---|---|---|---|
| EiLM only (w/o IACR) | Lower | Lower | Higher |
| Addition injection only | Lower | Lower | Higher |
| Cross-attention only | Lower | Lower | Higher |
| IA-EiLM (full) | Best | Best | Best |

Highlights & Insights

  1. Exceptional parameter efficiency: SongEcho surpasses all baselines with roughly 3% of ControlNet's parameters.
  2. Unified conditioning paradigm: EiLM combines the advantages of both additive and attention-based approaches.
  3. Well-motivated IACR: Clear theoretical analysis framing the problem as a transition from under-constrained to one-to-one mapping.
  4. Release of Suno70k: A high-quality open-source song dataset.

Limitations & Future Work

  1. Training on AI-generated songs leaves generalization to real-world recordings insufficiently evaluated.
  2. The definition of cover songs is narrow (global style transfer + melody preservation), without support for locally customized adaptations.
  3. Generation is constrained by the base model ACE-Step's 4-minute limit.
  4. Melody control relies on pitch sequences, without considering richer musical control dimensions such as rhythmic variation.

Related Work

  • Text-to-song: Jukebox, Suno, DiffRhythm, ACE-Step
  • Singing voice synthesis/conversion: SVS and SVC series
  • Controllable music generation: ControlNet, MuseControlLite
  • Conditional normalization: FiLM, AdaIN, TFiLM

Rating

  • Novelty: ⭐⭐⭐⭐ — The EiLM+IACR combination is novel, with well-grounded theoretical motivation for IACR.
  • Practicality: ⭐⭐⭐⭐ — Parameter-efficient with high output quality; strong practical applicability.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset evaluation with thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with well-articulated motivation.