MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

Conference: ICML 2025
arXiv: 2506.18729
Code: https://MuseControlLite.github.io/web/
Area: Diffusion Models
Keywords: music generation, controllable generation, diffusion transformer, decoupled cross-attention, positional embedding

TL;DR

This work proposes MuseControlLite, which introduces Rotary Position Embedding (RoPE) into decoupled cross-attention layers. This enables precise time-varying conditional control for text-to-music generation with only 85M trainable parameters (6.75x fewer than ControlNet), and it is the first to support both music attribute control and audio inpainting/outpainting in a single framework.

Background & Motivation

Background: Text-to-music generation models have made significant progress, but there is an increasing user demand for fine-grained control beyond text prompts—specifically, the ability to precisely manipulate time-varying music attributes such as melody, rhythm, and loudness dynamics. Methods like Music ControlNet, JASCO, and DITTO have begun to explore this direction.

Limitations of Prior Work: Current mainstream methods suffer from two core limitations. First, ControlNet-like methods require copying half of the diffusion model as a trainable replica, which involves a massive number of parameters (e.g., Stable Audio Open ControlNet requires 572M trainable parameters), leading to high training and inference costs. Second, existing fine-tuning methods either support only music attribute control (text+attribute) or only audio-conditional control (text+audio), failing to handle both types of conditions simultaneously.

Key Challenge: Parameter efficiency and control precision pull in opposite directions: lightweight fine-tuning often sacrifices control capability, while high-precision control requires many extra parameters. A deeper issue is that existing methods neglect the positional information of time-varying conditions, which makes it hard for the model to align the conditional signals with the temporal positions of the generated content.

Goal: (1) How to achieve high-precision control over time-varying music attributes with minimal trainable parameters? (2) How to unify music attribute control and audio inpainting/outpainting capabilities within a single framework?

Key Insight: The authors observe that text-to-music models rarely use positional embeddings when processing text conditions, whereas time-varying conditions are functions of time, making positional information critical. The decoupled cross-attention mechanism based on IP-Adapter inherently has very few parameters, and introducing appropriate positional embeddings can significantly improve the control precision of time-varying conditions.

Core Idea: Incorporate Rotary Position Embedding (RoPE) into the decoupled cross-attention layers, enabling the lightweight adapter to perceive the temporal positions of time-varying conditions. This achieves higher melody accuracy than ControlNet with roughly 1/7 of its trainable parameters (85M vs. 572M).

Method

Overall Architecture

MuseControlLite is fine-tuned on top of Stable Audio Open (a 24-layer diffusion Transformer). The input includes text prompts, time-varying music attribute conditions (melody/rhythm/loudness), and/or audio reference conditions. The original model parameters are frozen, and only the newly added \(W'^k\) and \(W'^v\) matrices in the decoupled cross-attention layers, the feature extractors, and the zero-initialized 1D convolutional layers are trained. Overall, the newly added parameters account for only 8% of the backbone network (85M vs 1.3B). The music attribute adapter and the audio condition adapter are trained separately and can be used individually or jointly during inference.

Key Designs

Decoupled Cross-Attention with RoPE:
- Function: New cross-attention layers are added to handle time-varying conditions without modifying the original text cross-attention layers.
- Mechanism: The decoupled cross-attention of IP-Adapter is transferred from the image domain to the music domain. The key innovation is applying Rotary Position Embedding (RoPE) to the query, key, and value vectors of the new attention branch. The attribute cross-attention output \(x_{\text{attr}}\) is passed through a zero-initialized 1D convolutional layer and added to the text cross-attention output \(x_{\text{text}}\): \(x = x_{\text{text}} + Z_{\text{CNN}}(x_{\text{attr}})\), so the layer reproduces the pretrained text branch at initialization. Only \(W'^k\) and \(W'^v\) (initialized from the pretrained \(W^k\) and \(W^v\)) are trained in the attention layers, while the other attention parameters stay frozen.
- Design Motivation: In the RoPE ablation, melody accuracy is only 10.7% without RoPE and rises to 58.6% with it. This indicates that positional encoding is crucial for learning time-varying conditions: the model needs the positions of the condition signals on the timeline to map melody, rhythm, and loudness to the corresponding time intervals.
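The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation. It makes several assumptions: the zero-initialized 1D convolution is modeled as a zero matrix `W_zero` (a kernel-size-1 convolution) applied to the attribute branch only, the attribute condition has the same channel width `d` as the hidden states, and all names (`rope`, `decoupled_cross_attention`, the keys of `p`) are hypothetical.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate feature pairs of x (length, dim) by position-dependent angles."""
    length, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)             # one frequency per pair
    angles = np.arange(length)[:, None] * freqs[None, :]  # (length, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def attention(q, k, v):
    """Plain scaled dot-product attention for 2-D arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(h, c_text, c_attr, p):
    """h: (M, d) hidden states, c_text: (N, d) text tokens, c_attr: (M, d) attributes."""
    q = h @ p["Wq"]                                            # frozen query projection
    x_text = attention(q, c_text @ p["Wk"], c_text @ p["Wv"])  # frozen text branch, no RoPE
    k_attr = c_attr @ p["Wk_attr"]                             # trainable W'^k
    v_attr = c_attr @ p["Wv_attr"]                             # trainable W'^v
    x_attr = attention(rope(q), rope(k_attr), rope(v_attr))    # RoPE on q, k, v as described
    return x_text + x_attr @ p["W_zero"]                       # assumed zero-init output layer

rng = np.random.default_rng(0)
M, N, d = 6, 4, 8
p = {name: rng.normal(size=(d, d)) for name in ("Wq", "Wk", "Wv")}
p["Wk_attr"], p["Wv_attr"] = p["Wk"].copy(), p["Wv"].copy()  # initialized from W^k, W^v
p["W_zero"] = np.zeros((d, d))                               # stand-in for the zero-init conv
h = rng.normal(size=(M, d))
c_text = rng.normal(size=(N, d))
c_attr = rng.normal(size=(M, d))
out = decoupled_cross_attention(h, c_text, c_attr, p)
```

With `W_zero` at zero, `out` equals the text-branch output, so under this assumption the adapter starts from the behavior of the frozen model and only gradually lets the attribute branch contribute.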
Multi-Condition Feature Extraction and Concatenation:
- Function: Unify different types of music attributes into a conditional representation that can be input into the decoupled cross-attention.
- Mechanism: Melody is extracted with a Constant-Q Transform (CQT, 128 bins), keeping only the 4 most prominent pitches per frame via an argmax-style (top-4) selection, followed by high-pass filtering. Loudness is computed in dB from spectral energy and smoothed with a Savitzky-Golay filter. Rhythm uses a recurrent neural network to estimate beat and downbeat probabilities. Each of the three conditions is expanded to \(C_r/3\) channels by its own 1D CNN, aligned to the query length \(M\) via interpolation, and then concatenated along the channel dimension to form \(c_{\text{attr}} \in \mathbb{R}^{M \times C_r}\). During training, a random 10%-90% of each condition is masked independently, so that the model learns to disentangle the conditions and to improvise over unconditioned segments.
- Design Motivation: Independent random masking lets the model flexibly combine any subset of conditions and supports partial control. For instance, the melody can be specified only for the 10-20 second range while the model generates the preceding and following segments on its own.
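The project, align, concatenate, and mask steps can be illustrated as follows. This is a hedged sketch rather than the paper's pipeline: a per-frame linear projection stands in for each 1D CNN, the mask is assumed to be a single contiguous span (the text only gives the 10%-90% range), and `C_r = 96`, the frame counts, and all function names are hypothetical.

```python
import numpy as np

def resample(x, M):
    """Linearly interpolate x of shape (T, C) to (M, C) along the time axis."""
    src = np.linspace(0.0, 1.0, x.shape[0])
    dst = np.linspace(0.0, 1.0, M)
    return np.stack([np.interp(dst, src, x[:, c]) for c in range(x.shape[1])], axis=1)

def random_span_mask(M, rng, lo=0.1, hi=0.9):
    """0/1 mask of length M with one contiguous zeroed span covering 10%-90% of frames."""
    n = int(round(rng.uniform(lo, hi) * M))
    start = rng.integers(0, M - n + 1)
    mask = np.ones(M)
    mask[start:start + n] = 0.0
    return mask

def build_c_attr(features, projections, M, rng=None):
    """Project each attribute to C_r/3 channels, align to length M, and concatenate."""
    parts = []
    for name in ("melody", "rhythm", "loudness"):
        x = features[name] @ projections[name]         # stand-in for the per-attribute 1D CNN
        x = resample(x, M)                             # align to the query length M
        if rng is not None:                            # training only: independent masking
            x = x * random_span_mask(M, rng)[:, None]
        parts.append(x)
    return np.concatenate(parts, axis=1)               # (M, C_r)

rng = np.random.default_rng(0)
C_r, M = 96, 50
features = {
    "melody": rng.random((200, 128)),   # 128 CQT bins per frame
    "rhythm": rng.random((100, 2)),     # beat and downbeat probabilities
    "loudness": rng.random((150, 1)),   # smoothed loudness in dB
}
projections = {k: rng.normal(size=(v.shape[1], C_r // 3)) for k, v in features.items()}
c_attr = build_c_attr(features, projections, M)             # inference: no masking
c_attr_train = build_c_attr(features, projections, M, rng)  # training: random masking
```

Because each attribute draws its own mask, a training example can contain frames where, say, melody is present but rhythm and loudness are zeroed, which is what allows conditions to be supplied in any combination at inference.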
Audio Conditions for Inpainting/Outpainting:
- Function: Support audio inpainting (reconstructing intermediate segments) and outpainting (continuing later segments), while enabling joint usage with music attribute control.
- Mechanism: The clean latent \(x_0\) encoded by the VAE is used directly as the audio condition \(c_{\text{audio}}\), and an independent set of adapters (\(W''^k\), \(W''^v\)) is trained for it. Since the audio condition contains far more information than the attribute conditions, joint training would cause the model to ignore the attribute conditions; thus, the two adapter sets are trained separately. During inference, a complementary masking strategy is applied to \(c_{\text{audio}}\) and \(c_{\text{attr}}\) so that any given time frame is controlled by only one of the two conditions.
- Design Motivation: Applying random masking to \(c_{\text{audio}}\) during training teaches the model not only to copy reference signals at the same position but also to infer and complete content from distant tokens, thereby generating smooth transitions at the inpainting/outpainting boundaries.
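The complementary masking used at inference can be sketched as below. This is an illustrative assumption about the bookkeeping only: masks are per-frame 0/1 vectors, masked frames are zeroed, and `complementary_masks` and the span layout are hypothetical names and values.

```python
import numpy as np

def complementary_masks(M, audio_spans):
    """Build per-frame masks so that each frame is driven by audio OR attributes.

    audio_spans: list of (start, end) frame ranges where the reference audio is kept.
    """
    audio_mask = np.zeros(M)
    for start, end in audio_spans:
        audio_mask[start:end] = 1.0
    attr_mask = 1.0 - audio_mask          # attributes control exactly the remaining frames
    return audio_mask, attr_mask

M, C = 20, 4
rng = np.random.default_rng(0)
c_audio = rng.normal(size=(M, C))         # stand-in for the VAE latent x_0
c_attr = rng.normal(size=(M, C))          # stand-in for the attribute condition

# Outpainting: keep the first 8 frames of audio, control the rest with attributes.
out_audio, out_attr = complementary_masks(M, [(0, 8)])
# Inpainting: keep both ends of the audio, control the middle with attributes.
in_audio, in_attr = complementary_masks(M, [(0, 6), (14, 20)])

c_audio_masked = c_audio * in_audio[:, None]
c_attr_masked = c_attr * in_attr[:, None]
```

Each masked condition would then be fed to its own adapter, so the two separately trained adapter sets never compete for the same frame.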

Loss & Training

The model is trained with a v-prediction diffusion loss: \(\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \|f_\theta(\alpha_t x_0 + \sigma_t \epsilon, t) - v_t\|_2^2\), where \(v_t = \alpha_t \epsilon - \sigma_t x_0\). During training, the text condition is dropped with 30% probability, and each attribute condition is independently dropped with 50% probability. Training uses a batch size of 128, a learning rate of \(10^{-4}\), and a weight decay of \(10^{-2}\) for 40,000 steps on a single RTX 3090. At inference, classifier-free guidance is applied with a separate guidance scale for each condition type: \(\lambda_{\text{text}}=7.0\), \(\lambda_{\text{attr}}=2.0\), and \(\lambda_{\text{audio}}=1.0\).
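The snippet below illustrates the two ingredients of this section under stated assumptions. It uses the standard v-prediction target \(v_t = \alpha_t \epsilon - \sigma_t x_0\) with an assumed cosine schedule (\(\alpha_t^2 + \sigma_t^2 = 1\)), and it combines per-condition guidance in a simple additive form. The summary does not spell out how the three guidance scales are composed, so `multi_cfg` is one plausible reading, not the paper's formula; all function names are hypothetical.

```python
import numpy as np

def alpha_sigma(t):
    """Assumed schedule: alpha_t = cos(pi t / 2), sigma_t = sin(pi t / 2), t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def v_target(x0, eps, t):
    """Return the noisy input x_t and the v-prediction regression target v_t."""
    alpha, sigma = alpha_sigma(t)
    x_t = alpha * x0 + sigma * eps
    v_t = alpha * eps - sigma * x0
    return x_t, v_t

def recover(x_t, v, t):
    """Invert the parameterization: recover (x0, eps) from x_t and v."""
    alpha, sigma = alpha_sigma(t)
    return alpha * x_t - sigma * v, sigma * x_t + alpha * v

def multi_cfg(v_uncond, v_cond, scales):
    """Additive multi-condition guidance (an assumed form, one scale per condition)."""
    out = v_uncond.copy()
    for name, scale in scales.items():
        out = out + scale * (v_cond[name] - v_uncond)
    return out

rng = np.random.default_rng(0)
x0, eps = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
x_t, v_t = v_target(x0, eps, t=0.3)
x0_hat, eps_hat = recover(x_t, v_t, t=0.3)

# Stand-ins for model outputs under each condition; real values come from f_theta.
v_uncond = rng.normal(size=(16, 4))
v_cond = {name: rng.normal(size=(16, 4)) for name in ("text", "attr", "audio")}
guided = multi_cfg(v_uncond, v_cond, {"text": 7.0, "attr": 2.0, "audio": 1.0})
```

The round trip through `recover` shows why the minus sign in the target matters: with \(\alpha_t^2 + \sigma_t^2 = 1\), a perfect prediction of \(v_t\) yields both \(x_0\) and \(\epsilon\) exactly.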

Key Experimental Results

Main Results

Model	Trainable Params	Total Params	Training Data	FD↓	KL↓	CLAP↑	Melody Accuracy↑
MusicGen-Stereo-Large-Melody	3.3B	3.3B	20K hr	193.66	0.436	0.354	43.1%
Stable Audio Open ControlNet	572M	1.9B	2.2K hr	97.73	0.265	0.396	56.6%
MuseControlLite-Melody (ours)	85M	1.4B	1.7K hr	76.42	0.289	0.372	61.1%
MuseControlLite-Attr (ours)	85M	1.4B	1.7K hr	80.79	0.271	0.373	60.6%
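
As a quick sanity check, the headline ratios can be recomputed from the rounded figures in the table above. The variable names are illustrative, and because the parameter counts are rounded, the computed ratio only approximates the quoted 6.75x.

```python
# Trainable parameters and melody accuracy, copied from the main results table.
trainable = {"controlnet": 572e6, "musecontrollite": 85e6}
melody_acc = {"controlnet": 56.6, "musecontrollite": 61.1}

ratio = trainable["controlnet"] / trainable["musecontrollite"]   # about 6.7x fewer
share = trainable["musecontrollite"] / trainable["controlnet"]   # about 15%, roughly 1/7
gain = melody_acc["musecontrollite"] - melody_acc["controlnet"]  # percentage points

print(f"{ratio:.2f}x fewer trainable parameters ({share:.1%} of ControlNet)")
print(f"melody accuracy gain: {gain:.1f} percentage points")
```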

Ablation Study

Condition	Melody Accuracy↑	Rhythm F1↑	Loudness Correlation↑
Text Only	0.09	0.21	0.05
+Melody	0.60	0.76	0.66
+Rhythm	0.09	0.89	0.42
+Loudness	0.09	0.30	0.92
All Attributes	0.61	0.90	0.95

RoPE Ablation (70K steps training):

Setting	FD↓	KL↓	CLAP↑	Melody Accuracy↑
Without RoPE	113.13	0.58	0.41	10.7%
With RoPE	78.50	0.29	0.38	58.6%

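Key Findings

- RoPE is the key to time-varying conditional control: without positional encoding, decoupled cross-attention barely learns the new conditions (10.7% vs. 58.6% melody accuracy).
- With only 85M trainable parameters, MuseControlLite-Melody outperforms the 572M-parameter Stable Audio Open ControlNet on melody accuracy (61.1% vs. 56.6%, a gain of 4.5 percentage points) and on FD (76.42 vs. 97.73), while ControlNet keeps slightly better KL and CLAP scores.
- Independent control of each condition works well: with all attributes supplied, melody accuracy (0.61), rhythm F1 (0.90), and loudness correlation (0.95) are all at least as high as when each attribute is supplied alone.
- The model also performs well in style-transfer scenarios, where the text prompt and the attributes originate from different audio clips.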

Highlights & Insights

- The core finding is simple: adding positional encodings to decoupled cross-attention yields higher melody accuracy than ControlNet with roughly 1/7 of its trainable parameters. This offers an efficient fine-tuning recipe for conditional control in the audio domain.
- It is the first to unify music attribute control and audio inpainting/outpainting, using a complementary masking strategy.
- Separate classifier-free guidance scales per condition help keep the model from over-relying on the extra conditions at the expense of text semantics.

Limitations & Future Work

- The model is trained and evaluated solely on instrumental data and does not cover vocal control, which is also critical in real-world music creation.
- Audio and attribute conditions must be complementarily masked and cannot overlap, which rules out more flexible joint control scenarios.
- The generalization of RoPE across audio lengths is not thoroughly explored, so its extrapolation behavior beyond the training length is unknown.
- Evaluation relies mainly on objective metrics (FD, KL, CLAP, melody accuracy) and lacks a large-scale subjective listening study.
- Fine-tuning is based only on Stable Audio Open, and adaptability to autoregressive architectures such as MusicGen remains unexplored.
- The training data is limited in scale (1.7K hr), a clear gap compared with models trained on 20K+ hr.
- The CLAP score drops (0.28 vs 0.42) when multiple attribute conditions are used jointly, indicating a trade-off between conditional control and text semantics.

- The successful transfer of IP-Adapter (decoupled cross-attention) from images to audio suggests that parameter-efficient fine-tuning methods from the vision domain can also be effective in the audio domain.
- While RoPE is widely used in LLMs, it has received little attention in the conditioning pathways of generative models. This work highlights the critical role of positional encoding in controlling time-varying conditions.
- It offers a lightweight reference design for time-varying conditional control in other modalities (such as video and 3D).

Rating

⭐⭐⭐⭐ The core idea is simple and efficient (adding RoPE to decoupled cross-attention), and the experiments cover melody, rhythm, and loudness control as well as audio inpainting/outpainting. The approach is markedly more parameter-efficient than ControlNet (85M vs. 572M trainable parameters), and the ablation studies support the main design choices. However, the application scenarios are limited to instrumental music generation, and stability in broader audio control scenarios (vocals, sound effects) remains to be validated. Overall, it is a solid and practical contribution to controllable music generation.