CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Conference: ICML 2026
arXiv: 2603.00610
Code: Available (GitHub; model weights CMI-RM and datasets CMI-Pref / CMI-Pref-Pseudo are open-sourced)
Area: Audio & Speech / Music Generation Evaluation / Reward Models
Keywords: Music Reward Model, Compositional Multimodal Instruction, Preference Dataset, Inference-time Scaling, RLHF
TL;DR
To address the lack of unified evaluation for modern music generation models that condition on text, lyrics, and reference audio at once, this paper builds a complete ecosystem: CMI-Pref-Pseudo (110k pseudo-labeled pairs), CMI-Pref (4,027 human-labeled pairs), the unified CMI-RewardBench, and CMI-RM, a family of ~30M-parameter reward models that handles every modality combination in a single architecture. The authors report high correlation with human judgment and show that CMI-RM enables "inference-time scaling" via top-k filtering.
Background & Motivation
Background: AIGC music generation (e.g., Suno, Stable Audio, YUE, ACE-Step) has evolved to flexibly receive multimodal conditions—pure text, text + lyrics, text + reference audio, or combinations for style transfer/continuation. However, evaluation capabilities for these outputs lag significantly.
Limitations of Prior Work: Existing evaluation methods are fragmented and narrow. Distribution-level metrics (FAD, MAD, KAD) only compare at the corpus level and cannot provide sample-level signals needed for post-training/filtering. Sample-level MOS predictors (PAM, Audiobox, SongEval) only assess "musicality." Alignment metrics (CLAP, CLaMP3, MuQ-MuLan) almost exclusively cover text-to-audio, neglecting lyrics and audio prompts. Closed-source systems (MusicRL, DRAGON) also lack reproducibility.
Key Challenge: There is a widening gap between model capabilities (flexible combined inputs) and evaluation methods (rigid input assumptions, single-dimension scoring). More fundamentally, data is scarce—large-scale interaction data from recommender systems captures "user-track affinity" (global style preference) rather than "generation alignment" (fine-grained sample-wise comparison for complex multimodal instructions), the latter of which is required for training alignment models.
Goal: Define and measure "compositional alignment": not merely satisfying multiple constraints at once, but having a single model align with human preferences when each input condition (text, lyrics, or reference audio) is optional and varies between requests. Scores and rankings must reliably reflect human judgments of both music quality and instruction following.
Key Insight: Rather than continuing to build narrow specialized scorers, it is better to first fill the missing data foundation and unified benchmark, then train a parameter-efficient reward model with a single architecture for all modality combinations. The authors found that even frontier multimodal LLMs like Gemini-2.5-Pro struggle to exceed 80% agreement with humans on this benchmark, highlighting a real and unsolved capability gap.
Core Idea: Unify music evaluation scenarios under "Compositional Multimodal Instructions (CMI)," supported by preference data, a unified benchmark, and a parameter-efficient two-tower reward model, so that a ~30M model can serve as a human proxy for both musicality and alignment.
Method
Overall Architecture
CMI-RM addresses the following task: given a compositional prompt \(\mathcal{P}=(t,l,a_{\text{ref}})\) (optional text \(t\), optional lyrics \(l\), optional reference audio \(a_{\text{ref}}\)) and an audio clip to evaluate \(a_{\text{eval}}\), output two scalar scores, musicality \(s_{\text{MUS}}\) and alignment \(s_{\text{ALI}}\), that match human judgment. The approach has three components: a preference data foundation (large-scale pseudo-labeled data plus small-scale, high-quality human data), a two-tower architecture with dual-Transformer fusion (frozen MuQ-MuLan encoders, ~30M trainable parameters), and two-stage training (pseudo-label BT pre-training → expert fine-tuning). The resulting RM is evaluated on CMI-RewardBench and used for top-k filtering at inference time.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Compositional Prompt (t, l, a_ref)<br/>+ Audio to Eval a_eval"] --> B["CMI Data Foundation<br/>CMI-Pref-Pseudo 110k + CMI-Pref 4,027"]
B --> C["Two-tower + Dual Transformer Fusion<br/>Frozen MuQ-MuLan Encoders → 30M Trainable"]
C --> D["Two-stage Training<br/>Pseudo-label BT Pre-training → Expert Fine-tuning"]
D --> E["CMI-RewardBench Unified Benchmark<br/>+ top-k Inference-time Scaling"]
E --> F["Output (s_MUS, s_ALI)"]
Key Designs
1. Data Foundation: CMI-Pref-Pseudo + CMI-Pref
Training alignment models is limited by data scale, quality, and modality coverage. The authors address this with two complementary datasets. CMI-Pref-Pseudo draws diverse generations from 12 open-source models and 11 commercial APIs (Suno v3.5–v5, Stable Audio, etc.), with 35.6% of samples involving audio prompts for style transfer or continuation; Qwen3-Omni serves as an LLM judge on two preference dimensions (musicality and alignment), and 110k pairs are retained after consistency filtering. CMI-Pref consists of 4,027 high-quality pairs annotated by 31 experts, each with preferences on both dimensions, a confidence score (1–5), and a rationale. It includes a balanced 500-pair test set (text / text+lyrics / text+audio / text+audio+lyrics) with human agreement of 75.2% (MUS) and 75.0% (ALI). According to the authors, this is the only preference data covering both lyric conditions and audio-to-audio conditions.
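The source does not say how the consistency filter works. One common choice for LLM judges, shown here purely as an assumption, is to query the judge twice with the candidate order swapped and keep a pair only when both verdicts name the same clip. `Pair`, `Judge`, `filter_pairs`, and the toy judge are hypothetical names for illustration, not part of the released code.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple


class Pair(NamedTuple):
    prompt_id: str
    audio_a: str
    audio_b: str


# Hypothetical judge interface: returns "first" or "second" for the clips as presented.
Judge = Callable[[str, str, str], str]


def consistent_label(pair: Pair, judge: Judge) -> Optional[str]:
    """Return "a" or "b" if the verdict survives an order swap, else None."""
    forward = judge(pair.prompt_id, pair.audio_a, pair.audio_b)
    swapped = judge(pair.prompt_id, pair.audio_b, pair.audio_a)
    if forward == "first" and swapped == "second":
        return "a"
    if forward == "second" and swapped == "first":
        return "b"
    return None  # verdict changed with presentation order: treat as unreliable


def filter_pairs(pairs: Iterable[Pair], judge: Judge) -> List[Tuple[Pair, str]]:
    """Keep only pairs whose pseudo-label is order-consistent."""
    kept = []
    for pair in pairs:
        label = consistent_label(pair, judge)
        if label is not None:
            kept.append((pair, label))
    return kept


# Toy judge with pure position bias: it always prefers the clip shown first.
def biased_judge(prompt_id: str, first: str, second: str) -> str:
    return "first"


pairs = [Pair("p1", "clip_a.wav", "clip_b.wav")]
kept = filter_pairs(pairs, biased_judge)  # the verdict flips with order, so the pair is dropped
```

In this sketch the filter would run once per preference dimension (musicality and alignment), since a pair can be consistent on one and not the other.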
2. Parameter-Efficient Fusion Architecture (~30M)
To support inputs in which any modality is optional, CMI-RM uses a two-tower structure. All encoders are frozen and taken from MuQ-MuLan (\(t\) and \(l\) pass through the text encoder, \(a_{\text{ref}}\) and \(a_{\text{eval}}\) through the audio encoder). Missing modalities are replaced with zero tensors. On the prompt side, a 4-layer Prompt Transformer fuses the embeddings: \(\mathbf{h}_{\text{prompt}}=\text{PromptTF}([\mathbf{E}_{t};\mathbf{E}_{l};\mathbf{E}_{a_{\text{ref}}}])\). A single-layer Joint Transformer then models the interaction between the prompt and the evaluated audio: \(\mathbf{h}_{\text{eval}}=\text{JointTF}([\mathbf{h}_{\text{prompt}};\mathbf{E}_{a_{\text{eval}}}])\). Final scores are produced by temporal pooling followed by a lightweight MLP: \((s_{\text{ALI}},s_{\text{MUS}})=\text{MLP}(\text{Pool}(\mathbf{h}_{\text{eval}}))\).
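The zero-filling and pooling steps can be illustrated without any deep-learning library. The sketch below is not the released implementation: the Transformers are omitted, `EMBED_DIM` is a toy value, a missing modality is assumed to become a single zero token, and mean pooling is assumed for \(\text{Pool}\).

```python
from typing import List, Optional

Vector = List[float]
TokenSeq = List[Vector]  # a sequence of embedding vectors (tokens or frames)

EMBED_DIM = 4  # toy size for illustration; the real encoder width is larger


def zeros(length: int = 1, dim: int = EMBED_DIM) -> TokenSeq:
    """Placeholder for a missing modality: an all-zero sequence (length is an assumption)."""
    return [[0.0] * dim for _ in range(length)]


def build_prompt_tokens(
    e_text: Optional[TokenSeq],
    e_lyrics: Optional[TokenSeq],
    e_ref_audio: Optional[TokenSeq],
) -> TokenSeq:
    """Concatenate [E_t; E_l; E_a_ref] along the token axis, zero-filling absent inputs."""
    tokens: TokenSeq = []
    for part in (e_text, e_lyrics, e_ref_audio):
        tokens.extend(part if part is not None else zeros())
    return tokens


def mean_pool(hidden: TokenSeq) -> Vector:
    """Average over the token (time) axis; mean pooling is an assumed choice."""
    n = len(hidden)
    return [sum(tok[d] for tok in hidden) / n for d in range(len(hidden[0]))]


# Text-only prompt: lyrics and reference audio are absent and become zero tokens.
text_only = build_prompt_tokens([[1.0, 2.0, 3.0, 4.0]], None, None)
pooled = mean_pool(text_only)
```

Because every prompt has the same three slots regardless of which inputs are present, the downstream layers never need modality-specific branches.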
3. Two-stage Training: Pseudo BT Pre-training → Expert Fine-tuning
Stage 1 Preference Pre-training uses CMI-Pref-Pseudo (2k steps, batch 48) with Bradley–Terry (BT) modeling for pair-wise preference: \(P(A>B)=\sigma\big(s_\theta(\mathcal{P},A)-s_\theta(\mathcal{P},B)\big)\). Label smoothing (0.2) is applied to mitigate overconfidence from pseudo-labels. Stage 2 Expert Fine-tuning uses CMI-Pref + MusicEval (6,647 samples) with early stopping. Human labels include both pair-wise preferences (BT loss) and scalar ratings \(y\in[1,5]\) (regression loss).
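As a minimal sketch of the Stage 1 objective, the function below computes the Bradley–Terry cross-entropy for one pair with optional label smoothing. How the smoothing value maps to a soft target is an assumption here (target \(1-\varepsilon\) for the preferred clip); the paper may use a different convention, such as \(1-\varepsilon/2\).

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_probability(score_a: float, score_b: float) -> float:
    """P(A > B) = sigmoid(s_A - s_B), computed via log-sigmoid for stability."""
    return math.exp(-softplus(score_b - score_a))


def bt_loss(score_a: float, score_b: float, a_preferred: bool,
            smoothing: float = 0.0) -> float:
    """Cross-entropy between the (smoothed) label and the BT probability."""
    margin = score_a - score_b
    # Assumed smoothing convention: the preferred side gets target 1 - smoothing.
    target = 1.0 - smoothing if a_preferred else smoothing
    # -log sigmoid(margin) = softplus(-margin); -log(1 - sigmoid(margin)) = softplus(margin)
    return target * softplus(-margin) + (1.0 - target) * softplus(margin)


# With smoothing, a very confident correct prediction is still penalised,
# which discourages overfitting to noisy pseudo-labels.
confident_plain = bt_loss(5.0, 0.0, a_preferred=True, smoothing=0.0)
confident_smooth = bt_loss(5.0, 0.0, a_preferred=True, smoothing=0.2)
```

Here `confident_smooth` is larger than `confident_plain`: the smoothed target keeps the optimum at a finite margin instead of pushing the score gap toward infinity.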
4. CMI-RewardBench + top-k Inference-time Scaling
The authors integrate heterogeneous resources into a unified held-out benchmark: PAM (scalar MUS/ALI), MusicEval (scalar MUS), Music Arena (pair-wise preferences), and CMI-Pref (pair-wise MUS/ALI across four modality conditions). Evaluation metrics include LCC/SRCC/Kendall-Tau for ratings and accuracy for preferences. Beyond evaluation, CMI-RM enables top-k filtering—generating multiple candidates and selecting the highest-scoring one—to achieve "inference-time scaling."
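Top-k filtering amounts to scoring each candidate generation and keeping the best ones. The sketch below assumes a hypothetical `score_fn` that returns \((s_{\text{MUS}}, s_{\text{ALI}})\) for a candidate, and it combines the two scores with a weighted sum; the source does not specify how the two scores are combined for selection, so `alignment_weight` is an illustrative parameter.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def top_k_filter(
    candidates: Sequence[T],
    score_fn: Callable[[T], Tuple[float, float]],  # hypothetical: returns (s_MUS, s_ALI)
    k: int = 1,
    alignment_weight: float = 0.5,  # assumed way to trade off the two heads
) -> List[T]:
    """Return the k candidates with the highest combined reward."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def combined(candidate: T) -> float:
        s_mus, s_ali = score_fn(candidate)
        return (1.0 - alignment_weight) * s_mus + alignment_weight * s_ali

    return sorted(candidates, key=combined, reverse=True)[:k]


# Toy scores standing in for reward-model outputs on three generations.
fake_scores = {
    "take_1": (3.0, 2.0),
    "take_2": (4.0, 4.5),
    "take_3": (4.5, 1.0),
}
best = top_k_filter(list(fake_scores), fake_scores.__getitem__, k=1)
```

The cost grows linearly with the number of candidates generated, which is why a ~30M scorer is attractive for this role: scoring is cheap relative to generation.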
Loss & Training
- Pair-wise preference: Bradley–Terry probability trained with cross-entropy; ties are excluded, and Stage 1 applies label smoothing of 0.2.
- Scalar ratings: Regression for \(y\in[1,5]\).
- Total objective: \(\mathcal{L}_{\text{total}}=\frac{1}{2}(\mathcal{L}_{\text{MUS}}+\mathcal{L}_{\text{ALI}})\), joint optimization of both heads.
- Strategy: Batch size 48; Stage 1 runs for 2k steps, and Stage 2 is stopped early after ~250 steps.
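
The objective listed above can be sketched as follows. Only the final averaging over the two heads comes from the source; the use of mean squared error for the regression term, the way pair-wise and rating terms are summed inside one head, and the `regression_weight` parameter are assumptions made for illustration.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    """Average of a batch, defined as 0.0 when the batch is empty."""
    return sum(values) / len(values) if values else 0.0


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error for scalar ratings y in [1, 5] (assumed regression loss)."""
    return mean([(p - t) ** 2 for p, t in zip(predictions, targets)])


def head_loss(
    pair_losses: Sequence[float],      # precomputed Bradley-Terry losses for this head
    rating_preds: Sequence[float],
    rating_targets: Sequence[float],
    regression_weight: float = 1.0,    # hypothetical weighting, not from the source
) -> float:
    """Loss for one head: pair-wise term plus weighted regression term."""
    return mean(pair_losses) + regression_weight * mse(rating_preds, rating_targets)


def total_loss(loss_mus: float, loss_ali: float) -> float:
    """L_total = (L_MUS + L_ALI) / 2, as stated in the source."""
    return 0.5 * (loss_mus + loss_ali)


# A batch with pair-wise labels for both heads and one scalar MUS rating.
loss_mus = head_loss([0.6, 0.4], rating_preds=[3.0], rating_targets=[4.0])
loss_ali = head_loss([0.5], rating_preds=[], rating_targets=[])
loss = total_loss(loss_mus, loss_ali)
```

Treating an empty batch as zero loss lets a head skip a label type it has no data for, which matches sources such as MusicEval that provide only musicality ratings.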
Key Experimental Results
Main Results
Evaluation sources and protocols (all held-out):
| Source | Label Type | Scale | Protocol |
|---|---|---|---|
| PAM | Scalar (MUS + Text-Align) | 500 | LCC / SRCC / K-Tau |
| MusicEval | Scalar (MUS MOS) | 413 | LCC / SRCC / K-Tau |
| Music Arena | Pair-wise Preference (MUS) | 1,340 | Accuracy |
| CMI-Pref | Pair-wise (MUS + ALI) | 500 | Accuracy |
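
The protocols in the table reduce to a few standard statistics. The sketch below implements them in plain Python for clarity; it assumes no tied values (Kendall's tau-a and simple ordinal ranks), so a benchmark run on real data would more likely rely on a statistics library that handles ties.

```python
import math
from typing import List, Sequence


def pairwise_accuracy(scores_a: Sequence[float], scores_b: Sequence[float],
                      a_preferred: Sequence[bool]) -> float:
    """Fraction of pairs where the model ranks A above B exactly when humans do."""
    hits = sum((sa > sb) == pref
               for sa, sb, pref in zip(scores_a, scores_b, a_preferred))
    return hits / len(a_preferred)


def pearson_lcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """Ordinal ranks starting at 0; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = float(rank)
    return result


def spearman_srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson_lcc(ranks(x), ranks(y))


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            balance += (sign > 0) - (sign < 0)
    return balance / (n * (n - 1) / 2)


# Toy check: model scores vs. human preferences on three pairs.
acc = pairwise_accuracy([0.9, 0.2, 0.5], [0.1, 0.8, 0.7], [True, False, True])
```

Accuracy applies to the pair-wise sources (Music Arena, CMI-Pref), while the three correlation measures apply to the scalar-rated sources (PAM, MusicEval).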
Dataset comparison (subset of Table 1, sample count refers to "pairs" for preference and "clips" for MOS):
| Dataset | Text | Lyrics | Audio Cond. | Samples | Models/APIs |
|---|---|---|---|---|---|
| PAM | ✔ | ✗ | ✗ | 500 | 5 |
| MusicEval | — | ✗ | ✗ | 2,748 | 31 |
| Music Arena | ✔ | ✔ | ✗ | 2,800 | 17 |
| CMI-Pref-Pseudo | ✔ | ✔ | ✔ | 110k | 23 |
| CMI-Pref | ✔ | ✔ | ✔ | 4,027 | 23 |
Key conclusion: Frontier MLLMs like Gemini-2.5-Pro struggle to exceed 80% human agreement, revealing a significant capability gap.
Ablation Study
| Setting | Function | Description |
|---|---|---|
| Stage 1 Only | Large-scale base | 2k steps / label smoothing 0.2 to handle pseudo-label noise |
| + Stage 2 | High-quality calibration | 6,647 samples for fine-grained human alignment |
| Frozen MuQ-MuLan Encoders | Parameter efficiency | ~30M trainable, matches/exceeds dedicated baselines like SongEval |
| top-k Filtering | Inference-time scaling | Uses CMI-RM to select high-quality candidates |
Key Findings
- Even frontier MLLMs (Gemini-2.5-Pro) fail to exceed 80% agreement, showing compositional music evaluation is far from solved.
- The ~30M CMI-RM covers all CMI-RewardBench settings in one architecture, performing comparably or better than specialized open-source baselines (e.g., SongEval).
- CMI-RM serves as both an evaluator and a "generation amplifier" via top-k filtering.
Highlights & Insights
- Turning the "Evaluation Lag" into a Data Problem: Instead of adding just another scorer, the authors built the entire ecosystem (a large pseudo-labeled set, a small expert-labeled set, and a unified benchmark). This "foundation first" approach can be replicated in any generation subfield where evaluation lags behind model capability.
- Frozen Encoders + Dual Transformer: Using frozen MuQ-MuLan encoders and zero tensors for missing modalities handles any modality combination with one architecture. The ~30M trainable parameters offer a template for lightweight multimodal reward modeling.
- Reward Model as Inference Amplifier: Top-k filtering provides a low-cost path to better output quality without retraining the generation model.
Limitations & Future Work
- Pseudo-labels distilled from Qwen3-Omni inherit LLM-judge biases, which limits the upper bound of what pseudo-label pre-training can achieve.
- Human agreement on the CMI-Pref test set is ~75%, reflecting the high subjective noise in compositional music preferences.
- Some resources (AIME, MusicPref) were excluded from the main benchmark due to split alignment issues.
Related Work & Insights
- vs MOS Predictors (PAM, SongEval): They score a single dimension under fixed input assumptions; CMI-RM outputs two dimensions and supports arbitrary modality combinations with few trainable parameters.
- vs Alignment Metrics (CLAP, MuQ-MuLan): These focus on text-to-audio; CMI-RM adds lyrics and audio conditions and treats alignment as a trainable human proxy.
- vs Preference Platforms (Music Arena): Mostly text-to-music; CMI-Pref is the first large-scale compositional preference dataset with lyrics and audio.
- vs LLM-as-a-judge: Such judges are often proprietary and not optimized for music; this work shows that MLLMs still have gaps and provides a parameter-efficient alternative.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to unify "Compositional Multimodal Instruction" for music evaluation with a full ecosystem.
- Experimental Thoroughness: ⭐⭐⭐⭐ Broad coverage across four sources; however, some per-task values are primarily in the appendix.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and architecture; rigorous formulas.
- Value: ⭐⭐⭐⭐⭐ Open-sourced data/weights/benchmark; high practical value for AIGC post-training.