Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering¶
Conference: ACL 2026 Findings
arXiv: 2604.09721
Code: None
Area: Audio & Speech / Music Understanding
Keywords: Music Question Answering, Multi-track Comparative Reasoning, Audio-Language Models, Benchmark Datasets, LLM-as-a-Judge
TL;DR¶
Constructs Jamendo-MT-QA, a multi-track comparative music QA benchmark containing 36,519 comparative QA pairs over 12,173 track pairs (three per pair), to systematically evaluate the cross-track comparative reasoning of audio-language models for the first time. The evaluation reveals significant deficiencies of existing models in sentence-level comparison generation.
Background & Motivation¶
Background: Music Question Answering (Music-QA) research primarily focuses on single-track understanding, such as tag prediction, captioning, and classification. However, listeners often describe music in a comparative manner (e.g., "This song is darker than the last one"), and existing benchmarks do not systematically evaluate cross-track comparative reasoning.
Limitations of Prior Work: (1) Single-track benchmarks might be driven by text cues rather than true audio perception; (2) Audio-language models (e.g., CLAP, MU-LLaMA) perform strongly on single-track tasks but lack evaluation for multi-track comparative reasoning; (3) There is a lack of datasets specifically designed for musical comparative reasoning.
Key Challenge: Existing Music-QA benchmarks cannot distinguish whether models truly understand audio content or rely on text shortcuts, nor can they evaluate cross-track relationship reasoning capabilities.
Goal: To build a systematic multi-track comparative QA benchmark to evaluate and expose the shortcomings of existing models.
Key Insight: Based on the Jamendo-QA dataset, utilize LLMs to assist in generating three types of comparative questions (Yes/No, Short-answer, Sentence-level), and implement quality control through human evaluation + LLM-as-a-Judge.
Core Idea: Construct a high-quality comparative QA benchmark through an LLM-assisted four-stage pipeline (Music Captioning → Single-track QA Expansion → Multi-track Comparative QA Generation → Quality Filtering).
Method¶
Overall Architecture¶
A four-stage construction process: Stage 1 uses Music Flamingo to generate high-quality captions for each track; Stage 2 uses GPT-5.1 to expand these into single-track QA pairs; Stage 3 uses GPT-5 mini to generate three types of comparative questions (Yes/No, Short-answer, Sentence-level) for each track pair; Stage 4 performs quality filtering via human evaluation and LLM-as-a-Judge. While Stages 1-2 reuse existing models as scaffolding, Stages 3-4 are the two core design contributions of this work. After the benchmark is constructed, a dual-pathway baseline is used for diagnostic evaluation.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
A["Jamendo-QA Track Library"] --> B["Stage 1: Music Flamingo<br/>Generates Music Captions"]
B --> C["Stage 2: GPT-5.1<br/>Expands Single-track QA"]
C --> D["Multi-type Comparative Question Design<br/>GPT-5 mini generates Yes/No · Short · Sentence-level questions"]
D --> E["LLM-as-a-Judge Quality Control<br/>Large-scale review after human alignment<br/>Keep semantic scores 5/5/5"]
E --> F["Jamendo-MT-QA Benchmark<br/>36,519 Comparative QA Pairs"]
subgraph EVAL["Dual-Pathway Baseline Evaluation"]
direction TB
G["Multi-audio Baseline<br/>GPT-4o Audio / Qwen3-Omni<br/>Directly handles dual audio input"]
H["Caption Baseline<br/>Generate captions first, then compare via LLM"]
end
F --> EVAL
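
The four construction stages can be sketched as a small driver function. This is a minimal sketch, not the authors' code: `build_benchmark`, `ComparativeQA`, and the four injected callables (`caption_fn`, `single_qa_fn`, `compare_fn`, `keep_fn`) are hypothetical names standing in for Music Flamingo, GPT-5.1, GPT-5 mini, and the quality filter. How track pairs are sampled is not described here, so the pairs are taken as an input.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ComparativeQA:
    track_a: str
    track_b: str
    qtype: str  # "yes_no", "short", or "sentence"
    question: str
    answer: str


def build_benchmark(
    pairs: Iterable[tuple[str, str]],
    caption_fn: Callable[[str], str],             # Stage 1 stand-in (captioning)
    single_qa_fn: Callable[[str], list[str]],     # Stage 2 stand-in (single-track QA)
    compare_fn: Callable[[str, str, dict[str, list[str]]], list[ComparativeQA]],  # Stage 3
    keep_fn: Callable[[list[ComparativeQA]], bool],  # Stage 4 stand-in (quality filter)
) -> list[ComparativeQA]:
    """Run the four stages over the given track pairs (illustrative only)."""
    single_qa: dict[str, list[str]] = {}
    kept: list[ComparativeQA] = []
    for a, b in pairs:
        for track in (a, b):
            if track not in single_qa:  # caption and expand each track only once
                caption = caption_fn(track)
                single_qa[track] = single_qa_fn(caption)
        group = compare_fn(a, b, single_qa)  # one QA group per track pair
        if keep_fn(group):                   # drop the whole group if it fails review
            kept.extend(group)
    return kept
```

Passing the stages in as callables keeps the sketch runnable with stubs and makes it explicit that the models themselves are outside its scope.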
Key Designs¶
- Multi-type Comparative Question Design: Three types of questions are generated for each track pair: Yes/No questions (e.g., "Is Track A faster than Track B?"), Short-answer questions (selecting the track that matches a specific description), and Sentence-level questions (requiring a full comparative analysis). Together they cover a difficulty gradient from simple judgment to complex reasoning. Human evaluation shows that sentence-level questions are significantly more difficult than the other two categories.
- LLM-as-a-Judge Quality Control: GPT-5 mini scores were first compared with human ratings on 300 samples to verify that the LLM review follows human judgment trends on the three semantic quality criteria (LLM vs. human: Correctness 4.87 vs. 4.79; Comparative Validity 4.61 vs. 4.83; Reasoning Quality 4.37 vs. 4.78). The LLM review was then applied to the full dataset, retaining only QA groups that scored 5/5/5 on all three semantic criteria.
- Dual-pathway Baseline Evaluation: Two types of baselines were designed: the Multi-audio baseline (e.g., GPT-4o Audio or Qwen3-Omni directly processing both audio inputs) and the Caption baseline (e.g., Music Flamingo generating a caption per track, which an LLM then compares). Contrasting the two separates the contribution of multi-audio perception from that of high-level semantic reasoning.
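
The 5/5/5 retention rule and the judge-versus-human comparison described in the list above reduce to a few lines of code. The sketch below is illustrative: `JudgeScores`, `keep_group`, `filter_groups`, and `judge_minus_human` are hypothetical names, and the alignment figures are read as (LLM judge, human) means, which is an assumption about the order in which the numbers are reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScores:
    correctness: int
    comparative_validity: int
    reasoning_quality: int


def keep_group(scores: JudgeScores, required: int = 5) -> bool:
    """Keep a QA group only if every semantic criterion reaches `required`."""
    return all(
        s >= required
        for s in (scores.correctness, scores.comparative_validity, scores.reasoning_quality)
    )


def filter_groups(judged: dict) -> dict:
    """Return the subset of {group_id: JudgeScores} that passes the 5/5/5 rule."""
    return {gid: s for gid, s in judged.items() if keep_group(s)}


# Mean scores on the 300 alignment samples, assumed to be (LLM judge, human).
ALIGNMENT = {
    "correctness": (4.87, 4.79),
    "comparative_validity": (4.61, 4.83),
    "reasoning_quality": (4.37, 4.78),
}


def judge_minus_human(alignment: dict) -> dict:
    """Signed gap per criterion; negative means the LLM judge scored lower."""
    return {name: round(llm - human, 2) for name, (llm, human) in alignment.items()}


if __name__ == "__main__":
    print(judge_minus_human(ALIGNMENT))
    # {'correctness': 0.08, 'comparative_validity': -0.22, 'reasoning_quality': -0.41}
```

Under that reading, the judge is slightly more lenient than humans on correctness and stricter on the other two criteria, which would make a strict 5/5/5 cut-off conservative.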
Loss & Training¶
This work focuses on benchmark construction and does not involve model training. Evaluation metrics include Accuracy for Yes/No and Short-answer, and BLEU, ROUGE-1/2/L, BERTScore, and LLM-as-a-Judge (1-5 scale) for sentence-level questions.
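
To make the metrics concrete, here is a simplified sketch of two of them: exact-match accuracy for Yes/No and Short-answer questions, and a unigram-overlap F1 in the spirit of ROUGE-1 for sentence-level answers. It assumes lowercase whitespace tokenization and case-insensitive matching, which may differ from the paper's actual evaluation setup; a real evaluation would use an established ROUGE implementation.

```python
from collections import Counter


def accuracy(predictions: list, golds: list) -> float:
    """Case-insensitive exact-match accuracy."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, golds))
    return hits / len(golds)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (simplified ROUGE-1, whitespace tokens, no stemming)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(accuracy(["Yes", "no"], ["yes", "yes"]))                   # 0.5
    print(rouge1_f1("track a is faster", "track a is slower"))       # 0.75
```

The second example shows why overlap metrics are paired with an LLM judge: the two sentences share three of four words yet state opposite comparisons.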
Key Experimental Results¶
Main Results (Full 12,173 Track Pairs)¶
| Model | Type | Yes/No Acc | Short Acc | BLEU | BERT-F1 | GPT Judge | Claude Judge |
|---|---|---|---|---|---|---|---|
| Music Flamingo | Cap | 77.4% | 89.7% | 4.00 | 0.879 | 3.24 | 3.87 |
| Qwen2-Audio | Cap | 37.4% | 39.1% | 1.88 | 0.849 | 1.49 | 1.53 |
| MU-LLaMA | Cap | 20.6% | 55.3% | 2.39 | 0.857 | 2.36 | 2.01 |
| Qwen2-Audio | Multi | 50.9% | 80.2% | 2.09 | 0.847 | 1.37 | 1.62 |
| Qwen3-Omni | Multi | 62.9% | 80.3% | 3.58 | 0.863 | 3.11 | 3.48 |
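
The two values in the Type column (Cap and Multi) correspond to two ways of answering the same question. A minimal sketch follows, with the models passed in as plain callables: `captioner`, `text_llm`, and `audio_llm` are hypothetical stand-ins, and the prompt wording is an assumption rather than the prompt used in the paper.

```python
from typing import Callable, Sequence


def caption_pathway(
    audio_a: bytes,
    audio_b: bytes,
    question: str,
    captioner: Callable[[bytes], str],
    text_llm: Callable[[str], str],
) -> str:
    """Cap: caption each track separately, then let a text-only LLM compare."""
    caption_a = captioner(audio_a)
    caption_b = captioner(audio_b)
    prompt = (
        f"Track A: {caption_a}\n"
        f"Track B: {caption_b}\n"
        f"Question: {question}"
    )
    return text_llm(prompt)


def multi_audio_pathway(
    audio_a: bytes,
    audio_b: bytes,
    question: str,
    audio_llm: Callable[[Sequence[bytes], str], str],
) -> str:
    """Multi: hand both audio clips and the question to one audio-language model."""
    return audio_llm([audio_a, audio_b], question)
```

In the caption pathway the comparing model never hears the audio, so any cross-track judgment has to survive the caption bottleneck; in the multi-audio pathway perception and comparison happen inside one model.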
Ablation Study¶
- The Caption baseline (Music Flamingo) reached 77.4% on Yes/No, outperforming most multi-audio baselines, suggesting that high-quality captioning + text reasoning is a viable path.
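- Qwen2-Audio's multi-audio mode improved on its caption mode by 13.5 points on Yes/No questions (50.9% vs. 37.4%) and by 41.1 points on Short-answer questions (80.2% vs. 39.1%).
- No model scored above \(3.87/5\) from either LLM judge on sentence-level questions; the best, Music Flamingo, reached 3.24 (GPT Judge) and 3.87 (Claude Judge), showing that sentence-level comparative reasoning remains far from solved.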
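
The caption-versus-multi-audio comparison in the bullets above is plain arithmetic on the main results table. The sketch below reproduces it from the accuracy columns; the dictionary layout and the function names `mode_gain` and `best_on` are illustrative choices, while the numbers are copied from the table.

```python
# Accuracy (%) copied from the main results table: (model, type) -> metrics.
RESULTS = {
    ("Music Flamingo", "Cap"): {"yes_no": 77.4, "short": 89.7},
    ("Qwen2-Audio", "Cap"): {"yes_no": 37.4, "short": 39.1},
    ("MU-LLaMA", "Cap"): {"yes_no": 20.6, "short": 55.3},
    ("Qwen2-Audio", "Multi"): {"yes_no": 50.9, "short": 80.2},
    ("Qwen3-Omni", "Multi"): {"yes_no": 62.9, "short": 80.3},
}


def mode_gain(model: str) -> dict:
    """Multi minus Cap, in percentage points, for a model evaluated in both modes."""
    cap = RESULTS[(model, "Cap")]
    multi = RESULTS[(model, "Multi")]
    return {metric: round(multi[metric] - cap[metric], 1) for metric in cap}


def best_on(metric: str) -> tuple:
    """Return the (model, type) entry with the highest value for `metric`."""
    return max(RESULTS, key=lambda key: RESULTS[key][metric])


if __name__ == "__main__":
    print(mode_gain("Qwen2-Audio"))  # {'yes_no': 13.5, 'short': 41.1}
    print(best_on("yes_no"))         # ('Music Flamingo', 'Cap')
```

Only Qwen2-Audio appears in both modes in the table, so it is the only model for which a per-model mode gain can be computed here.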
Key Findings¶
- The Caption baseline Music Flamingo performed best overall, indicating that current multi-audio models have not yet fully exploited the advantages of audio input.
- Sentence-level comparative generation is the primary bottleneck, requiring cross-track multi-attribute integration and coherent natural language expression.
- 92.9% of cross-genre track pairs were retained in the dataset, showing that the filtering strategy does not harm diversity.
- Human difficulty ratings confirmed the gradient: Sentence-level > Short-answer > Yes/No.
Highlights & Insights¶
- Filling the Comparative Reasoning Gap: The first music QA benchmark specifically designed to evaluate cross-track comparative reasoning.
- Diagnostic Design: The dual-pathway baseline (Caption vs. Multi-audio) effectively decouples perception capabilities from reasoning capabilities.
- Quality Control Innovation: The large-scale LLM review process, validated by human-LLM alignment, can be generalized to the construction of other datasets.
Limitations & Future Work¶
- Comparative questions are generated based on captions and metadata rather than directly from audio, which may introduce textual bias.
- Evaluation of sentence-level questions relies on LLM Judge, whose reliability still has room for improvement.
- The comparative understanding capabilities of generative music models were not evaluated.
- Future work could extend to comparative reasoning across more than two tracks.
Related Work & Insights¶
- The benchmark carries relational reasoning concepts from multi-hop QA (HotpotQA, DROP) over to the audio domain.
- The LLM-as-a-Judge evaluation paradigm is becoming popular in NLP benchmarks.
- The same construction recipe could inform comparative reasoning benchmarks for other modalities (e.g., video comparative QA).
Rating¶
- Novelty: ⭐⭐⭐⭐ The problem definition of multi-track comparative reasoning is novel, with a refined dataset construction methodology.
- Experimental Thoroughness: ⭐⭐⭐⭐ Baseline evaluations across multiple models are sufficient, though some models used subsets due to computational costs.
- Writing Quality: ⭐⭐⭐⭐ Pipeline descriptions are clear and the quality control process is detailed.
- Value: ⭐⭐⭐⭐ Provides an important comparative reasoning benchmark and diagnostic tool for the music understanding field.