MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Conference: ICLR 2026
arXiv: 2506.04779
Code: https://huggingface.co/datasets/ddwang2000/MMSU
Area: Audio & Speech
Keywords: Speech Understanding, SpeechLLM, Linguistics Benchmark, Multi-task Evaluation, Perception and Reasoning

TL;DR

This paper introduces MMSU (5,000 audio QA items across 47 tasks), the first benchmark to systematically incorporate linguistic theory into spoken language understanding and reasoning evaluation. Evaluating 22 SpeechLLMs, it reveals significant gaps in phonological perception and complex reasoning among existing models.

Background & Motivation

Background: SpeechLLMs (e.g., Qwen-Audio, Kimi-Audio, Gemini) have demonstrated strong capabilities in processing audio inputs, achieving impressive performance on ASR and audio understanding tasks. However, their abilities in fine-grained speech perception and complex reasoning remain systematically unevaluated.

Limitations of Prior Work: Existing speech benchmarks suffer from three major shortcomings:
- Narrow coverage: Primarily focused on semantic-level tasks, neglecting non-linguistic phenomena common in everyday speech (hesitations, sarcasm, self-corrections, prosodic variations, etc.)
- Insufficient data authenticity: Heavy reliance on TTS-synthesized speech, lacking the acoustic diversity of real human speech
- Absence of linguistic theory: Evaluation designs do not consider foundational principles from phonetics, prosody, rhetoric, and related fields, resulting in systematic blind spots

Key Challenge: Genuine spoken language understanding requires not only comprehending what is said (semantics), but also how it is said (prosody, emotion) and what is truly meant (pragmatics)—dimensions that existing benchmarks fail to assess.

Goal: To construct a comprehensive, linguistically grounded evaluation framework that systematically assesses SpeechLLM capabilities along both perception and reasoning dimensions.

Key Insight: A top-down task taxonomy is designed based on a structured linguistic theoretical framework spanning phonetics, prosody, rhetoric, syntax, semantics, and paralinguistics.

Core Idea: Systematically integrate linguistic theory into speech benchmark design, creating a comprehensive evaluation framework across 47 tasks that exposes critical weaknesses of SpeechLLMs in phonological perception and reasoning.

Method

Overall Architecture

MMSU comprises 5,000 expert-annotated multiple-choice questions (MCQs) covering 47 tasks, organized in a three-level hierarchy:
- Level 1: Perception (24 tasks) vs. Reasoning (23 tasks)
- Level 2: Linguistics vs. Paralinguistics
- Level 3: Semantics / Phonology / Speaker Traits / Speaking Style
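
The three-level hierarchy can be sketched as a small data structure. This is only an illustration: `TaskLabel`, `LEVELS`, and the snake_case value names are hypothetical and are not taken from the MMSU dataset schema, and the example label at the end is made up rather than an actual task assignment.

```python
from dataclasses import dataclass

# Allowed values at each level, copied from the hierarchy described above.
LEVELS = {
    "capability": ("perception", "reasoning"),          # Level 1
    "domain": ("linguistics", "paralinguistics"),       # Level 2
    "category": ("semantics", "phonology",
                 "speaker_traits", "speaking_style"),   # Level 3
}

# Task counts per Level 1 branch, as reported above.
TASKS_PER_CAPABILITY = {"perception": 24, "reasoning": 23}


@dataclass(frozen=True)
class TaskLabel:
    """Hypothetical container for one task's position in the hierarchy."""
    capability: str
    domain: str
    category: str

    def __post_init__(self):
        # Reject any value that is not part of the taxonomy.
        for level, allowed in LEVELS.items():
            value = getattr(self, level)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a valid {level}")


# Illustrative label only; it does not describe a specific MMSU task.
example = TaskLabel("perception", "paralinguistics", "speaking_style")
print(example, sum(TASKS_PER_CAPABILITY.values()))
```

The sketch does not enforce which Level 3 categories belong to which Level 2 domain, since that mapping is not spelled out here.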

Key Designs

Fine-grained Acoustic Feature Coverage:
- Function: Covers non-linguistic sounds (crying, coughing), accents (Indian, British), emotional states, prosodic features (stress, lengthening, pauses), and intonation variation
- Mechanism: Dedicated tasks are designed for each dimension based on sub-field theories within phonetics
- Design Motivation: To fill the gap left by existing benchmarks in acoustic feature coverage
High-quality Data Assurance:
- Function: Prioritizes authentic speech data, supplemented by professional voice actor recordings and a small number of multi-speaker additions
- Mechanism: A four-stage pipeline—linguistic framework design → question collection and option augmentation → audio acquisition → human review (10 annotators, multiple review rounds)
- Design Motivation: TTS-synthesized speech cannot capture the subtle acoustic characteristics of human speech
Systematic Integration of Linguistic Theory:
- Function: First benchmark to include tasks such as tongue-twister comprehension, sarcasm detection, homophone reasoning, intonation inference, and couplet matching
- Mechanism: Tasks are derived from six sub-disciplines: phonetics, prosody, rhetoric, syntax, semantics, and paralinguistics
- Design Motivation: To move evaluation beyond surface-level semantics toward a deeper, multi-layered linguistic understanding

Loss & Training

Not applicable (this is a benchmark paper). Evaluation uses unified instruction prompts with randomized option ordering to mitigate position bias.
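
Randomized option ordering can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the prompt wording, and the sample question are all assumptions. The key point is that the index of the correct answer has to be tracked through the shuffle.

```python
import random
import string


def shuffle_options(options, answer_index, rng):
    """Shuffle options and return (shuffled_options, new_answer_index)."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[k] is the original index now shown at position k
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def format_prompt(question, options):
    """Build one instruction prompt; the wording is a placeholder, not the paper's."""
    lines = [question]
    lines += [f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


rng = random.Random(0)  # fixed seed so the ordering is reproducible across runs
options = ["happy", "sad", "angry", "neutral"]
shuffled, new_index = shuffle_options(options, answer_index=2, rng=rng)
print(format_prompt("Which emotion does the speaker convey?", shuffled))
print("Correct letter:", string.ascii_uppercase[new_index])
```

Using the same prompt template for every model and a seeded generator keeps the comparison fair while still varying where the correct option appears.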

Key Experimental Results

Main Results

Model	Size	Perception Avg (%)	Reasoning Avg (%)	Overall Avg (%)
Human	-	91.24	86.77	89.72
Gemini-2.0-Flash	-	57.51	68.15	62.63
GPT-4o-Audio	-	57.30	66.62	61.67
Qwen2.5-Omni-7B	7B	53.26	69.99	61.25
Kimi-Audio	7B	43.52	76.03	59.28
Qwen2.5-Omni-3B	3B	42.37	72.76	56.83
MiniCPM-O	8.6B	40.54	73.57	56.53
MERaLiON	10B	35.74	73.68	54.10
SALMONN	7B	29.83	30.04	30.01
Random Guess	-	25.02	25.37	25.37
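
One way to read the table is to compute, for each row, how far the reasoning average sits above the perception average. The sketch below copies the values from the table; the `spread` helper is a name introduced here for illustration. Under this reading, every listed model scores higher on reasoning than on perception, while the human baseline shows the opposite pattern.

```python
# (perception, reasoning, overall) accuracy in percent, copied from the table above.
SCORES = {
    "Human": (91.24, 86.77, 89.72),
    "Gemini-2.0-Flash": (57.51, 68.15, 62.63),
    "GPT-4o-Audio": (57.30, 66.62, 61.67),
    "Qwen2.5-Omni-7B": (53.26, 69.99, 61.25),
    "Kimi-Audio": (43.52, 76.03, 59.28),
    "Qwen2.5-Omni-3B": (42.37, 72.76, 56.83),
    "MiniCPM-O": (40.54, 73.57, 56.53),
    "MERaLiON": (35.74, 73.68, 54.10),
    "SALMONN": (29.83, 30.04, 30.01),
}


def spread(name):
    """Reasoning average minus perception average, in percentage points."""
    perception, reasoning, _ = SCORES[name]
    return round(reasoning - perception, 2)


models = [name for name in SCORES if name != "Human"]
for name in sorted(models, key=spread, reverse=True):
    print(f"{name:18s} {spread(name):+6.2f}")
print(f"{'Human':18s} {spread('Human'):+6.2f}")
```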

Per-Dimension Breakdown

Dimension	Best Model	Best Model Accuracy	Human Accuracy	Gap (points)
Perception–Semantics	Kimi-Audio	57.64%	87.10%	−29.5
Perception–Phonology	Qwen2-Audio	44.93%	94.32%	−49.4
Perception–Paralinguistics	Qwen2.5-Omni-3B	39.19%	92.88%	−53.7
Reasoning–Semantics	Qwen2.5-Omni-7B	81.52%	82.16%	−0.6
Reasoning–Phonology	Qwen2.5-Omni-7B	82.39%	87.60%	−5.2
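
The gap column is the best-model accuracy minus the human accuracy, rounded to one decimal place. The short sketch below recomputes it from the table values and sorts the dimensions from widest to narrowest gap; the dictionary keys use ASCII hyphens purely for convenience.

```python
# (best-model accuracy, human accuracy) in percent, copied from the table above.
DIMENSIONS = {
    "Perception-Semantics": (57.64, 87.10),
    "Perception-Phonology": (44.93, 94.32),
    "Perception-Paralinguistics": (39.19, 92.88),
    "Reasoning-Semantics": (81.52, 82.16),
    "Reasoning-Phonology": (82.39, 87.60),
}


def gap(name):
    """Best-model accuracy minus human accuracy, in percentage points."""
    model, human = DIMENSIONS[name]
    return round(model - human, 1)


# Most negative gap first, i.e. the dimension where models trail humans the most.
for name in sorted(DIMENSIONS, key=gap):
    print(f"{name:28s} {gap(name):+6.1f}")
```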

Key Findings

- Large human–machine gap: The best model (Gemini-2.0-Flash) reaches an overall accuracy of 62.63%, compared to 89.72% for humans, a gap of about 27 percentage points
- Phonological perception is a major bottleneck: The best model reaches only 44.93% on the Perception–Phonology dimension, nearly 50 points below human performance; the table above shows an even larger gap (about 54 points) on Perception–Paralinguistics
- Reasoning outperforms perception: Models approach human-level performance on semantic reasoning (81.52% vs. 82.16%) but fall well short on perception tasks that require integrating acoustic cues
- Closed-source models show no clear advantage: Gemini-2.0-Flash (62.63%) and GPT-4o-Audio (61.67%) only marginally outperform the open-source Qwen2.5-Omni-7B (61.25%) overall. Size is not decisive either: the 3B Qwen2.5-Omni scores higher on perception (42.37%) than the 10B MERaLiON (35.74%), suggesting that perception capabilities do not scale proportionally with model size
- End-to-end models outperform cascade models: Models that directly process audio outperform those relying on ASR transcription followed by text-based understanding

Highlights & Insights

- The first benchmark to systematically incorporate linguistic theory into spoken language understanding evaluation, yielding task designs with genuine disciplinary depth
- The 47-task coverage substantially exceeds that of prior benchmarks such as MMAU (27 tasks)
- A key insight: SpeechLLMs' reasoning capabilities already approach human-level performance, whereas their perceptual capabilities, particularly phonological perception, lag far behind
- High data quality is ensured through prioritization of authentic speech, expert review, and multi-round annotation

Limitations & Future Work

- Currently limited to English; multilingual coverage remains to be extended
- The four-option MCQ format may not fully reflect open-ended spoken language understanding ability
- Some tasks have limited sample sizes (~100 items per task), so per-task differences between models should be checked for statistical significance before being interpreted
- Multi-turn conversational speech understanding scenarios are not included
- Further analysis of error types and patterns could guide targeted model improvement
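
To make the sample-size concern concrete, the sketch below computes the half-width of a 95% normal-approximation (Wald) confidence interval for an accuracy measured on n items. The choice of the Wald interval and the assumption of independent items are simplifications made here, not something the paper specifies. At 50% accuracy on 100 items the interval is roughly ±9.8 points, so small per-task differences between models are hard to distinguish from noise.

```python
import math


def wald_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation interval for accuracy p on n items."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Compare a single task (~100 items) with the full benchmark (5,000 items).
for n in (100, 5000):
    for p in (0.25, 0.50, 0.80):
        hw = 100 * wald_halfwidth(p, n)
        print(f"n={n:5d}  accuracy={100 * p:4.0f}%  +/- {hw:4.1f} points")
```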

Related Work & Insights

- MMSU is complementary to benchmarks such as VoiceBench, MMAU, and AIR-Bench, being the first to cover prosody, intonation, and rhetorical dimensions
- The finding that "perception ≠ reasoning" provides an important direction for SpeechLLM training strategies: acoustic perception capabilities should be a primary focus of improvement
- MMSU offers a new paradigm for multimodal evaluation: using disciplinary theory to guide benchmark design, avoiding the passive approach of "evaluating only what is readily available"

Rating

Novelty: ⭐⭐⭐⭐ — First systematic application of linguistic theory to speech benchmark design
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 22 models, 47 tasks, with human baselines; evaluation is exceptionally comprehensive
Writing Quality: ⭐⭐⭐⭐ — Well-structured with a coherent task taxonomy
Value: ⭐⭐⭐⭐⭐ — Reveals critical bottlenecks in SpeechLLMs and provides an important evaluation infrastructure for the community