See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models¶
Conference: CVPR 2026 · arXiv: 2512.02231 · Project page: https://plnguyen2908.github.io/AV-SpeakerBench-project-page/ · Area: Multimodal VLM / Audiovisual Understanding · Keywords: audiovisual reasoning, speaker-centric benchmark, multimodal fusion, speech understanding, temporal localization
TL;DR¶
This paper introduces AV-SpeakerBench, a speaker-centric audiovisual reasoning benchmark of 3,212 multiple-choice questions. Results reveal Gemini 2.5 Pro's clear lead in audiovisual fusion while exposing substantial deficiencies of open-source models in speaker reasoning.
Background & Motivation¶
- Background: Multimodal large language models have expanded from image–text to video and audio understanding, with increasing pursuit of unified processing of visual, audio, and linguistic modalities.
- Limitations of Prior Work: Many existing video benchmarks (e.g., Video-MME) can be answered using visual information alone; audiovisual benchmarks either focus on non-speech sound events (AVQA) or coarse-grained classification (VGGSounder), without evaluating fine-grained speaker reasoning.
- Key Challenge: No benchmark systematically evaluates whether models can jointly determine who is speaking, what was said, and when it was said.
- Goal: To construct an audiovisual reasoning benchmark centered on the speaker as the core reasoning unit.
- Key Insight: Fusion-driven question design that embeds audiovisual dependency into the semantics of questions and answer choices.
- Core Idea: Every question requires cross-modal fusion to answer — for example, associating spoken phrases with visible speakers or localizing speech based on visual events.
Method¶
Overall Architecture¶
An IRB-approved benchmark consisting of 2,051 video clips and 3,212 four-choice multiple-choice questions spanning 12 task types. Videos are sourced from YouTube (movie clips, game shows, street interviews, etc.).
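To make the benchmark's structure concrete, the minimal sketch below shows one plausible way a single item could be represented. The field names and example content are assumptions for illustration only and are not the paper's released data schema.

```python
# Hypothetical record layout for one AV-SpeakerBench item (illustration only;
# the actual release format may differ).
from dataclasses import dataclass
from typing import List

@dataclass
class SpeakerBenchItem:
    clip_id: str        # YouTube-derived video clip identifier
    task: str           # one of the 12 task types, e.g. "speaker_identification"
    category: str       # "speaker-centric" | "vision-centric" | "audio-centric"
    question: str       # fusion-driven question text
    options: List[str]  # four answer choices; distractors come from the same clip
    answer: str         # correct choice label, e.g. "A"

item = SpeakerBenchItem(
    clip_id="yt_000123",
    task="speaker_identification",
    category="speaker-centric",
    question="Who says 'we should leave now' right after the man in the red jacket stands up?",
    options=["The woman at the counter", "The man in the red jacket",
             "The host off-screen", "The boy near the door"],
    answer="A",
)
```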
Key Designs¶
- Speaker-Centric Task Design:
- Function: Shifts evaluation from scene-level understanding to person-centric audiovisual localization.
- Mechanism: The 12 tasks are grouped into three categories: speaker-centric (detection, identification, counting), vision-centric (attribute recognition, activity recognition, counting), and audio-centric (recognition, duration, pitch, speaking rate, intensity, counting). Each task includes at least 200 validated questions.
- Design Motivation: To cover diverse speaker reasoning patterns, from basic perception to temporal reasoning.
- Fusion-Driven Question Design:
- Function: Ensures that each question requires genuine audiovisual fusion.
- Mechanism: Audiovisual dependencies are embedded in question semantics via: (1) associating spoken phrases with visible identities; (2) localizing speech through visual events; (3) combining audiovisual cues in multi-speaker scenes. Distractors are drawn from entities and events within the same clip.
- Design Motivation: Prevents models from answering correctly using a single modality alone.
- Expert-Curated Annotation Pipeline:
- Function: Ensures annotation quality and cross-modal validity.
- Mechanism: Annotators are experienced researchers rather than crowdworkers. A multi-stage refinement process includes: (1) initial review by independent researchers; (2) language model polishing; (3) final review by at least two additional researchers. Ambiguous or single-modality-solvable samples are filtered out.
- Design Motivation: Ensures all retained questions exhibit temporal sensitivity and speaker localization requirements.
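The filtering step above (discarding ambiguous or single-modality-solvable samples) can be made concrete with a hypothetical automated proxy. The paper's pipeline relies on expert review rather than such a check; the helper functions `ask_vision_only` and `ask_audio_only` below are assumptions standing in for a model restricted to one modality.

```python
# Hypothetical proxy for the "single-modality-solvable" filter described above.
# ask_vision_only / ask_audio_only are assumed wrappers around a unimodal pass of a model.

def needs_fusion(item, ask_vision_only, ask_audio_only) -> bool:
    """Return True if neither single-modality attempt answers the question correctly."""
    vision_pred = ask_vision_only(item.clip_id, item.question, item.options)
    audio_pred = ask_audio_only(item.clip_id, item.question, item.options)
    # Keep the item only if both unimodal attempts fail, i.e. fusion is genuinely required.
    return vision_pred != item.answer and audio_pred != item.answer

# Items flagged False would be revised or discarded during curation.
```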
Loss & Training¶
This is a pure evaluation benchmark with no training involved. Human baselines are established by graduate researchers.
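Evaluation therefore reduces to accuracy over four-choice questions, reported per category and overall. The sketch below assumes a list of (category, predicted choice, gold choice) tuples; this input format is an assumption, not the paper's actual interface.

```python
# Minimal scoring sketch for a multiple-choice benchmark: per-category and overall accuracy.
from collections import defaultdict

def score(results):
    correct, total = defaultdict(int), defaultdict(int)
    for category, pred, gold in results:
        total[category] += 1
        correct[category] += int(pred == gold)
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall

per_cat, overall = score([
    ("speaker-centric", "B", "B"),
    ("vision-centric", "A", "C"),
    ("audio-centric", "D", "D"),
])
print(per_cat, overall)  # per-category accuracies and overall ≈ 66.7
```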
Key Experimental Results¶
Main Results¶
| Model | Speaker-Centric | Vision-Centric | Audio-Centric | Overall |
|---|---|---|---|---|
| Gemini 2.5 Pro | 76.7 | 71.5 | 72.9 | 73.0 |
| Qwen3-Omni-30B | 54.5 | 51.8 | 53.7 | 54.1 |
| Gemini 2.0 Flash | 57.2 | 54.8 | 51.5 | 53.2 |
| Human | 94.4 | 93.5 | 92.3 | 93.7 |
Ablation Study¶
| Configuration | Gemini 2.5 Pro | Qwen3-Omni | Notes |
|---|---|---|---|
| Vision Only | ~55–60% | ~50–55% | Basic visual capability |
| Audio + Vision | ~70–80% | ~50–55% | Gemini gains 10–20 pp |
| Audio Gain | +10–20 pp | 0 or negative | Core fusion capability gap |
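The "Audio Gain" row is simply the paired difference, in percentage points, between audio+vision and vision-only accuracy on the same question set. A minimal sketch, with illustrative numbers rather than the paper's exact values:

```python
# Modality-ablation metric implied by the table: audio gain = AV accuracy - vision-only accuracy.

def audio_gain(acc_audio_vision: float, acc_vision_only: float) -> float:
    """Percentage-point improvement from adding the audio track."""
    return acc_audio_vision - acc_vision_only

print(audio_gain(73.0, 57.0))  # illustrative strong-fusion model: +16 pp
print(audio_gain(54.0, 53.0))  # illustrative weak-fusion model: +1 pp (marginal gain)
```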
Key Findings¶
- Adding audio input yields a consistent 10–20 pp improvement for Gemini 2.5 Pro, while gains for Qwen3-Omni are marginal or even negative.
- Error analysis identifies audio perception errors and temporal localization errors as the primary failure modes.
- All models exhibit accuracy degradation as the number of visible speakers increases, with multi-speaker scenes posing the greatest challenge.
- Early open-source audiovisual models (Video-LLaMA, PandaGPT) perform near chance level.
Highlights & Insights¶
- Fusion Capability Diagnosis: Modality ablation experiments clearly reveal the fusion capability gap across different models.
- Error Taxonomy: A systematic categorization of failure types is provided, including visual/audio perception errors, cross-modal attribution errors, and temporal localization errors.
- Design Rationale: The authors acknowledge that strong models may partially answer questions via visual cues (e.g., lip motion), framing this as a legitimate capability rather than a benchmark flaw.
Limitations & Future Work¶
- In certain tasks, strong vision-only models can answer questions without audio, which, while acknowledged as a valid capability, reduces the necessity of the audio modality.
- All videos are sourced from YouTube, with scenes predominantly from film and television.
- The evaluation currently covers a limited number of open-source audiovisual models.
Related Work & Insights¶
- vs. Video-MME: Questions in Video-MME can largely be answered using vision alone, whereas AV-SpeakerBench enforces audiovisual fusion.
- vs. WorldSense: WorldSense focuses on scene–acoustic matching, while AV-SpeakerBench targets speaker–speech binding.
Rating¶
- Novelty: ⭐⭐⭐⭐ Fills a gap in speaker-centric audiovisual reasoning evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers both open- and closed-source models with comprehensive modality ablation and error analysis.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with illuminating case studies.
- Value: ⭐⭐⭐⭐⭐ Provides a much-needed diagnostic tool for the development of audiovisual fusion models.