AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs¶

Conference: ICML2026
arXiv: 2606.07643
Code: Project Homepage
Area: Multimodal VLM / Audio-Visual Benchmark
Keywords: Audio-Visual Intelligence, Omni-MLLM, Cognitive Hierarchical Benchmark, Cross-modal grounding, Primitive Sensation

TL;DR¶

AVI-Bench is an audio-visual benchmark inspired by human cognition. It organizes the evaluation of Omni-MLLMs into three stages: "Perception → Understanding → Reasoning," supplemented by a "Primitive Sensation" (PriSe) extension. Using 14 tasks, 5,864 samples, and 9 metrics, it systematically diagnoses the Audio-Visual Intelligence (AVI) of 28 open-source/closed-source Omni-MLLMs and proposes a four-level AVI taxonomy.

Background & Motivation¶

Background: Omni-MLLMs (e.g., GPT-4o, Gemini, Qwen2.5-Omni) can simultaneously process text, vision, and audio, and are regarded as a key step toward human-like Audio-Visual Intelligence (AVI) and AGI. Measuring this progress requires rigorous, structured benchmarks.

Limitations of Prior Work: Most existing benchmarks cover a single modality pairing (MMMU/SEED for vision-language, MMAU for audio-language) and so fail to reflect real-world scenarios in which audio and vision co-occur. Even audio-visual benchmarks such as OmniBench, DailyOmni, and AV-Odyssey mainly add task diversity without a unified, structured framework for evaluating multi-level AVI. The resulting evidence of model capability is fragmented, which makes it hard to diagnose failure modes or to assess alignment with human audio-visual cognition. High scores on isolated tasks do not by themselves indicate progress in general intelligence.

Key Challenge: Evaluation requires "cognitive alignment": models should be tested hierarchically, following the levels of human cognition (perception, integration, reasoning). Existing benchmarks are flat task sets without stratification, and they generally overlook two capabilities that are core to testing spatialized perception and reasoning: audio-visual grounding (localizing sounding objects) and language-referenced cross-modal entity localization.

Goal: To build a benchmark that is both broad and systematic, aligning tasks with different stages of human cognition to enable fine-grained diagnosis of Omni-MLLM capabilities and failure modes.

Key Insight: The benchmark adopts the hierarchical perspective of cognitive science, in which human audio-visual processing progresses through "Perception → Understanding → Reasoning." It also raises a neglected, fundamental question: can models demonstrate the "primitive sensation" (distinguishing color, volume, texture, etc.) that humans perform effortlessly on unfamiliar, low-semantic stimuli, or are they merely fitting patterns from the training distribution?

Core Idea: Reconstruct audio-visual evaluation using three cognitive stages (plus a PriSe extension). Each stage deliberately balances audio-dominant, vision-dominant, and audio-visual collaborative tasks, incorporating neglected grounding tasks to eventually derive a four-level taxonomy for guiding future research.

Method¶

Overall Architecture¶

AVI-Bench is not a model but an evaluation protocol, a dataset, and a taxonomy. It organizes 14 tasks into four stages: Perception → Understanding → Reasoning → Primitive Sensation (PriSe). The first three stages correspond to the progressive levels of human cognition, while PriSe serves as an extension to test out-of-distribution generalization. Within each stage, audio-dominant, vision-dominant, and audio-visual collaborative tasks are balanced to prevent scores from being dominated by a single modality. Finally, the results of evaluating 28 Omni-MLLMs are summarized into a four-level AVI taxonomy.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["Audio-Visual Input<br/>T·A·V·I Multimodal"] --> B["Perception Stage<br/>AMIC·VMIC·AVL·AVM"]
    B --> C["Understanding Stage<br/>AVC·AVR·VAR"]
    C --> D["Reasoning Stage<br/>AVQA·AVLG·AVH·VAH"]
    D --> E["PriSe Extension<br/>ASQA·VSQA·AVSQA<br/>Low-semantic Unfamiliar Stimuli"]
    E --> F["28 Omni-MLLMs Evaluation<br/>9 Metrics·Percentage Norm"]
    F --> G["Four-level AVI Taxonomy<br/>task/modality/stage/domain-adaptive"]

Key Designs¶

1. Hierarchical Evaluation via Three Cognitive Stages: Decomposing AVI Capabilities

Existing benchmarks present tasks in a flat structure, which hides capability hierarchies. AVI-Bench groups tasks into three layers. Perception tests detection and cross-modal alignment of basic semantic entities: Audio/Visual Multi-Instance Classification (AMIC/VMIC), Audio-Visual Localization (AVL, spatial localization of sound sources), and Audio-Visual Matching (AVM). Understanding tests the integration of temporal and semantic dependencies: Audio-Visual Captioning (AVC) and bidirectional Cross-modal Retrieval (AVR/VAR). Reasoning tests higher-order inference: Audio-Visual QA (AVQA), Audio-Visual Language Grounding (AVLG, precise localization via natural language), and Audio/Visual Reference Hallucination (AVH/VAH), which test robustness against cross-modal conflicts. With this stratification, a score is no longer a single aggregate number; it indicates the cognitive level at which a model fails.
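The stage hierarchy can be pictured as a small lookup table. The sketch below is illustrative only: the stage and task abbreviations come from this summary, while `STAGES`, `stage_averages`, `weakest_stage`, and the example scores are hypothetical names and values, not part of any AVI-Bench release.

```python
# Stage -> task abbreviations, as listed in this summary (14 tasks in total).
STAGES = {
    "Perception": ["AMIC", "VMIC", "AVL", "AVM"],
    "Understanding": ["AVC", "AVR", "VAR"],
    "Reasoning": ["AVQA", "AVLG", "AVH", "VAH"],
    "PriSe": ["ASQA", "VSQA", "AVSQA"],
}


def stage_of(task):
    """Return the stage that a task abbreviation belongs to."""
    for stage, tasks in STAGES.items():
        if task in tasks:
            return stage
    raise KeyError(f"unknown task: {task}")


def stage_averages(task_scores):
    """Average per-task scores (0-100) within each stage.

    Stages with no scored task are skipped rather than reported as zero.
    """
    averages = {}
    for stage, tasks in STAGES.items():
        values = [task_scores[t] for t in tasks if t in task_scores]
        if values:
            averages[stage] = sum(values) / len(values)
    return averages


def weakest_stage(task_scores):
    """Name the stage with the lowest average score."""
    averages = stage_averages(task_scores)
    return min(averages, key=averages.get)


# Hypothetical per-task scores, for illustration only.
example_scores = {"AMIC": 80.0, "AVC": 60.0, "AVQA": 70.0, "ASQA": 30.0}
print(weakest_stage(example_scores))
```

Grouping scores this way is what lets a flat list of task results be read as a per-stage diagnosis.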

2. PriSe Extension: Using Low-semantic Stimuli to Distinguish "Pattern Fitting vs. True Perception"

Most Omni-MLLMs are trained on large-scale, semantically rich data, so strong results on such data do not show that they possess human-like "primitive sensation" (e.g., distinguishing brightness, volume, texture, geometry) when semantic context is absent. PriSe uses simple, unfamiliar, low-semantic audio-visual stimuli in three tasks: Audio Sensation QA (ASQA), Visual Sensation QA (VSQA), and Audio-Visual Sensation QA (AVSQA). If a model still succeeds once semantic shortcuts are removed, that is evidence of primitive sensation; if it collapses, its earlier high scores likely stemmed from pattern fitting within the training distribution. With 2,090 samples, this is the largest stage in the benchmark.

3. Modality Balancing & Grounding: Ensuring Fairness and Spatial Coverage

To prevent a stage score from reflecting only a dominant modality, AVI-Bench explicitly balances three task types: audio-dominant (AMIC, VAR, AVH, ASQA), vision-dominant (VMIC, AVR, VAH, VSQA), and collaborative tasks that require heavy audio-visual interaction. It also incorporates two grounding tasks that existing benchmarks often ignore: AVL (spatial localization of sound sources) and AVLG (localization of audio-visual entities via language). Their annotations are obtained by converting dense masks into normalized bounding boxes (708 samples), which makes spatialized perception testable. Across all 14 tasks, about 62% of the samples (3,657 of 5,864) are fully manually constructed; the rest come from mask-to-bbox conversion (708) or from restructuring existing datasets (1,499).
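The mask-to-bbox conversion mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the summary does not specify the mask format or the box convention, so the binary list-of-rows mask, the `(x_min, y_min, x_max, y_max)` ordering, and the function name `mask_to_normalized_bbox` are all hypothetical.

```python
def mask_to_normalized_bbox(mask):
    """Convert a binary mask to a normalized bounding box.

    mask: list of equal-length rows containing 0/1 (assumed format).
    Returns (x_min, y_min, x_max, y_max) with every value in [0, 1],
    or None when the mask has no foreground pixel.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0

    # Indices of rows and columns that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None

    # The +1 makes the box cover the full extent of the last foreground pixel.
    return (
        cols[0] / width,
        rows[0] / height,
        (cols[-1] + 1) / width,
        (rows[-1] + 1) / height,
    )


# A 4x4 mask whose foreground is the central 2x2 block.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_normalized_bbox(example_mask))  # (0.25, 0.25, 0.75, 0.75)
```

Normalizing by width and height keeps the box independent of image resolution, which matters when models resize their inputs.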

Mechanism: Diagnosing a Sample Through Four Stages¶

Consider an audio-visual clip of "a person playing guitar on a street while a car passes by." In the Perception stage, AMIC/VMIC ask which instances appear in the audio and video (guitar sound, engine sound, person, car), AVL requires a bounding box around the sound source (the guitar), and AVM judges whether the audio matches the video. In the Understanding stage, AVC generates a coherent narrative and AVR/VAR perform cross-modal retrieval. In the Reasoning stage, AVQA asks "why did the passerby stop" (which requires global understanding), AVLG requires precisely localizing "the guitar currently making sound," and AVH/VAH present conflicting signals to test for hallucination. Finally, PriSe replaces the content with low-semantic stimuli (e.g., pure color blocks and sine tones) and asks which is brighter or louder.
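To make the PriSe step concrete, the sketch below builds two sine tones and derives a "which is louder" label from their RMS amplitude. It is only an illustration of what a low-semantic audio item could look like: the summary does not describe how PriSe stimuli or labels are actually produced, and `sine_tone`, `rms`, `louder_label`, and all parameter values are hypothetical.

```python
import math


def sine_tone(freq_hz, amplitude, duration_s=0.1, sample_rate=16000):
    """Generate a pure sine tone as a list of float samples."""
    n_samples = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def rms(samples):
    """Root-mean-square amplitude, used here as a simple loudness proxy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def louder_label(tone_a, tone_b):
    """Return "A" if tone_a has the higher RMS, otherwise "B" (ties go to "B")."""
    return "A" if rms(tone_a) > rms(tone_b) else "B"


# Same pitch, different amplitude: the only cue left is loudness.
quiet = sine_tone(440, amplitude=0.2)
loud = sine_tone(440, amplitude=0.8)
print(louder_label(quiet, loud))  # B
```

Because both tones share pitch and duration, no semantic category (speech, music, object sound) is available to fall back on, which is the property that PriSe relies on.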

Key Experimental Results¶

Main Results¶

AVI-Bench evaluates 28 Omni-MLLMs (including closed-source GPT-4o, Gemini series, and open-source Qwen2.5-Omni, Ola, Baichuan-Omni-1.5, ranging from 0.5B to 7B+ parameters). All scores are normalized to a 100-point scale.

Benchmark	Modality	#Task	#Sample	#Metric	#Stage	Grounding
AV-Odyssey	T,A,V,I	7	4,555	2	1	✗
OmniBench	T,A,I	8	1,142	1	1	✗
AVHBench	T,A,V	4	5,302	7	1	✗
AVI-Bench	T,A,V,I	14	5,864	9	4	✓

Representative model scores by stage (percentage, higher is better):

Model	Perception Avg	Understanding Avg	Reasoning Avg	PriSe Avg	Total Avg
Gemini-2.5-pro	54.58	68.97	69.06	36.22	57.21
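The summary does not state how "Total Avg" is computed. As a quick arithmetic check, the unweighted mean of the four stage averages reproduces the reported 57.21; the variable names below are illustrative.

```python
# Stage averages for Gemini-2.5-pro, copied from the table above.
stage_avgs = {
    "Perception": 54.58,
    "Understanding": 68.97,
    "Reasoning": 69.06,
    "PriSe": 36.22,
}

# Unweighted mean over stages (not weighted by sample count).
total_avg = sum(stage_avgs.values()) / len(stage_avgs)
print(round(total_avg, 2))  # 57.21

# Gap between PriSe and the mean of the three cognitive stages.
cognitive = [v for k, v in stage_avgs.items() if k != "PriSe"]
prise_gap = sum(cognitive) / len(cognitive) - stage_avgs["PriSe"]
print(round(prise_gap, 2))
```

An unweighted mean gives each stage equal influence even though PriSe holds the most samples.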

Sample Distribution¶

Samples are distributed across the four stages as follows. PriSe is the largest stage, and VSQA is split into image and video subsets:

Stage	Task (#Samples)	Stage Total
Perception	AMIC(518), VMIC(521), AVL(205), AVM(250)	1,494
Understanding	VAR(264), AVR(264), AVC(280)	808
Reasoning	AVH(250), VAH(250), AVQA(469), AVLG(503)	1,472
PriSe	ASQA(502), VSQA-img(620), VSQA-vid(580), AVSQA(388)	2,090
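The per-task counts in the table can be cross-checked against the totals quoted elsewhere in this summary (5,864 samples overall, 708 grounding samples). The snippet below does only that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Per-task sample counts, copied from the table above.
SAMPLES = {
    "Perception": {"AMIC": 518, "VMIC": 521, "AVL": 205, "AVM": 250},
    "Understanding": {"VAR": 264, "AVR": 264, "AVC": 280},
    "Reasoning": {"AVH": 250, "VAH": 250, "AVQA": 469, "AVLG": 503},
    "PriSe": {"ASQA": 502, "VSQA-img": 620, "VSQA-vid": 580, "AVSQA": 388},
}

stage_totals = {stage: sum(tasks.values()) for stage, tasks in SAMPLES.items()}
grand_total = sum(stage_totals.values())

# AVL and AVLG are the two grounding tasks built by mask-to-bbox conversion.
grounding_total = SAMPLES["Perception"]["AVL"] + SAMPLES["Reasoning"]["AVLG"]

# Share of the benchmark held by each stage, in percent.
stage_share = {
    stage: round(100 * count / grand_total, 1)
    for stage, count in stage_totals.items()
}

print(stage_totals)
print(grand_total, grounding_total)
print(stage_share)
```

The table has 15 rows for 14 tasks because VSQA appears once per subset.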

Key Findings¶

Primitive Sensation is a universal weakness: Even the top-performing Gemini-2.5-pro scores only 36.22 on PriSe, well below its other stage averages (54.58 to 69.06), which suggests that high performance often relies on semantic context rather than core perception.
Grounding is a major hurdle: Tasks that require spatial localization (AVL, AVLG) are significant weaknesses for current Omni-MLLMs, and including them is a distinguishing feature of AVI-Bench.
Hierarchical diagnosis clarifies failure modes: Decomposing scores into cognitive stages allows identification of whether a model fails at perception, understanding, or reasoning levels.
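For the grounding finding, it helps to see how a predicted box might be scored against a reference box. The summary does not say which metric AVI-Bench uses for AVL/AVLG, so the intersection-over-union (IoU) function below is an assumption chosen because it is a common choice for box localization. It is not a description of the benchmark's actual scoring, and the `(x_min, y_min, x_max, y_max)` convention is likewise assumed.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Degenerate boxes with zero area yield a score of 0.
    return intersection / union if union > 0 else 0.0


# Hypothetical normalized boxes: a reference box and a prediction shifted right.
reference = (0.0, 0.0, 0.5, 1.0)
prediction = (0.25, 0.0, 0.75, 1.0)
print(iou(reference, prediction))
```

Under a metric like this, a model that names the right object but places the box poorly still scores low, which is one way localization errors can stay hidden in QA-only benchmarks.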

Highlights & Insights¶

Dual constraints of cognitive hierarchy and modality balancing: This ensures evaluations follow human cognitive levels while preventing single-modality dominance, resulting in more interpretable scores.
PriSe as a "semantic-free" probe is novel: Separating "pattern fitting" from "true perception" using low-semantic stimuli is an effective probe for verifying AVI authenticity.
Inclusion of grounding in audio-visual evaluation fills a gap in spatialized perception/reasoning evaluation, with unified mask-to-bbox annotations facilitating reuse.
Four-level taxonomy provides a structured coordinate system (task/modality/stage/domain-adaptive) for diagnosing Omni-MLLM capabilities.

Limitations & Future Work¶

Lack of improvement methodology: AVI-Bench diagnoses shortcomings (especially in PriSe and grounding) but does not provide a training solution to address them.
Moderate sample size: With 5,864 samples across 14 tasks, the sample size per task (e.g., 205 for AVL) is relatively small, which may affect statistical robustness.
Metric dependency: Scoring for open-ended tasks (AVC, some QA) relies on automated metrics or LLM-as-a-judge, which may deviate from human judgment.

Comparison with Related Work¶

Vs. OmniBench / DailyOmni / AV-Odyssey: These expand task/domain diversity but lack a hierarchical structure; AVI-Bench uses three cognitive stages plus the PriSe extension to create a diagnostic capability map.
Vs. AVHBench / AVTrustBench: These focus exclusively on hallucinations; AVI-Bench treats hallucination (AVH/VAH) as a sub-dimension within the larger Perception → Reasoning framework.
Vs. MMMU / SEED / MMAU: These focus on vision-language or audio-language; AVI-Bench emphasizes joint processing and cross-modal synergy, closer to real human perception.

Rating¶

Novelty: ⭐⭐⭐⭐ (Cognitive stages + PriSe probe + grounding inclusion)
Experimental Thoroughness: ⭐⭐⭐⭐ (28 models × 14 tasks × 9 metrics)
Writing Quality: ⭐⭐⭐⭐ (Clear motivation and stage definitions)
Value: ⭐⭐⭐⭐ (Provides a structured, human-aligned diagnostic framework for the community)