
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional

Conference: ICLR 2026
arXiv: 2509.23499
Code: To be confirmed
Area: Multimodal VLM
Keywords: Multimodal benchmark evaluation, VQA, modality bias, unimodal shortcuts, MLLM

TL;DR

A large-scale empirical study reveals severe unimodal dependency issues across 23 VQA benchmarks — many benchmarks designed to eliminate text bias have instead introduced image bias, with models exploiting unimodal shortcuts rather than performing genuine cross-modal reasoning.

Background & Motivation

  • Multimodal large language models (MLLMs) have achieved rapidly rising scores on various VQA benchmarks, but do these high scores truly reflect cross-modal understanding?
  • Early VQA research identified the text-only bias problem (models answering correctly from the question alone), prompting the community to design a set of "debiased" benchmarks.
  • However, whether these debiased benchmarks genuinely resolve the issue — or introduce new biases — has not been systematically quantified.
  • A unified framework is needed to measure intra-modality dependency (predictability within a single modality) and inter-modality dependency (necessity of cross-modal interaction) in datasets.

Method

Overall Architecture

Given a multimodal dataset \(\mathcal{D} = \{(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y})\}\) (image, text, label), MLLM performance is measured under four input conditions to quantify modality dependency: (1) normal paired input; (2) image only (texts shuffled across examples); (3) text only (images shuffled across examples); (4) fully random (both modalities shuffled). Performance differences across these conditions reveal the independent contribution of each modality and the necessity of cross-modal interaction; a minimal sketch of the four conditions is given below.
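
A minimal sketch of how the four conditions can be constructed, assuming the dataset is a list of (image, text, label) triples; the function and variable names are illustrative, not the authors' released code:

```python
import random

def make_conditions(dataset, seed=0):
    """Build the four evaluation conditions used to probe modality dependency.

    `dataset` is assumed to be a list of (image, text, label) triples.
    """
    rng = random.Random(seed)
    images, texts, labels = zip(*dataset)

    # Shuffle each modality within the dataset: the marginal distribution of
    # images and texts is preserved, only cross-modal correlation is broken.
    shuf_texts = list(texts)
    rng.shuffle(shuf_texts)
    shuf_images = list(images)
    rng.shuffle(shuf_images)

    return {
        "paired":       list(zip(images,      texts,      labels)),  # (1) normal paired input
        "image_only":   list(zip(images,      shuf_texts, labels)),  # (2) texts shuffled
        "text_only":    list(zip(shuf_images, texts,      labels)),  # (3) images shuffled
        "fully_random": list(zip(shuf_images, shuf_texts, labels)),  # (4) both shuffled
    }
```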

Key Designs

  1. Modality Shuffling:

    • Zero-masking (e.g., blank images) or perturbation-based approaches are avoided, as they create unnatural out-of-distribution inputs.
    • Shuffling preserves the marginal distribution of each modality while disrupting only cross-modal correlations.
    • This more accurately quantifies inter-modality dependency without inducing unpredictable model behavior.
  2. Multi-model Ensemble Voting:

    • Analysis based on a single model may be confounded by model-specific inductive biases.
    • Majority voting is performed across a diverse set of models: Cambrian-1 at three scales (8B, 13B, 34B), LLaVA-Next, Qwen2.5-VL, and Qwen3-VL.
    • This marginalizes single-model bias and yields more robust estimates of dataset-intrinsic dependency (see the voting sketch after this list).
  3. Sub-category Granularity Analysis:

    • Globally aggregated metrics may obscure unimodal dependency within sub-categories.
    • Modality shuffling is applied separately to different sub-categories of the same benchmark (e.g., question types, knowledge domains).
    • This reveals that benchmarks appearing "cross-modal" at the global level may still exhibit strong unimodal dependency in specific sub-categories.
  4. Systematic Evaluation Across 23 VQA Benchmarks: Coverage includes general VQA (VizWiz, GQA, MME, etc.), expert VQA (ScienceQA, MathVista, MMMU, etc.), spatial understanding (POPE, RealWorldQA, etc.), and OCR/document understanding (TextVQA, ChartQA, etc.).
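
A sketch, under the same illustrative assumptions as above, of the majority-voting and per-sub-category scoring steps (designs 2 and 3). The data layout `per_model_preds[condition][model]` is an assumption made for this example, not the paper's interface:

```python
from collections import Counter
from typing import Dict, List

def majority_vote(answers: List[str]) -> str:
    """Return the most common answer across models (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(preds: List[str], labels: List[str]) -> float:
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def dependency_profile(per_model_preds: Dict[str, Dict[str, List[str]]],
                       labels: List[str],
                       subcats: List[str]) -> Dict[str, Dict[str, float]]:
    """Ensemble-vote accuracy under each input condition, per sub-category.

    per_model_preds[condition][model] holds one model's answers for that
    condition; subcats[i] is the sub-category of example i.
    """
    profile: Dict[str, Dict[str, float]] = {}
    for cond, by_model in per_model_preds.items():
        # Majority vote across models marginalizes single-model inductive biases.
        voted = [majority_vote([by_model[m][i] for m in by_model])
                 for i in range(len(labels))]
        for cat in set(subcats):
            idx = [i for i, c in enumerate(subcats) if c == cat]
            profile.setdefault(cat, {})[cond] = accuracy(
                [voted[i] for i in idx], [labels[i] for i in idx])
    return profile
```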

Key Experimental Results

Main Results (Modality Dependency Analysis Across 23 VQA Benchmarks)

| Dependency Type | Representative Benchmarks | Description |
|---|---|---|
| Cross-modal only | MME, POPE, COCO, V*Bench | Only 4/23 benchmarks genuinely require cross-modal reasoning |
| Text bias present | GQA (+26%), ScienceQA (+17.5%), MMMU (+11.4%) | Models answer correctly without viewing the image |
| Image bias present | MMBench (+41%), SEED, TextVQA, ChartQA | Models answer correctly without viewing the question |
| Dual bias | MMMU-Pro, Q-Bench, MM-Star | Both modalities independently provide shortcuts |
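
The "+X%" figures are not formally defined in these notes; one plausible reading, consistent with the four-condition setup, is the accuracy gain of the unimodal (shuffled) condition over the fully random baseline:

\[
\text{TextBias} \approx \mathrm{Acc}_{\text{text-only}} - \mathrm{Acc}_{\text{random}}, \qquad
\text{ImageBias} \approx \mathrm{Acc}_{\text{image-only}} - \mathrm{Acc}_{\text{random}}
\]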

Effect of Model Scale

| Model | MMBench Image Bias | MMMU Text Bias | POPE Cross-modal |
|---|---|---|---|
| 8B | Moderate | Moderate | Random level |
| 13B | Increased | Increased | Random level |
| 34B | Highest | Highest | Random level |
| Ensemble | Highest | Highest | Random level |

Cross-model Type Analysis

| Model | Release Date | MMBench Image Bias | GQA Text Bias |
|---|---|---|---|
| Cambrian-8B | 2024.06 | Medium | Medium |
| LLaVA-Next | 2024.05 | Medium | Medium |
| Qwen2.5-VL | 2025.04 | High | Medium |
| Qwen3-VL | 2025.11 | Highest | Medium |

Key Findings

  • Only 4 out of 23 benchmarks genuinely require cross-modal reasoning — the vast majority can be largely solved by unimodal signals alone.
  • Many benchmarks designed to eliminate text bias have instead introduced stronger image bias (MMBench image bias: +41%).
  • Larger models are more adept at exploiting unimodal shortcuts — the 34B model exhibits the highest image bias on MMBench.
  • Modality dependency varies substantially across sub-categories within the same benchmark — global metrics mask unimodal dependency at the sub-category level.
  • Consistent bias patterns are observed across different model architectures and generations — the problem lies in dataset design rather than in the models.

Highlights & Insights

  • The first systematic quantification of modality bias distributions across 23 mainstream VQA benchmarks, with findings that directly address a critical community pain point.
  • The finding that "eliminating text bias → introducing image bias" is highly counter-intuitive and serves as an important warning for benchmark design.
  • The conclusion that larger models are better at exploiting shortcuts challenges the "scale solves everything" assumption.
  • The data spectrum framework is concise and general, and can be extended to other multimodal tasks (audio-visual VQA, multimodal retrieval, etc.).

Limitations & Future Work

  • The analysis is primarily based on black-box probing (text-only/image-only testing) and lacks examination of internal model mechanisms (e.g., attention, gradients).
  • Coverage is limited to the VQA paradigm; multimodal generation, retrieval, and other task formats are not addressed.
  • The quantitative metrics of the data spectrum are contingent on the capability ceiling of the specific MLLMs used; different models may yield different spectral distributions.
  • No concrete recommendations are provided for designing benchmarks that genuinely require cross-modal reasoning.

Related Work & Takeaways

  • VQA bias analysis: VQA-CP, AdVQA — this paper extends point-wise bias analysis to systematic spectral analysis.
  • MLLM evaluation: MMBench, MM-Vet — this paper reveals that these evaluations may overestimate cross-modal reasoning ability.
  • Dataset design: Balanced VQA, CounterFactual VQA — eliminating one type of bias while inadvertently introducing another.
  • Insight: Future multimodal benchmark design should first calibrate the bias distribution on the data spectrum to ensure that cross-modal reasoning is the genuine bottleneck rather than a shortcut.
  • Extended reflection: The data spectrum methodology can be generalized to benchmark quality auditing for more modality combinations, such as audio-language and video-language.
  • Recommendation: All multimodal benchmark papers should report image-only, text-only, and random baseline scores, rather than reporting only full-input performance.

Rating

  • Novelty: ⭐⭐⭐⭐ The data spectrum framework is novel, with counter-intuitive and valuable findings.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 23 benchmarks evaluated with models across multiple scales and families — extremely comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure; radar charts and sub-category visualizations are intuitive.
  • Value: ⭐⭐⭐⭐⭐ Directly informative for community benchmark design and model evaluation standards.