Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional¶
Conference: ICLR2026
arXiv: 2509.23499
Code: To be confirmed
Area: Multimodal VLM
Keywords: Multimodal benchmark evaluation, VQA, modality bias, unimodal shortcuts, MLLM
TL;DR¶
A large-scale empirical study reveals severe unimodal dependency issues across 23 VQA benchmarks — many benchmarks designed to eliminate text bias have instead introduced image bias, with models exploiting unimodal shortcuts rather than performing genuine cross-modal reasoning.
Background & Motivation¶
- Multimodal large language models (MLLMs) have achieved rapidly rising scores on various VQA benchmarks, but do these high scores truly reflect cross-modal understanding?
- Early VQA research identified the text-only bias problem (models answering correctly from the question alone), prompting the community to design a set of "debiased" benchmarks.
- However, whether these debiased benchmarks genuinely resolve the issue — or introduce new biases — has not been systematically quantified.
- A unified framework is needed to measure intra-modality dependency (predictability within a single modality) and inter-modality dependency (necessity of cross-modal interaction) in datasets.
Method¶
Overall Architecture¶
Given a multimodal dataset \(\mathcal{D} = \{(\mathbf{x}_1, \mathbf{x}_2, \mathbf{y})\}\) (image, text, label), MLLM performance is measured under four input conditions to quantify modality dependency: (1) normal paired input; (2) image only (text randomly replaced); (3) text only (image randomly replaced); (4) fully random (both modalities replaced). Performance differences reveal the independent contribution of each modality and the necessity of cross-modal interaction.
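The four input conditions above can be sketched as follows. This is a minimal illustration of constructing the conditions by shuffling one modality against the other; the function name and data layout are illustrative, not taken from the paper's code.

```python
import random

def make_conditions(images, texts, seed=0):
    """Build the four evaluation conditions for a paired (image, text) dataset.

    Shuffling replaces one modality with a randomly drawn sample from the same
    dataset, preserving each modality's marginal distribution while breaking
    the cross-modal pairing.
    """
    rng = random.Random(seed)
    shuffled_images = images[:]
    shuffled_texts = texts[:]
    rng.shuffle(shuffled_images)  # images decoupled from their questions
    rng.shuffle(shuffled_texts)   # questions decoupled from their images
    return {
        "paired":     list(zip(images, texts)),                  # (1) normal input
        "image_only": list(zip(images, shuffled_texts)),         # (2) text randomly replaced
        "text_only":  list(zip(shuffled_images, texts)),         # (3) image randomly replaced
        "random":     list(zip(shuffled_images, shuffled_texts)) # (4) both replaced
    }
```

Accuracy gaps between these conditions (e.g., `image_only` vs. `random`) then estimate each modality's independent contribution.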
Key Designs¶
- Modality Shuffling:
- Zero-masking (e.g., blank images) or perturbation-based approaches are avoided, as they create unnatural out-of-distribution inputs.
- Shuffling preserves the marginal distribution of each modality while disrupting only cross-modal correlations.
- This more accurately quantifies inter-modality dependency without inducing unpredictable model behavior.
- Multi-model Ensemble Voting:
- Analysis based on a single model may be confounded by model-specific inductive biases.
- Majority voting is performed across a diverse set of models: Cambrian-1 at three scales (8B, 13B, 34B), LLaVA-Next, Qwen2.5-VL, and Qwen3-VL.
- This marginalizes single-model bias and yields more robust estimates of dataset-intrinsic dependency.
- Sub-category Granularity Analysis:
- Globally aggregated metrics may obscure unimodal dependency within sub-categories.
- Modality shuffling is applied separately to different sub-categories of the same benchmark (e.g., question types, knowledge domains).
- This reveals that benchmarks appearing "cross-modal" at the global level may still exhibit strong unimodal dependency in specific sub-categories.
- Systematic Evaluation Across 23 VQA Benchmarks: coverage includes general VQA (VizWiz, GQA, MME, etc.), expert VQA (ScienceQA, MathVista, MMMU, etc.), spatial understanding (POPE, RealWorldQA, etc.), and OCR/document understanding (TextVQA, ChartQA, etc.).
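The ensemble-voting design can be sketched as a simple majority vote over per-model answers. This is an assumption-laden illustration of the idea, not the paper's implementation:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the majority answer among several models' predictions for one question.

    Counter.most_common sorts by count and is stable, so ties are broken by
    the first model to produce the answer.
    """
    return Counter(predictions).most_common(1)[0][0]
```

Aggregating this vote per question marginalizes out any single model's inductive bias, so the resulting dependency estimates reflect the dataset rather than one architecture.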
Key Experimental Results¶
Main Results (Modality Dependency Analysis Across 23 VQA Benchmarks)¶
| Dependency Type | Representative Benchmarks | Description |
|---|---|---|
| Cross-modal only | MME, POPE, COCO, V*Bench | Only 4/23 benchmarks genuinely require cross-modal reasoning |
| Text bias present | GQA (+26%), ScienceQA (+17.5%), MMMU (+11.4%) | Models answer correctly without viewing the image |
| Image bias present | MMBench (+41%), SEED, TextVQA, ChartQA | Models answer correctly without viewing the question |
| Dual bias | MMMU-Pro, Q-Bench, MM-Star | Both modalities independently provide shortcuts |
Effect of Model Scale¶
| Model | MMBench Image Bias | MMMU Text Bias | POPE Cross-modal |
|---|---|---|---|
| 8B | Moderate | Moderate | Random level |
| 13B | Increased | Increased | Random level |
| 34B | Highest | Highest | Random level |
| Ensemble | Highest | Highest | Random level |
Cross-model Family Analysis¶
| Model | Release Date | MMBench Image Bias | GQA Text Bias |
|---|---|---|---|
| Cambrian-8B | 2024.06 | Medium | Medium |
| LLaVA-Next | 2024.05 | Medium | Medium |
| Qwen2.5-VL | 2025.04 | High | Medium |
| Qwen3-VL | 2025.11 | Highest | Medium |
Key Findings¶
- Only 4 out of 23 benchmarks genuinely require cross-modal reasoning — the vast majority can be largely solved by unimodal signals alone.
- Many benchmarks designed to eliminate text bias have instead introduced stronger image bias (MMBench image bias: +41%).
- Larger models are more adept at exploiting unimodal shortcuts — the 34B model exhibits the highest image bias on MMBench.
- Modality dependency varies substantially across sub-categories within the same benchmark — global metrics mask unimodal dependency at the sub-category level.
- Consistent bias patterns are observed across different model architectures and generations — the problem lies in dataset design rather than in the models.
Highlights & Insights¶
- The first systematic quantification of modality bias distributions across 23 mainstream VQA benchmarks, with findings that directly address a critical community pain point.
- The finding that "eliminating text bias → introducing image bias" is highly counter-intuitive and serves as an important warning for benchmark design.
- The conclusion that larger models are better at exploiting shortcuts challenges the "scale solves everything" assumption.
- The data spectrum framework is concise and general, and can be extended to other multimodal tasks (audio-visual VQA, multimodal retrieval, etc.).
Limitations & Future Work¶
- The analysis is primarily based on black-box probing (text-only/image-only testing) and lacks examination of internal model mechanisms (e.g., attention, gradients).
- Coverage is limited to the VQA paradigm; multimodal generation, retrieval, and other task formats are not addressed.
- The quantitative metrics of the data spectrum are contingent on the capability ceiling of the specific MLLMs used; different models may yield different spectral distributions.
- No concrete recommendations are provided for designing benchmarks that genuinely require cross-modal reasoning.
Related Work & Insights¶
- VQA bias analysis: VQA-CP, AdVQA — this paper extends point-wise bias analysis to systematic spectral analysis.
- MLLM evaluation: MMBench, MM-Vet — this paper reveals that these evaluations may overestimate cross-modal reasoning ability.
- Dataset design: Balanced VQA, CounterFactual VQA — eliminating one type of bias while inadvertently introducing another.
- Insight: Future multimodal benchmark design should first calibrate the bias distribution on the data spectrum to ensure that cross-modal reasoning is the genuine bottleneck rather than a shortcut.
- Extended reflection: The data spectrum methodology can be generalized to benchmark quality auditing for more modality combinations, such as audio-language and video-language.
- Recommendation: All multimodal benchmark papers should report image-only, text-only, and random baseline scores, rather than reporting only full-input performance.
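The recommended baseline reporting can be summarized with a few accuracy deltas. The metric names below are illustrative shorthand, not the paper's exact definitions:

```python
def modality_bias_report(acc_paired, acc_image_only, acc_text_only, acc_random):
    """Summarize modality bias from the four baseline accuracies (all in [0, 1]).

    - image_bias / text_bias: how far a single modality lifts accuracy above
      the fully-random baseline (a unimodal shortcut if large).
    - cross_modal_gain: what pairing the modalities adds beyond the stronger
      unimodal condition (the part that genuinely requires both inputs).
    """
    return {
        "image_bias": acc_image_only - acc_random,
        "text_bias": acc_text_only - acc_random,
        "cross_modal_gain": acc_paired - max(acc_image_only, acc_text_only),
    }
```

A benchmark that truly requires cross-modal reasoning should show small unimodal biases and a large cross-modal gain.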
Rating¶
- Novelty: ⭐⭐⭐⭐ The data spectrum framework is novel, with counter-intuitive and valuable findings.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 23 benchmarks evaluated across models of multiple scales and families — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure; radar charts and sub-category visualizations are intuitive.
- Value: ⭐⭐⭐⭐⭐ Directly informative for community benchmark design and model evaluation standards.