MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning¶
Conference: ICLR 2026 · arXiv: 2603.02024 · Code: Project Page · Area: Multimodal Evaluation Benchmark · Keywords: Multimodal reasoning, multi-image reasoning, real-life scenes, reasoning types, benchmark evaluation
TL;DR¶
This paper introduces MMR-Life, a benchmark comprising 2,646 five-choice multi-image questions based on 19,108 real-life images, covering 7 reasoning types and 21 tasks. It is the first systematic evaluation of MLLMs on multi-image reasoning in real-life scenarios. The strongest model, GPT-5, achieves only 58.69% accuracy, nearly 14 percentage points below the human score of 72.28%. Key findings include the failure of reasoning enhancement methods on large models and the weaker generalization of RL (GRPO) compared to Best-of-N (BoN) sampling.
Background & Motivation¶
- Two dominant paradigms exist for MLLM reasoning evaluation, yet both diverge from everyday reasoning scenarios:
- Knowledge-intensive benchmarks (MMMU, GPQA, etc.): Use expert-level STEM questions; professional knowledge is rarely required in daily reasoning.
- Synthetic/symbolic benchmarks (VisualPuzzles, PuzzleVQA, etc.): Use puzzles or symbolic patterns that are far removed from real visual scenes.
- Multi-image input is severely underrepresented:
- Most multimodal reasoning benchmarks rely on single-image input (MMMU averages 1.05 images per question), inconsistent with how humans actually acquire information from sequences of images.
- Existing multi-image benchmarks either include non-reasoning tasks or cover only a limited set of reasoning types (e.g., spatial reasoning only).
- Core need: A comprehensive benchmark covering diverse reasoning types, grounded in real-life scenes, and supporting multi-image input for MLLM evaluation.
Method¶
Overall Architecture¶
MMR-Life is a multi-image multimodal reasoning evaluation benchmark with the following core design:
- Scale: 2,646 five-choice questions based on 19,108 real-life images.
- Reasoning coverage: 7 reasoning types across 21 subtasks.
- Characteristics: No domain expertise required; questions demand integration of information across multiple images and the application of diverse reasoning skills.
- Average image count: 7.22 images per question, the highest among the compared benchmarks.
Key Designs¶
- Systematic taxonomy of 7 reasoning types:
- Abductive: Inferring the most plausible explanation from observations (307 questions, 11.60%).
- Analogical: Identifying similarities to infer new situations (568 questions, 21.47%).
- Causal: Inferring effects from causes (263 questions, 9.94%).
- Deductive: Deriving specific conclusions from general rules (282 questions, 10.66%).
- Inductive: Generalizing patterns from specific observations (429 questions, 16.21%).
- Spatial: Understanding object positions and spatial relationships (255 questions, 9.64%).
- Temporal: Reasoning about event order and timing (542 questions, 20.48%).
- Data collection pipeline (multi-source + multi-stage quality control):
- Sources: Public image datasets (Kaggle), open web resources (eBird, etc.), public video sources (frame extraction), and existing benchmarks.
- Question generation: Rule-based automatic synthesis (e.g., temporal ordering directly using video frame metadata) and manual annotation (for tasks requiring implicit reasoning such as abductive reasoning).
- Distractor generation: Image-option distractors sampled via heuristic rules; text-option distractors generated by GPT-4o-mini/GPT-4o/Qwen2.5-VL-32B and manually filtered to select the best four incorrect options.
- Three-stage quality control: Difficulty filtering (questions answered correctly by all three small screening models are removed) → Format filtering (ensuring option length/format consistency to prevent shortcuts) → Quality filtering (manual review to exclude ambiguous, multi-answer, or domain-knowledge-dependent questions). A sketch of the difficulty filter appears after this list.
- Option format design:
- Text options: 1,454 questions (54.95%).
- Image options: 1,192 questions (45.05%).
- Mixed formats prevent models from exploiting purely textual or purely visual shortcuts.
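The difficulty-filtering stage lends itself to a compact implementation. Below is a minimal sketch of the idea, assuming a hypothetical `ask_model` helper and placeholder names for the three small screening models (the paper's exact implementation is not specified here):

```python
# Difficulty filtering: drop any question that all three small screening
# models answer correctly, on the assumption that it is too easy to
# discriminate between stronger models.
# `ask_model(model, question)` is a hypothetical helper returning the
# option letter a model picks for a five-choice question.

SCREENING_MODELS = ["small-mllm-a", "small-mllm-b", "small-mllm-c"]  # placeholders

def difficulty_filter(questions: list[dict], ask_model) -> list[dict]:
    kept = []
    for q in questions:
        answers = [ask_model(m, q) for m in SCREENING_MODELS]
        if all(a == q["answer"] for a in answers):
            continue  # unanimously solved by the small models: discard
        kept.append(q)
    return kept
```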
Loss & Training¶
This paper presents an evaluation benchmark and does not involve model training. Evaluation employs a uniform zero-shot CoT prompt; open-source models are evaluated over 5 runs with averaged results to reduce stochastic variance.
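A minimal sketch of this protocol follows, with hypothetical `generate` and `extract_choice` helpers standing in for the unspecified prompt wording and answer parsing:

```python
# Zero-shot CoT evaluation; open-source models are run 5 times and the
# per-run accuracies averaged. `generate(model, images, prompt)` and
# `extract_choice(reply)` are hypothetical stand-ins.

COT_PROMPT = (
    "Answer the following five-choice question about the images. "
    "Think step by step, then state the final option letter."
)  # illustrative wording; the paper's exact prompt is not reproduced here

def evaluate(model, questions: list[dict], runs: int = 5) -> float:
    run_accs = []
    for _ in range(runs):
        correct = 0
        for q in questions:
            reply = generate(model, q["images"], f"{COT_PROMPT}\n{q['text']}")
            if extract_choice(reply) == q["answer"]:
                correct += 1
        run_accs.append(correct / len(questions))
    return sum(run_accs) / len(run_accs)  # mean accuracy over the runs
```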
Experimental Design¶
Evaluated Models¶
| Category | Representative Models | Count |
|---|---|---|
| Closed-source + Thinking | GPT-5, Gemini-2.5-Pro, o4-mini, Claude-Sonnet-4 | 6 |
| Closed-source + No Thinking | GPT-4.1, GPT-4o, Claude-3.7-Sonnet, Doubao-1.5-vision | 5 |
| Open-source + Thinking | VL-Rethinker-72B, QVQ-72B, MM-Eureka-32B, MiMo-VL-7B | 6 |
| Open-source + No Thinking | Qwen2.5-VL-7/32/72B, Gemma3-12/27B, InternVL3.5-8B/30B | 7+ |
| Human | 12 students of varying education levels, 210-question subset | 12 |
Comparison with Existing Benchmarks¶
| Benchmark | Scale | Image Type | Reasoning Types | Knowledge Req. | Avg. Images |
|---|---|---|---|---|---|
| MME-Reasoning | 1.2K | Symbolic | 3 | Low | 1 |
| VisualPuzzles | 1.1K | Symbolic | 5 | Low | 1 |
| MMMU | 11.5K | Mixed | — | High | 1.05 |
| MMRB | 4.8K | Mixed | 3 | Medium | 6.17 |
| MMR-Life | 2.7K | Natural | 7 | Low | 7.22 |
Results & Analysis¶
Main Results (37 Models)¶
| Model | Abductive | Analogical | Causal | Deductive | Inductive | Spatial | Temporal | Avg. |
|---|---|---|---|---|---|---|---|---|
| Human | 79.76 | 57.65 | 75.00 | 70.59 | 63.41 | 79.76 | 79.76 | 72.28 |
| GPT-5 | 53.75 | 78.87 | 41.06 | 80.14 | 78.32 | 17.25 | 41.70 | 58.69 |
| Gemini-2.5-Pro | 54.40 | 73.77 | 36.99 | 79.43 | 73.66 | 25.10 | 35.79 | 56.86 |
| o4-mini | 41.37 | 73.59 | 27.38 | 71.28 | 68.07 | 19.22 | 32.66 | 50.49 |
| Claude-Sonnet-4 | 36.96 | 60.92 | 44.11 | 67.02 | 56.64 | 15.69 | 28.23 | 45.32 |
| GPT-4.1 | 44.30 | 71.30 | 22.43 | 67.38 | 70.16 | 13.73 | 27.31 | 48.15 |
| Qwen2.5-VL-72B | 35.50 | 55.46 | 35.36 | 52.13 | 55.48 | 12.94 | 23.80 | 40.21 |
| VL-Rethinker-72B | 36.48 | 50.88 | 33.08 | 56.03 | 57.58 | 15.69 | 21.59 | 39.68 |
| InternVL3.5-8B | 35.18 | 11.44 | 18.63 | 34.04 | 11.19 | 14.90 | 16.61 | 18.67 |
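Note that the Avg. column is not the plain mean of the seven per-type scores (for GPT-5 that would be ≈55.87); it is consistent with a question-count-weighted average over all 2,646 questions, e.g. for GPT-5:

\[
\frac{307 \cdot 53.75 + 568 \cdot 78.87 + 263 \cdot 41.06 + 282 \cdot 80.14 + 429 \cdot 78.32 + 255 \cdot 17.25 + 542 \cdot 41.70}{2646} \approx 58.69
\]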
Key Findings:
- ⭐⭐⭐ MMR-Life is highly challenging: GPT-5 achieves only 58.69%, trailing the human score of 72.28% by nearly 14 percentage points. Nearly all open-source models fall below 40%, with some (e.g., InternVL3.5-8B at 18.67%) at or below the 20% random-guess baseline.
- ⭐⭐⭐ Large performance disparities across reasoning types: All models perform poorly on spatial reasoning (best model: 25.10% vs. human 79.76%), while some closed-source models surpass humans on analogical and deductive reasoning. Spatial, temporal, and causal reasoning represent significant bottlenecks for current MLLMs.
- ⭐⭐ Open-source thinking models show no improvement: they average 27.15%, below the 29.01% of no-thinking models, indicating that open-source reasoning patterns generalize poorly to real-life scenarios.
Reasoning Paradigm Analysis¶
| Analysis Dimension | Key Finding |
|---|---|
| Thinking length vs. accuracy | Accuracy follows a log-linear relationship with thinking token count, but some open-source thinking models occupy an inefficient region (many tokens, low accuracy). |
| Is long CoT always effective? | No: CoT degrades performance on inductive reasoning while substantially benefiting analogical reasoning, suggesting long CoT suits only tasks that require step-by-step derivation. |
| BoN vs. GRPO | BoN@8 generalizes better than GRPO across all model scales; GRPO on large models even underperforms baseline CoT. |
| Correlation across reasoning types | Analogical and inductive reasoning are highly correlated (Pearson \(r=0.97\)); spatial reasoning shows low correlation with all other types (\(r=0.40\)); clustering reveals higher-order reasoning patterns. |
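The cross-type correlations are computed over per-model accuracy vectors. A sketch with NumPy, using a few rows from the main results table as illustrative data (the paper's values come from all 37 models, so exact \(r\) values will differ):

```python
import numpy as np

# Rows: models; columns: abductive, analogical, causal, deductive,
# inductive, spatial, temporal (accuracies from the main results table).
acc = np.array([
    [53.75, 78.87, 41.06, 80.14, 78.32, 17.25, 41.70],  # GPT-5
    [54.40, 73.77, 36.99, 79.43, 73.66, 25.10, 35.79],  # Gemini-2.5-Pro
    [41.37, 73.59, 27.38, 71.28, 68.07, 19.22, 32.66],  # o4-mini
    [36.96, 60.92, 44.11, 67.02, 56.64, 15.69, 28.23],  # Claude-Sonnet-4
    [35.50, 55.46, 35.36, 52.13, 55.48, 12.94, 23.80],  # Qwen2.5-VL-72B
])

# Pearson correlation between reasoning types, computed across models.
corr = np.corrcoef(acc.T)  # (7, 7) matrix; rows/cols follow column order above
print(f"analogical vs. inductive: r = {corr[1, 4]:.2f}")
print(f"spatial vs. temporal:     r = {corr[5, 6]:.2f}")
```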
Comparison of Reasoning Enhancement Methods¶
| Model | Method | Abductive | Analogical | Causal | Deductive | Inductive | Spatial | Temporal | Avg. (Δ) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | CoT | 26.06 | 35.74 | 20.53 | 20.92 | 38.93 | 9.41 | 12.18 | 24.68 |
| Qwen2.5-VL-7B | BoN@8 | 27.64 | 44.72 | 22.81 | 25.53 | 48.02 | 13.33 | 13.10 | 29.54 (+4.86) |
| Qwen2.5-VL-72B | CoT | 35.50 | 55.46 | 35.36 | 52.13 | 55.48 | 12.94 | 23.80 | 40.21 |
| Qwen2.5-VL-72B | BoN@8 | 34.20 | 53.35 | 32.70 | 51.77 | 56.88 | 13.73 | 24.72 | 39.80 (−0.41) |
| Qwen2.5-VL-72B | GRPO | 36.48 | 50.88 | 33.08 | 56.03 | 57.58 | 15.69 | 21.59 | 39.68 (−0.53) |
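BoN@8 draws 8 candidate responses per question and keeps the best one. The selection criterion is not detailed here; the sketch below assumes a generic `score` function (e.g., a reward model) and reuses the hypothetical `generate` and `extract_choice` helpers from the evaluation sketch above:

```python
def best_of_n(model, question: dict, n: int = 8):
    # Draw n independent CoT samples for the same question
    # (sampling temperature is assumed to be handled inside `generate`).
    candidates = [
        generate(model, question["images"], question["text"]) for _ in range(n)
    ]
    # Keep the candidate the scorer prefers; the choice of `score`
    # (reward model, self-evaluation, ...) is an assumption here.
    best = max(candidates, key=lambda c: score(question, c))
    return extract_choice(best)
```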
Key Findings:
- ⭐⭐⭐ Reasoning enhancement methods fail on large models: As model scale increases from 7B→32B→72B, the gains of SC (self-consistency), BoN, and GRPO over plain CoT shrink monotonically. At the 72B scale, both BoN and GRPO underperform the CoT baseline, likely because large models already sample correct reasoning paths with sufficiently high probability, leaving little marginal headroom for enhancement methods.
- ⭐⭐ RL generalizes less effectively than BoN: Across all model scales, GRPO generalizes more poorly than BoN@8, suggesting that RL-trained models may overfit to their specific training datasets and fail to transfer to real-life reasoning scenarios.
Error Analysis (GPT-5 & Gemini-2.5-Pro)¶
| Error Type | Proportion | Description |
|---|---|---|
| Reasoning errors | 32% | Causal inversion (24%), temporal confusion (42%), missing key steps (24%) |
| Abstraction errors | 17% | Thinking stops short of the needed abstraction; the model fails to associate or generalize across images |
| Knowledge errors | 17% | Failure to retrieve correct commonsense or world knowledge to support reasoning |
| Perception errors | 12% | Failure to recognize static attributes (color, shape) or dynamic changes (motion) |
Highlights & Insights¶
- ⭐⭐⭐ Fills the gap in real-life multi-image reasoning: MMR-Life is the first benchmark simultaneously satisfying "real-life images + multi-image input + 7 reasoning types," closely aligned with everyday reasoning scenarios.
- ⭐⭐⭐ Reveals critical research findings: The failure of reasoning enhancement methods on large models, insufficient generalization of open-source thinking models, and the conditionality of long CoT effectiveness provide important guidance for future research.
- ⭐⭐ Rigorous data quality control: Three-stage filtering (difficulty/format/quality) combined with manual review reduces shortcut exploitation and data contamination risks.
- ⭐⭐ Reasoning type cluster analysis: Correlation analysis and hierarchical clustering reveal the intrinsic structure of reasoning capabilities (e.g., analogical–inductive shared patterns, independence of spatial reasoning).
- ⭐ Large-scale evaluation: Covers 37 models, including state-of-the-art systems such as GPT-5 and Gemini-2.5-Pro.
Limitations & Future Work¶
- ⭐⭐ Relatively limited scale: 2,646 questions, with some reasoning types at only ~255 questions (e.g., spatial); after subdivision into 21 subtasks, per-task sample sizes are small, potentially limiting statistical power.
- ⭐⭐ Multiple-choice format only: The five-choice format carries a non-trivial guessing baseline (20%), precluding evaluation of open-ended reasoning capabilities.
- ⭐ Blurred boundaries between reasoning types: The distinction between abductive and causal reasoning may overlap in practice, and some questions may involve multiple reasoning types simultaneously.
- ⭐ Image source diversity: A relatively high proportion of frames extracted from videos and surveillance footage may not fully represent real-life handheld photography scenarios.
- ⭐ Absence of training signal: The benchmark serves solely as an evaluation resource and provides no training set to guide model improvement on weaker reasoning types.
Summary¶
MMR-Life is the first multimodal multi-image reasoning benchmark targeting real-life scenarios, systematically covering 7 reasoning types and 21 tasks. Through large-scale evaluation of 37 MLLMs, it reveals significant bottlenecks in spatial, temporal, and causal reasoning for current models (GPT-5 achieves only 58.69% vs. human 72.28%), and uncovers key insights including the failure of reasoning enhancement methods on large models and the insufficient generalization of open-source thinking models. This benchmark provides an important foundation for evaluating and advancing next-generation multimodal reasoning systems.