MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning¶
Conference: ICLR 2026 · arXiv: 2603.02024 · Code: Project Page · Area: Multimodal Evaluation Benchmark · Keywords: Multimodal reasoning, multi-image reasoning, real-life scenes, reasoning types, benchmark evaluation
TL;DR¶
This paper introduces MMR-Life, a benchmark comprising 2,646 five-choice multi-image questions based on 19,108 real-life images, covering 7 reasoning types and 21 tasks. It is the first systematic evaluation of MLLMs on multi-image reasoning in real-life scenarios. The strongest model, GPT-5, achieves only 58.69% accuracy, nearly 14 percentage points below the human score of 72.28%. Key findings include the failure of reasoning enhancement methods on large models and the weaker generalization of RL (GRPO) compared to Best-of-N (BoN) sampling.
Background & Motivation¶
- Two dominant paradigms exist for MLLM reasoning evaluation, yet both diverge from everyday reasoning scenarios:
- Knowledge-intensive benchmarks (MMMU, GPQA, etc.): Use expert-level STEM questions; professional knowledge is rarely required in daily reasoning.
- Synthetic/symbolic benchmarks (VisualPuzzles, PuzzleVQA, etc.): Use puzzles or symbolic patterns that are far removed from real visual scenes.
- Multi-image input is severely underrepresented:
- Most multimodal reasoning benchmarks rely on single-image input (MMMU averages 1.05 images per question), inconsistent with how humans actually acquire information from sequences of images.
- Existing multi-image benchmarks either include non-reasoning tasks or cover only a limited set of reasoning types (e.g., spatial reasoning only).
- Core need: A comprehensive benchmark covering diverse reasoning types, grounded in real-life scenes, and supporting multi-image input for MLLM evaluation.
Method¶
Overall Architecture¶
MMR-Life is a multi-image multimodal reasoning evaluation benchmark with the following core design:
- Scale: 2,646 five-choice questions based on 19,108 real-life images.
- Reasoning coverage: 7 reasoning types across 21 subtasks.
- Characteristics: No domain expertise required; questions demand integration of information across multiple images and the application of diverse reasoning skills.
- Average image count: 7.22 images per question, the highest among the compared benchmarks.
Key Designs¶
- Systematic taxonomy of 7 reasoning types:
- Abductive: Inferring the most plausible explanation from observations (307 questions, 11.60%).
- Analogical: Identifying similarities to infer new situations (568 questions, 21.47%).
- Causal: Inferring effects from causes (263 questions, 9.94%).
- Deductive: Deriving specific conclusions from general rules (282 questions, 10.66%).
- Inductive: Generalizing patterns from specific observations (429 questions, 16.21%).
- Spatial: Understanding object positions and spatial relationships (255 questions, 9.64%).
- Temporal: Reasoning about event order and timing (542 questions, 20.48%).
- Data collection pipeline (multi-source + multi-stage quality control):
- Sources: Public image datasets (Kaggle), open web resources (eBird, etc.), public video sources (frame extraction), and existing benchmarks.
- Question generation: Rule-based automatic synthesis (e.g., temporal ordering directly using video frame metadata) and manual annotation (for tasks requiring implicit reasoning such as abductive reasoning).
- Distractor generation: Image-option distractors sampled via heuristic rules; text-option distractors generated by GPT-4o-mini/GPT-4o/Qwen2.5-VL-32B and manually filtered to select the best four incorrect options.
- Three-stage quality control: Difficulty filtering (questions answered correctly by all three small screening models are removed) → Format filtering (ensuring option length/format consistency to prevent shortcuts) → Quality filtering (manual review to exclude ambiguous, multi-answer, or domain-knowledge-dependent questions). A sketch of the difficulty filter appears after this list.
- Option format design:
- Text options: 1,454 questions (54.95%).
- Image options: 1,192 questions (45.05%).
- Mixed formats prevent models from exploiting purely textual or purely visual shortcuts.
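The difficulty-filtering stage lends itself to a compact implementation. Below is a minimal sketch of the idea, assuming a hypothetical `ask_model` helper and placeholder names for the three small screening models (the paper's exact implementation is not specified here):

```python
# Difficulty filtering: drop any question that all three small screening
# models answer correctly, on the assumption that it is too easy to
# discriminate between stronger models.
# `ask_model(model, question)` is a hypothetical helper returning the
# option letter a model picks for a five-choice question.

SCREENING_MODELS = ["small-mllm-a", "small-mllm-b", "small-mllm-c"]  # placeholders

def difficulty_filter(questions: list[dict], ask_model) -> list[dict]:
    kept = []
    for q in questions:
        answers = [ask_model(m, q) for m in SCREENING_MODELS]
        if all(a == q["answer"] for a in answers):
            continue  # unanimously solved by the small models: discard
        kept.append(q)
    return kept
```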
Loss & Training¶
This paper presents an evaluation benchmark and does not involve model training. Evaluation employs a uniform zero-shot CoT prompt; open-source models are evaluated over 5 runs with averaged results to reduce stochastic variance.
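A minimal sketch of this protocol follows, with hypothetical `generate` and `extract_choice` helpers standing in for the unspecified prompt wording and answer parsing:

```python
# Zero-shot CoT evaluation; open-source models are run 5 times and the
# per-run accuracies averaged. `generate(model, images, prompt)` and
# `extract_choice(reply)` are hypothetical stand-ins.

COT_PROMPT = (
    "Answer the following five-choice question about the images. "
    "Think step by step, then state the final option letter."
)  # illustrative wording; the paper's exact prompt is not reproduced here

def evaluate(model, questions: list[dict], runs: int = 5) -> float:
    run_accs = []
    for _ in range(runs):
        correct = 0
        for q in questions:
            reply = generate(model, q["images"], f"{COT_PROMPT}\n{q['text']}")
            if extract_choice(reply) == q["answer"]:
                correct += 1
        run_accs.append(correct / len(questions))
    return sum(run_accs) / len(run_accs)  # mean accuracy over the runs
```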
Experimental Design¶
Evaluated Models¶
| Category | Representative Models | Count |
|---|---|---|
| Closed-source + Thinking | GPT-5, Gemini-2.5-Pro, o4-mini, Claude-Sonnet-4 | 6 |
| Closed-source + No Thinking | GPT-4.1, GPT-4o, Claude-3.7-Sonnet, Doubao-1.5-vision | 5 |
| Open-source + Thinking | VL-Rethinker-72B, QVQ-72B, MM-Eureka-32B, MiMo-VL-7B | 6 |
| Open-source + No Thinking | Qwen2.5-VL-7/32/72B, Gemma3-12/27B, InternVL3.5-8B/30B | 7+ |
| Human | 12 students of varying education levels, 210-question subset | 12 |
Comparison with Existing Benchmarks¶
| Benchmark | Scale | Image Type | Reasoning Types | Knowledge Req. | Avg. Images |
|---|---|---|---|---|---|
| MME-Reasoning | 1.2K | Symbolic | 3 | Low | 1 |
| VisualPuzzles | 1.1K | Symbolic | 5 | Low | 1 |
| MMMU | 11.5K | Mixed | — | High | 1.05 |
| MMRB | 4.8K | Mixed | 3 | Medium | 6.17 |
| MMR-Life | 2.7K | Natural | 7 | Low | 7.22 |
Results & Analysis¶
Main Results (37 Models)¶
| Model | Abductive | Analogical | Causal | Deductive | Inductive | Spatial | Temporal | Avg. |
|---|---|---|---|---|---|---|---|---|
| Human | 79.76 | 57.65 | 75.00 | 70.59 | 63.41 | 79.76 | 79.76 | 72.28 |
| GPT-5 | 53.75 | 78.87 | 41.06 | 80.14 | 78.32 | 17.25 | 41.70 | 58.69 |
| Gemini-2.5-Pro | 54.40 | 73.77 | 36.99 | 79.43 | 73.66 | 25.10 | 35.79 | 56.86 |
| o4-mini | 41.37 | 73.59 | 27.38 | 71.28 | 68.07 | 19.22 | 32.66 | 50.49 |
| Claude-Sonnet-4 | 36.96 | 60.92 | 44.11 | 67.02 | 56.64 | 15.69 | 28.23 | 45.32 |
| GPT-4.1 | 44.30 | 71.30 | 22.43 | 67.38 | 70.16 | 13.73 | 27.31 | 48.15 |
| Qwen2.5-VL-72B | 35.50 | 55.46 | 35.36 | 52.13 | 55.48 | 12.94 | 23.80 | 40.21 |
| VL-Rethinker-72B | 36.48 | 50.88 | 33.08 | 56.03 | 57.58 | 15.69 | 21.59 | 39.68 |
| InternVL3.5-8B | 35.18 | 11.44 | 18.63 | 34.04 | 11.19 | 14.90 | 16.61 | 18.67 |
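Note that the Avg. column is not the plain mean of the seven per-type scores (for GPT-5 that would be ≈55.87); it is consistent with a question-count-weighted average over all 2,646 questions, e.g. for GPT-5:

\[
\frac{307 \cdot 53.75 + 568 \cdot 78.87 + 263 \cdot 41.06 + 282 \cdot 80.14 + 429 \cdot 78.32 + 255 \cdot 17.25 + 542 \cdot 41.70}{2646} \approx 58.69
\]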
Key Findings:
- ⭐⭐⭐ MMR-Life is highly challenging: GPT-5 achieves only 58.69%, trailing the human score of 72.28% by nearly 14 percentage points. Nearly all open-source models fall below 40%, with some (e.g., InternVL3.5-8B at 18.67%) at or below the 20% random-guess baseline.
- ⭐⭐⭐ Large performance disparities across reasoning types: All models perform poorly on spatial reasoning (best model: 25.10% vs. human 79.76%), while some closed-source models surpass humans on analogical and deductive reasoning. Spatial, temporal, and causal reasoning represent significant bottlenecks for current MLLMs.
- ⭐⭐ Open-source thinking models show no improvement: they average 27.15%, below the 29.01% of no-thinking models, indicating that open-source reasoning patterns generalize poorly to real-life scenarios.
Reasoning Paradigm Analysis¶
| Analysis Dimension | Key Finding |
|---|---|
| Thinking length vs. accuracy | Accuracy follows a log-linear relationship with thinking token count, but some open-source thinking models occupy an inefficient region (many tokens, low accuracy). |
| Is long CoT always effective? | No: CoT degrades performance on inductive reasoning while substantially benefiting analogical reasoning, suggesting long CoT suits only tasks that require step-by-step derivation. |
| BoN vs. GRPO | BoN@8 generalizes better than GRPO across all model scales; GRPO on large models even underperforms baseline CoT. |
| Correlation across reasoning types | Analogical and inductive reasoning are highly correlated (Pearson \(r=0.97\)); spatial reasoning shows low correlation with all other types (\(r=0.40\)); clustering reveals higher-order reasoning patterns. |
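The cross-type correlations are computed over per-model accuracy vectors. A sketch with NumPy, using a few rows from the main results table as illustrative data (the paper's values come from all 37 models, so exact \(r\) values will differ):

```python
import numpy as np

# Rows: models; columns: abductive, analogical, causal, deductive,
# inductive, spatial, temporal (accuracies from the main results table).
acc = np.array([
    [53.75, 78.87, 41.06, 80.14, 78.32, 17.25, 41.70],  # GPT-5
    [54.40, 73.77, 36.99, 79.43, 73.66, 25.10, 35.79],  # Gemini-2.5-Pro
    [41.37, 73.59, 27.38, 71.28, 68.07, 19.22, 32.66],  # o4-mini
    [36.96, 60.92, 44.11, 67.02, 56.64, 15.69, 28.23],  # Claude-Sonnet-4
    [35.50, 55.46, 35.36, 52.13, 55.48, 12.94, 23.80],  # Qwen2.5-VL-72B
])

# Pearson correlation between reasoning types, computed across models.
corr = np.corrcoef(acc.T)  # (7, 7) matrix; rows/cols follow column order above
print(f"analogical vs. inductive: r = {corr[1, 4]:.2f}")
print(f"spatial vs. temporal:     r = {corr[5, 6]:.2f}")
```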
Comparison of Reasoning Enhancement Methods¶
| Model | Method | Abductive | Analogical | Causal | Deductive | Inductive | Spatial | Temporal | Avg. (Δ) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | CoT | 26.06 | 35.74 | 20.53 | 20.92 | 38.93 | 9.41 | 12.18 | 24.68 |
| Qwen2.5-VL-7B | BoN@8 | 27.64 | 44.72 | 22.81 | 25.53 | 48.02 | 13.33 | 13.10 | 29.54 (+4.86) |
| Qwen2.5-VL-72B | CoT | 35.50 | 55.46 | 35.36 | 52.13 | 55.48 | 12.94 | 23.80 | 40.21 |
| Qwen2.5-VL-72B | BoN@8 | 34.20 | 53.35 | 32.70 | 51.77 | 56.88 | 13.73 | 24.72 | 39.80 (−0.41) |
| Qwen2.5-VL-72B | GRPO | 36.48 | 50.88 | 33.08 | 56.03 | 57.58 | 15.69 | 21.59 | 39.68 (−0.53) |
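BoN@8 draws 8 candidate responses per question and keeps the best one. The selection criterion is not detailed here; the sketch below assumes a generic `score` function (e.g., a reward model) and reuses the hypothetical `generate` and `extract_choice` helpers from the evaluation sketch above:

```python
def best_of_n(model, question: dict, n: int = 8):
    # Draw n independent CoT samples for the same question
    # (sampling temperature is assumed to be handled inside `generate`).
    candidates = [
        generate(model, question["images"], question["text"]) for _ in range(n)
    ]
    # Keep the candidate the scorer prefers; the choice of `score`
    # (reward model, self-evaluation, ...) is an assumption here.
    best = max(candidates, key=lambda c: score(question, c))
    return extract_choice(best)
```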
Key Findings:
- ⭐⭐⭐ Reasoning enhancement methods fail on large models: As model scale increases from 7B→32B→72B, the gains of SC (self-consistency), BoN, and GRPO over plain CoT shrink monotonically. At the 72B scale, both BoN and GRPO underperform the CoT baseline, likely because large models already sample correct reasoning paths with sufficiently high probability, leaving little marginal headroom for enhancement methods.
- ⭐⭐ RL generalizes less effectively than BoN: Across all model scales, GRPO generalizes more poorly than BoN@8, suggesting that RL-trained models may overfit to their specific training datasets and fail to transfer to real-life reasoning scenarios.
Error Analysis (GPT-5 & Gemini-2.5-Pro)¶
| Error Type | Proportion | Description |
|---|---|---|
| Reasoning errors | 32% | Causal inversion (24%), temporal confusion (42%), missing key steps (24%) |
| Abstraction errors | 17% | Thinking stops short of the needed abstraction; the model fails to associate or generalize across images |
| Knowledge errors | 17% | Failure to retrieve correct commonsense or world knowledge to support reasoning |
| Perception errors | 12% | Failure to recognize static attributes (color, shape) or dynamic changes (motion) |
Highlights & Insights¶
- ⭐⭐⭐ Fills the gap in real-life multi-image reasoning: MMR-Life is the first benchmark simultaneously satisfying "real-life images + multi-image input + 7 reasoning types," closely aligned with everyday reasoning scenarios.
- ⭐⭐⭐ Reveals critical research findings: The failure of reasoning enhancement methods on large models, insufficient generalization of open-source thinking models, and the conditionality of long CoT effectiveness provide important guidance for future research.
- ⭐⭐ Rigorous data quality control: Three-stage filtering (difficulty/format/quality) combined with manual review reduces shortcut exploitation and data contamination risks.
- ⭐⭐ Reasoning type cluster analysis: Correlation analysis and hierarchical clustering reveal the intrinsic structure of reasoning capabilities (e.g., analogical–inductive shared patterns, independence of spatial reasoning).
- ⭐ Large-scale evaluation: Covers 37 models, including state-of-the-art systems such as GPT-5 and Gemini-2.5-Pro.
Limitations & Future Work¶
- ⭐⭐ Relatively limited scale: 2,646 questions, with some reasoning types at only ~255 questions (e.g., spatial); after subdivision into 21 subtasks, per-task sample sizes are small, potentially limiting statistical power.
- ⭐⭐ Multiple-choice format only: The five-choice format carries a non-trivial guessing baseline (20%), precluding evaluation of open-ended reasoning capabilities.
- ⭐ Blurred boundaries between reasoning types: The distinction between abductive and causal reasoning may overlap in practice, and some questions may involve multiple reasoning types simultaneously.
- ⭐ Image source diversity: A relatively high proportion of frames extracted from videos and surveillance footage may not fully represent real-life handheld photography scenarios.
- ⭐ Absence of training signal: The benchmark serves solely as an evaluation resource and provides no training set to guide model improvement on weaker reasoning types.
Summary¶
MMR-Life is the first multimodal multi-image reasoning benchmark targeting real-life scenarios, systematically covering 7 reasoning types and 21 tasks. Through large-scale evaluation of 37 MLLMs, it reveals significant bottlenecks in spatial, temporal, and causal reasoning for current models (GPT-5 achieves only 58.69% vs. human 72.28%), and uncovers key insights including the failure of reasoning enhancement methods on large models and the insufficient generalization of open-source thinking models. This benchmark provides an important foundation for evaluating and advancing next-generation multimodal reasoning systems.