# MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

- Conference: CVPR 2026
- arXiv: 2602.19497
- Code: https://github.com/Angusliuuu/MICON-Bench
- Area: Image Generation / Multimodal Evaluation
- Keywords: Multi-image context generation, unified multimodal models, benchmark, dynamic attention rebalancing, checkpoint evaluation
## TL;DR
This paper proposes MICON-Bench, a multi-image context generation benchmark covering 6 tasks (1,043 cases), paired with an MLLM-driven Evaluation-by-Checkpoint automated assessment framework. It further introduces DAR (Dynamic Attention Rebalancing), a training-free mechanism that improves generation consistency and quality in unified multimodal models (UMMs) by dynamically adjusting attention weights at inference time.
## Background & Motivation

- Background: UMMs can process multi-image inputs and produce contextually consistent visual outputs; representative models include Nano-Banana, GPT-Image, BAGEL, and OmniGen2. However, their multi-image context generation capabilities lack systematic evaluation.
- Evaluation Gap: Existing benchmarks (GenEval, T2I-CompBench, ImgEdit-Bench) primarily assess text-to-image generation or single-image editing, and do not address cross-image consistency or complex visual relational reasoning. OmniContext involves multiple images but is limited to simple subject composition.
- Limitations of Prior Work: UMMs tend to distribute attention uniformly across all regions of all reference images, including irrelevant ones, leading to hallucinations and inconsistencies.
- Core Idea: (a) six standardized tasks coupled with a verifiable checkpoint evaluation system; (b) attention rebalancing to redirect focus at inference time.
## Method

### MICON-Bench Benchmark Design
6 Tasks (5 compositional + 1 complex reasoning):
| Task | Description | Cases | Ref. Images |
|---|---|---|---|
| Object Composition | Single subject + background composition | 200 | 2–3 |
| Spatial Composition | Multi-object spatial relationship constraints | 200 | 2–3 |
| Attribute Disentanglement | Subject/style/background decoupled recombination | 100 | 3 |
| Component Transfer | Part/accessory transfer across images | 240 | 2–3 |
| FG/BG Composition | Foreground + background fusion | 200 | 2 |
| Story Generation | Causal reasoning story continuation | 103 | 2–3 |
| Total | | 1,043 | 2,518 |
### Evaluation-by-Checkpoint Framework
- Each case has verifiable checkpoints spanning seven dimensions: instruction following, identity consistency, structure, cross-reference consistency, causality, text anchoring, and overall usability.
- An MLLM (Qwen3-VL-32B) serves as the verifier, judging each checkpoint as pass/fail; the final score is the mean pass rate (see the scoring sketch after this list).
- The Story Generation task additionally employs a predefined answer set to evaluate reasoning ability.
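To make the scoring rule concrete, here is a minimal sketch of how a case score could be computed from per-checkpoint verdicts. The record schema and field names are illustrative assumptions; the paper specifies only the per-checkpoint pass/fail judgment and the mean pass rate.

```python
# Hypothetical checkpoint records for one benchmark case; the schema is an
# assumption, only the pass/fail-then-average rule comes from the paper.
checkpoints = [
    {"dimension": "instruction_following",        "passed": True},
    {"dimension": "identity_consistency",         "passed": True},
    {"dimension": "structure",                    "passed": True},
    {"dimension": "cross_reference_consistency",  "passed": False},
    {"dimension": "overall_usability",            "passed": True},
]

# Case score = mean pass rate over its verifiable checkpoints.
score = sum(c["passed"] for c in checkpoints) / len(checkpoints)
print(f"case score: {score:.2f}")  # 0.80
```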
### Dynamic Attention Rebalancing (DAR)

- Problem Diagnosis: UMM attention indiscriminately attends to irrelevant regions in reference images, causing hallucinations.
- Efficient Attention Analysis:
  - Uniformly sample \(m \ll L_q\) query tokens (default \(m=64\)) and compute attention maps against reference image key tokens.
  - Total attention score: \(r_k = \sum_{i=1}^{m}\sum_{h=1}^{H} \tilde{A}_{i,h,k}\), where \(\tilde{A}_{i,h,k}\) is the attention from sampled query \(i\) in head \(h\) to reference key token \(k\).
  - Min-max normalization of \(r_k\) yields \(\hat{r}_k \in [0, 1]\).
- Dynamic Weight Adjustment:
  - Dual-threshold three-class partition: \(w_k = 1+\gamma\) if \(\hat{r}_k \geq \tau_{high}\), \(w_k = 1-\gamma\) if \(\hat{r}_k \leq \tau_{low}\), and \(w_k = 1\) otherwise.
  - Adjusted attention: \(A = \text{softmax}\left(\frac{Q(w \odot K_{ref})^\top}{\sqrt{d}}\right)\)
  - Defaults: \(\gamma=0.15\), \(\tau_{high}=0.7\), \(\tau_{low}=0.3\). For example, a key with \(\hat{r}_k = 0.85\) gets \(w_k = 1.15\), while one with \(\hat{r}_k = 0.10\) gets \(w_k = 0.85\).
- Design Advantages: Training-free, plug-and-play, and computationally negligible (only 64 query tokens sampled); see the sketch after this list.
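Below is a minimal PyTorch sketch of the DAR procedure as described above, assuming per-head tensors of a single attention layer. The function name `dar_attention` and the exact tensor layout are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dar_attention(q, k_ref, v_ref, m=64, gamma=0.15, tau_high=0.7, tau_low=0.3):
    """Dynamic Attention Rebalancing over reference-image tokens.

    q:            (H, L_q, d) query tokens (all heads of one attention layer)
    k_ref, v_ref: (H, L_k, d) key/value tokens of the reference images
    """
    H, L_q, d = q.shape
    scale = d ** -0.5

    # 1) Efficient attention analysis: uniformly sample m << L_q query tokens.
    idx = torch.linspace(0, L_q - 1, steps=min(m, L_q)).long()
    a_tilde = F.softmax(q[:, idx] @ k_ref.transpose(-2, -1) * scale, dim=-1)  # (H, m, L_k)

    # 2) Total attention score per reference key: r_k = sum_i sum_h A~_{i,h,k}.
    r = a_tilde.sum(dim=(0, 1))  # (L_k,)

    # 3) Min-max normalization to r_hat_k in [0, 1].
    r_hat = (r - r.min()) / (r.max() - r.min() + 1e-8)

    # 4) Dual-threshold three-class partition of key weights.
    w = torch.ones_like(r_hat)
    w[r_hat >= tau_high] = 1.0 + gamma  # amplify strongly attended keys
    w[r_hat <= tau_low] = 1.0 - gamma   # suppress weakly attended keys

    # 5) Recompute attention with the reweighted reference keys (w ⊙ K_ref).
    attn = F.softmax(q @ (w[None, :, None] * k_ref).transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref
```

In a full UMM the key/value sequence would also contain text and target-image tokens; the sketch reweights only the reference-image keys, matching the \(w \odot K_{ref}\) formulation above.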
## Key Experimental Results

### Main Results: MICON-Bench Per-Task Scores
| Model | Object | Spatial | Attribute | Component | FG/BG | Story | Avg↑ |
|---|---|---|---|---|---|---|---|
| Nano-Banana | 95.60 | 93.79 | 92.13 | 84.23 | 83.13 | 82.84 | 89.25 |
| GPT-Image | 96.45 | 94.41 | 93.39 | 87.69 | 85.99 | 91.51 | 90.15 |
| UNO | 58.40 | 66.68 | 65.28 | 28.84 | 20.96 | 39.08 | 44.76 |
| DreamOmni2 | 88.24 | 84.76 | 85.28 | 59.64 | 76.16 | 59.58 | 75.56 |
| BAGEL | 87.64 | 89.96 | 89.84 | 52.40 | 64.64 | 65.09 | 73.55 |
| BAGEL + DAR | 88.04 | 91.88 | 90.76 | 56.06 | 71.24 | 66.34 | 76.31 |
| OmniGen2 | 89.52 | 80.32 | 81.64 | 44.76 | 57.96 | 60.96 | 67.83 |
| OmniGen2 + DAR | 89.84 | 81.00 | 82.12 | 48.72 | 59.28 | 60.73 | 69.21 |
### OmniContext Benchmark
| Method | SINGLE Char/Obj | MULTIPLE Char/Obj | SCENE Char/Obj | Avg↑ |
|---|---|---|---|---|
| OmniGen2 | 8.18/7.33 | 6.56/7.99 | 6.87/7.90 | 7.53 |
| OmniGen2+DAR | 8.30/8.19 | 6.64/8.42 | 7.06/7.97 | 7.77 |
| BAGEL | 5.71/6.22 | 3.03/6.90 | 4.24/5.16 | 5.54 |
| BAGEL+DAR | 6.26/6.08 | 4.14/7.18 | 4.78/4.84 | 5.80 |
### XVerseBench Benchmark
| Method | Single-Subject Avg↑ | Multi-Subject Avg↑ | Overall↑ |
|---|---|---|---|
| OmniGen2 | 52.53 | 49.76 | 51.14 |
| OmniGen2+DAR | 53.24 | 50.23 | 51.73 |
| BAGEL | 47.91 | 42.62 | 45.26 |
| BAGEL+DAR | 48.54 | 43.91 | 46.23 |
## Key Findings
- MICON-Bench effectively differentiates models: GPT-Image achieves the highest score (90.15), while the diffusion-based UNO scores lowest (44.76).
- DAR yields the most pronounced gains on BAGEL: Avg +2.76 (73.55→76.31), with FG/BG improving by +6.60.
- DAR consistently improves performance across three distinct benchmarks (MICON-Bench, OmniContext, XVerseBench), demonstrating strong generalizability.
- Component Transfer and FG/BG Composition are the most challenging tasks; even top-tier models score only about 83–88 on them.
- A substantial gap remains between open-source and closed-source models (BAGEL 73.55 vs. GPT-Image 90.15).
## Highlights & Insights
- First systematic multi-image context generation benchmark: 6 tasks spanning a complete difficulty spectrum from simple composition to causal reasoning.
- Evaluation-by-Checkpoint paradigm: Fine-grained, quantifiable, and extensible, offering greater objectivity than image-level metrics.
- DAR is concise yet effective: Sampling only 64 query tokens with dual-threshold reweighting achieves significant gains at zero training cost.
- The work exposes attention allocation blind spots in UMMs under multi-image reasoning, providing direction for future model design.
## Limitations & Future Work
- The DAR thresholds \(\tau_{high}\), \(\tau_{low}\), and modulation factor \(\gamma\) require manual tuning; adaptive strategies remain unexplored.
- The Story Generation task has a relatively small sample size (103 cases).
- Benchmark data generated via Qwen-Image and GPT-4o may introduce generative model bias.
- Higher-order requirements such as 3D consistency and temporal continuity are not evaluated.
## Rating
- Novelty: ⭐⭐⭐⭐ First multi-image context generation benchmark + plug-and-play DAR
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7+ models × 3 benchmarks × multiple metrics with comprehensive comparisons
- Writing Quality: ⭐⭐⭐⭐ Task definitions are clear and the evaluation pipeline is well-structured
- Value: ⭐⭐⭐⭐ Benchmark promotes evaluation standardization; DAR is immediately deployable