MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
Conference: CVPR 2026
arXiv: 2602.19497
Code: https://github.com/Angusliuuu/MICON-Bench
Area: Image Generation / Multimodal Evaluation
Keywords: Multi-image Context Generation, Unified Multimodal Models, Benchmark, Dynamic Attention Rebalancing, Checkpoint Evaluation
TL;DR
This paper introduces MICON-Bench, a multi-image context generation benchmark covering 6 tasks (1043 cases), paired with an MLLM-driven Evaluation-by-Checkpoint framework for automated scoring. It also proposes DAR (Dynamic Attention Rebalancing), a training-free mechanism that improves multi-image consistency and generation quality in unified multimodal models (UMMs) by dynamically adjusting attention weights at inference time.
Background & Motivation
Background: UMMs have demonstrated the capability to process multi-image inputs and generate contextually consistent visual outputs, represented by models such as Nano-Banana, GPT-Image, BAGEL, and OmniGen2. However, systematic evaluation for multi-image context generation remains lacking.
Limitations of Prior Work: Existing benchmarks (GenEval, T2ICompBench, ImgEdit-Bench) primarily evaluate text-to-image or single-image editing, failing to address cross-image consistency and complex visual relationship reasoning. While OmniContext includes multiple images, it is restricted to simple object compositions.
Key Challenge: UMMs tend to distribute attention uniformly across all regions of all reference images—including irrelevant areas—during multi-image input processing, which leads to hallucinations and inconsistencies.
Core Idea: (a) A set of 6 standardized tasks with a verifiable checkpoint evaluation system; (b) An attention rebalancing mechanism to adjust focus during inference.
Method
Overall Architecture
The study consists of two components: a comprehensive benchmark and a plug-and-play inference mechanism. To address the lack of systematic evaluation for multi-image context generation (generating consistent new images from multiple reference images), the authors built MICON-Bench, featuring 6 task categories and 1043 cases. This is paired with an MLLM-driven "Evaluation-by-Checkpoint" framework that decomposes each case into verifiable fine-grained scoring points. To solve the issue where UMMs distribute attention too broadly across reference regions, DAR (Dynamic Attention Rebalancing) is proposed to dynamically re-weight attention during inference without requiring additional training.
Key Designs
1. MICON-Bench: 6 Tasks Ranging from Simple Composition to Causal Reasoning
Unlike existing benchmarks focused on single-image editing, MICON-Bench decomposes multi-image context generation into 5 composition tasks and 1 complex reasoning task with increasing difficulty:
| Task | Description | Case Count | Ref Images |
|---|---|---|---|
| Object Composition | Single subject + background combination | 200 | 2-3 |
| Spatial Composition | Spatial relationship constraints for multiple objects | 200 | 2-3 |
| Attribute Disentanglement | Decoupled recombination of subject/style/background | 100 | 3 |
| Component Transfer | Transferring parts/accessories across images | 240 | 2-3 |
| FG/BG Composition | Foreground + background fusion | 200 | 2 |
| Story Generation | Causal reasoning to continue a story | 103 | 2-3 |
| Total | | 1043 | 2518 images |
2. Evaluation-by-Checkpoint: Decomposing Quality into Pass/Fail Points
To avoid coarse image-level scoring, this framework defines a set of verifiable checkpoints for each case, covering seven dimensions: instruction following, identity consistency, structure, cross-reference consistency, causality, text anchoring, and overall usability. An MLLM (Qwen3-VL-32B) acts as the verifier to judge each point as pass/fail, with the final score being the mean pass rate. The Story task uses an additional predefined answer set for reasoning evaluation.
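The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Checkpoint` class, `case_score`, and `benchmark_score` are hypothetical names, the MLLM verdict is stubbed as a boolean, and averaging per-case pass rates into a task score is an assumption about how the mean is taken.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Checkpoint:
    """One verifiable scoring point for a case (hypothetical structure)."""
    dimension: str   # e.g. "identity consistency"
    question: str    # what the verifier is asked to check
    passed: bool     # pass/fail verdict; in the paper this comes from the MLLM verifier


def case_score(checkpoints):
    """Pass rate of one case, on a 0-100 scale."""
    if not checkpoints:
        raise ValueError("a case needs at least one checkpoint")
    return 100.0 * sum(c.passed for c in checkpoints) / len(checkpoints)


def benchmark_score(cases):
    """Assumed aggregation: mean of per-case pass rates."""
    return mean(case_score(checkpoints) for checkpoints in cases)


# Example: one case with four checkpoints, three of which pass.
example_case = [
    Checkpoint("instruction following", "Is the subject placed in the background?", True),
    Checkpoint("identity consistency", "Does the subject match reference 1?", True),
    Checkpoint("structure", "Is the subject anatomically intact?", False),
    Checkpoint("overall usability", "Is the image free of severe artifacts?", True),
]
print(case_score(example_case))
```

For the example case, three of four checkpoints pass, so `case_score` returns 75.0.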
3. Dynamic Attention Rebalancing (DAR): Redirecting Attention to Key Regions
DAR targets the diagnostic finding that UMMs attend indiscriminately to irrelevant reference regions. It first runs a lightweight attention probe: it samples \(m \ll L_q\) query tokens (default \(m=64\)) and computes their attention \(\tilde{A}_{i,h,k}\) to each reference-image key token \(k\), where \(i\) indexes the sampled queries and \(h\) the \(H\) attention heads. The per-key score \(r_k = \sum_{i=1}^{m}\sum_{h=1}^{H} \tilde{A}_{i,h,k}\) is min-max normalized to obtain \(\hat{r}_k \in [0, 1]\). Keys are then re-weighted with two thresholds: crucial keys (\(\hat{r}_k \geq \tau_{high}\)) are amplified with \(w_k = 1+\gamma\), irrelevant keys (\(\hat{r}_k \leq \tau_{low}\)) are suppressed with \(w_k = 1-\gamma\), and all other keys keep \(w_k = 1\). Attention is then recomputed as \(A = \text{softmax}\left(\frac{Q(w \odot K_{ref})^\top}{\sqrt{d}}\right)\), with defaults \(\gamma=0.15, \tau_{high}=0.7, \tau_{low}=0.3\).
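A minimal NumPy sketch of the re-weighting step, under stated assumptions: \(\tilde{A}\) is taken to be softmax attention from the sampled queries over the reference keys only, queries are sampled uniformly at random, and the function names (`dar_key_weights`, `dar_attention`) are illustrative rather than taken from the released code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dar_key_weights(attn, gamma=0.15, tau_high=0.7, tau_low=0.3):
    """Per-key weights w_k from probe attention of shape (m, H, L_k)."""
    r = attn.sum(axis=(0, 1))              # r_k: sum over sampled queries and heads
    span = r.max() - r.min()
    if span == 0:                          # assumption: no re-weighting if all keys tie
        return np.ones_like(r)
    r_hat = (r - r.min()) / span           # min-max normalization to [0, 1]
    w = np.ones_like(r)
    w[r_hat >= tau_high] = 1.0 + gamma     # amplify crucial keys
    w[r_hat <= tau_low] = 1.0 - gamma      # suppress irrelevant keys
    return w


def dar_attention(Q, K_ref, m=64, gamma=0.15, tau_high=0.7, tau_low=0.3, seed=0):
    """Recompute attention with re-weighted reference keys.

    Q: (H, L_q, d) queries, K_ref: (H, L_k, d) reference-image keys.
    Returns the attention map (H, L_q, L_k) and the key weights (L_k,).
    """
    H, L_q, d = Q.shape
    rng = np.random.default_rng(seed)
    # Assumption: the m probe queries are drawn uniformly without replacement.
    idx = rng.choice(L_q, size=min(m, L_q), replace=False)
    probe = softmax(Q[:, idx] @ K_ref.transpose(0, 2, 1) / np.sqrt(d))  # (H, m, L_k)
    w = dar_key_weights(probe.transpose(1, 0, 2), gamma, tau_high, tau_low)
    K_scaled = K_ref * w[None, :, None]    # w ⊙ K_ref, broadcast over heads and channels
    A = softmax(Q @ K_scaled.transpose(0, 2, 1) / np.sqrt(d))
    return A, w
```

The probe costs one attention pass over \(m\) queries instead of \(L_q\), which is what keeps the analysis cheap relative to the full attention computation.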
Key Experimental Results
Main Results: MICON-Bench Task Scores
| Model | Object | Spatial | Attribute | Component | FG/BG | Story | Avg↑ |
|---|---|---|---|---|---|---|---|
| Nano-Banana | 95.60 | 93.79 | 92.13 | 84.23 | 83.13 | 82.84 | 89.25 |
| GPT-Image | 96.45 | 94.41 | 93.39 | 87.69 | 85.99 | 91.51 | 90.15 |
| UNO | 58.40 | 66.68 | 65.28 | 28.84 | 20.96 | 39.08 | 44.76 |
| DreamOmni2 | 88.24 | 84.76 | 85.28 | 59.64 | 76.16 | 59.58 | 75.56 |
| BAGEL | 87.64 | 89.96 | 89.84 | 52.40 | 64.64 | 65.09 | 73.55 |
| BAGEL + DAR | 88.04 | 91.88 | 90.76 | 56.06 | 71.24 | 66.34 | 76.31 |
| OmniGen2 | 89.52 | 80.32 | 81.64 | 44.76 | 57.96 | 60.96 | 67.83 |
| OmniGen2 + DAR | 89.84 | 81.00 | 82.12 | 48.72 | 59.28 | 60.73 | 69.21 |
OmniContext Benchmark
| Method | SINGLE Char/Obj | MULTIPLE Char/Obj | SCENE Char/Obj | Avg↑ |
|---|---|---|---|---|
| OmniGen2 | 8.18/7.33 | 6.56/7.99 | 6.87/7.90 | 7.53 |
| OmniGen2+DAR | 8.30/8.19 | 6.64/8.42 | 7.06/7.97 | 7.77 |
| BAGEL | 5.71/6.22 | 3.03/6.90 | 4.24/5.16 | 5.54 |
| BAGEL+DAR | 6.26/6.08 | 4.14/7.18 | 4.78/4.84 | 5.80 |
XVerseBench Benchmark
| Method | Single-Subject Avg↑ | Multi-Subject Avg↑ | Overall↑ |
|---|---|---|---|
| OmniGen2 | 52.53 | 49.76 | 51.14 |
| OmniGen2+DAR | 53.24 | 50.23 | 51.73 |
| BAGEL | 47.91 | 42.62 | 45.26 |
| BAGEL+DAR | 48.54 | 43.91 | 46.23 |
Key Findings
- MICON-Bench effectively differentiates models: GPT-Image is the strongest (90.15), while the diffusion-based UNO is the weakest (44.76).
- DAR provides the most significant boost to BAGEL: Avg +2.76 (73.55→76.31), with a +6.60 gain in FG/BG.
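- DAR generalizes across three benchmarks (MICON-Bench, OmniContext, XVerseBench): it raises the average score of both BAGEL and OmniGen2 on each, although a few individual sub-scores dip slightly (e.g., OmniGen2 Story 60.96→60.73 on MICON-Bench; BAGEL SINGLE Obj 6.22→6.08 on OmniContext).
- Component Transfer and FG/BG are among the most challenging tasks: even the two strongest models score only about 83-88 on them (Nano-Banana 84.23/83.13, GPT-Image 87.69/85.99).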
- A significant gap remains between open-source and closed-source models (BAGEL 73.55 vs GPT-Image 90.15).
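The DAR gains quoted above follow directly from the MICON-Bench table. A short sketch that recomputes them (scores are copied from the table; `dar_gains` is an illustrative helper name):

```python
TASKS = ["Object", "Spatial", "Attribute", "Component", "FG/BG", "Story", "Avg"]

# MICON-Bench scores copied from the main results table.
SCORES = {
    "BAGEL":          [87.64, 89.96, 89.84, 52.40, 64.64, 65.09, 73.55],
    "BAGEL + DAR":    [88.04, 91.88, 90.76, 56.06, 71.24, 66.34, 76.31],
    "OmniGen2":       [89.52, 80.32, 81.64, 44.76, 57.96, 60.96, 67.83],
    "OmniGen2 + DAR": [89.84, 81.00, 82.12, 48.72, 59.28, 60.73, 69.21],
}


def dar_gains(base):
    """Per-task score change when DAR is added to the given base model."""
    before = SCORES[base]
    after = SCORES[f"{base} + DAR"]
    return {task: round(a - b, 2) for task, b, a in zip(TASKS, before, after)}


for model in ("BAGEL", "OmniGen2"):
    print(model, dar_gains(model))
```

`dar_gains("BAGEL")["FG/BG"]` gives 6.6 and `dar_gains("OmniGen2")["Story"]` gives -0.23, matching the table.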
Highlights & Insights
- First Systematic Multi-Image Context Generation Benchmark: 6 tasks cover the full difficulty spectrum from simple composition to causal reasoning.
- Evaluation-by-Checkpoint Paradigm: Fine-grained, quantifiable, and scalable, providing a more objective measure than traditional image-level metrics.
- Concise and Effective DAR Mechanism: Using only 64 sampled query tokens and dual-threshold re-weighting, DAR raises the MICON-Bench average by 1.38 (OmniGen2) to 2.76 (BAGEL) points without any training cost.
- Identifies attention allocation blind spots in UMM multi-image reasoning, providing direction for future model design.
Limitations & Future Work
- DAR thresholds (\(\tau_{high}, \tau_{low}\)) and modulation factor (\(\gamma\)) require manual setting; adaptive schemes have not been explored.
- The sample size for the Story Generation task is relatively small (103 cases).
- Benchmark data generated by Qwen-Image + GPT-4o may introduce generative model bias.
- Higher-order requirements such as 3D consistency and temporal continuity were not evaluated.
Rating
- Novelty: ⭐⭐⭐⭐ First multi-image context benchmark + plug-and-play DAR.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7+ models + 3 benchmarks + multiple metrics + comprehensive comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear task definitions and refined evaluation workflows.
- Value: ⭐⭐⭐⭐ Benchmark drives evaluation standardization; DAR is highly practical.