MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed (Paper states dataset and benchmark will be open-sourced)
Area: Image Generation / Multi-Image Composition Dataset
Keywords: Multi-image composition, identity consistency, controllable generation, dataset, evaluation metrics
TL;DR
Multi-Image Composition (MICo) is the task of synthesizing people, objects, clothing, and scenes from multiple reference images into a single coherent image. To address its lack of high-quality training data, this work constructs the MICo-150K dataset (150,000 identity-consistent samples) and the MICo-Bench benchmark. The construction uses the proprietary Nano-Banana model combined with a Compose-by-Retrieval prompt strategy, human-in-the-loop filtering, and a "Decompose-and-Recompose" (De&Re) workflow. A Weighted-Ref-VIEScore metric is also proposed. Fine-tuning multiple open-source T2I models on this dataset substantially improves their MICo capabilities (for example, Qwen-MICo rises from 38.5 to 58.2 overall on MICo-Bench), approaching closed-source models such as Nano-Banana (60.3).
Background & Motivation
Background: Text-to-Image (T2I) and Image-to-Image (I2I) synthesis can now produce realistic results. Personalized or contextual generation (maintaining identity consistency with reference images) is one of the most valuable capabilities. Recent works like FLUX.Kontext and Qwen-Image have shown significant progress with single reference image inputs.
Limitations of Prior Work: Most existing systems support only single reference image inputs and struggle to integrate multiple entities (multiple people, objects, clothes, scenes) into one coherent composite image. In the realm of true "Multi-Image Composition" (MICo), a clear gap exists between the open-source community and closed-source models like GPT-Image-1, Nano-Banana, and Seedream 4.0. A root cause is the scarcity of high-quality datasets specifically tailored for this task.
Key Challenge: Existing MICo datasets have two major flaws: (1) Many source/target image pairs are generated by a few fixed T2I models, leading to homogenized content and a quality gap compared to closed-source models; (2) Data based on real photos or video frames has limited diversity, lacks imaginative scenarios, and is heavily "person-centric," with insufficient coverage of "object-centric" and multi-subject scenes. The early paradigm of "using GroundingDINO+SAM to segment instances from a whole image as sources and the original as the target" often produces incomplete or semantically ambiguous samples.
Goal: Construct a broad-coverage, high-quality, identity-consistent MICo dataset, complemented by a dedicated evaluation benchmark and reliable metrics to advance the challenging and underexplored MICo task.
Key Insight: Rather than using weak generative backbones to batch-produce homogeneous data, it is better to directly use the strongest closed-source model (Nano-Banana) to synthesize targets, while employing retrieval at the front-end to ensure "semantic compatibility" of source combinations and VLM + manual dual filtering at the back-end to control quality.
Core Idea: A pipeline comprising "High-quality source collection → Compose-by-Retrieval for semantically compatible combinations → Synthesis via strong closed-source models → Automated + manual verification" is used to generate data. Additionally, a "Decompose-and-Recompose" (De&Re) track is introduced to allow the data to feature both real-world and synthetic compositions.
Method
Overall Architecture
MICo-150K is a dataset and benchmark effort centered on a data construction and evaluation pipeline rather than a new model architecture. The task is systematized into 3 categories, 7 sub-tasks, and 27 fine-grained types (Object-centric: Object+Object, Object+Scene; Person-centric: Person+Person, Person+Scene; Human-Object Interaction (HOI): Person+Object, Person+Clothes, Person+Object+Clothes), plus an independent "Decompose-and-Recompose" (De&Re) track.
The pipeline consists of four steps: ① Source collection and cleaning: object, person, clothing, and scene source images are collected from public datasets (Subject200k, VITON-HD, Headshot, SUN397, etc.), low-quality or ambiguous images are filtered with Qwen2.5-VL-72B, and redundancy is removed via DBSCAN clustering on DINO-v3 + SigLIP2 features; ② Compose-by-Retrieval prompting: instead of random pairings (which can yield incompatible sets such as "male athlete with high heels"), GPT-4o selects the most semantically compatible combinations from candidates and generates natural, coherent synthesis prompts; ③ Synthesis and verification: prompts are fed into the closed-source Nano-Banana model to synthesize target images, after which Qwen2.5-VL-72B verifies entity presence and ArcFace verifies facial identity consistency; ④ De&Re track: real single-person photos are decomposed by Nano-Banana into "person/clothes/object/scene" components, manually verified, and then recomposed. Each component set therefore has two versions: a "real composition" and a "recomposed synthetic composition." For evaluation, MICo-Bench (1,000 cases) was constructed along with the Weighted-Ref-VIEScore metric.
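The redundancy-removal part of step ① can be illustrated with a small, dependency-free sketch. It is not the authors' code: the toy 2-D vectors stand in for DINO-v3 and SigLIP2 embeddings, the `eps` and `min_samples` values are arbitrary, and keeping the first member of each cluster as its "representative" is an assumption made only for illustration.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def concat_features(feat_a, feat_b):
    # Normalize each embedding first so neither dominates the joint distance.
    return l2_normalize(l2_normalize(feat_a) + l2_normalize(feat_b))

def cosine_distance(u, v):
    # Inputs are unit-length, so the dot product is the cosine similarity.
    return 1.0 - sum(a * b for a, b in zip(u, v))

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels

def deduplicate(items, eps=0.05, min_samples=2):
    """Keep the first item of each cluster plus every noise point (assumed unique)."""
    feats = [concat_features(a, b) for _, a, b in items]
    labels = dbscan(feats, eps, min_samples)
    kept, seen = [], set()
    for (name, _, _), label in zip(items, labels):
        if label == -1:
            kept.append(name)
        elif label not in seen:
            seen.add(label)
            kept.append(name)
    return kept

# Hypothetical (name, "DINO" vector, "SigLIP" vector) triples.
items = [
    ("mug_a", [1.0, 0.0], [0.9, 0.1]),
    ("mug_b", [0.99, 0.01], [0.9, 0.12]),
    ("chair", [0.0, 1.0], [0.1, 0.9]),
]
print(deduplicate(items))
```

Here the two near-identical mugs fall into one cluster and only the first is kept, while the chair is treated as a unique image. At real scale a library implementation such as scikit-learn's `DBSCAN` would replace the quadratic neighbor search used above.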
Key Designs
1. Task Taxonomy + High-quality Source Collection/Deduplication: Multi-image composition is broken down into controllable fine-grained sub-tasks. A taxonomy of 3 categories / 7 sub-tasks / 27 fine-grained types (e.g., 1O1S, 2O1S under Object+Scene; 2M, 2W, 1M1W under Person+Person) ensures clear source sampling rules for each combination. Qwen2.5-VL-72B filters the collected sources, and DBSCAN clustering using concatenated DINO-v3 and SigLIP2 features ensures only representative images from each visual-semantic cluster are retained to eliminate redundancy.
2. Compose-by-Retrieval: To avoid poor synthesis results from semantically incompatible random samples, a subject image is first determined, and candidates are sampled from other pools. GPT-4o then selects the most compatible combination based on images and their detailed captions. Furthermore, instead of simply concatenating captions, GPT-4o generates more coherent, natural synthesis prompts and provides explicit "token → source image" mappings for future latent space alignment research.
3. Decompose-and-Recompose (De&Re): To incorporate the complexity of the real world, high-quality real photos from CC12M are decomposed into independent components using Nano-Banana. After human-in-the-loop refinement to fix extraction failures (e.g., identity loss or lack of variation), these components are recomposed. This naturally yields a pair for each set of components: one real-world composition (the original photo) and one recomposed synthetic composition (11,677 cases in total).
4. Weighted-Ref-VIEScore: Traditional VIEScore (\(\text{SC} \times \text{PQ}\)) often fails when VLMs suffer from cross-image attention overload during multi-source evaluation. Weighted-Ref-VIEScore addresses this with two mechanisms. Weighting: each non-human source is paired with the generated image so that Qwen2.5-VL-72B can judge its presence, while human sources are verified via ArcFace, yielding a contribution weight \(W\). Reference mechanism: GPT-4o performs pairwise comparisons between the generated image and a manually verified reference image (produced by Nano-Banana) to obtain a more accurate semantic consistency (SC) score. The final score is defined in terms of three sub-scores: subject similarity (SR), prompt following (PF), and perceptual quality (PQ).
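The weighting step in design 4 can be illustrated with a minimal sketch of per-source checks. Everything specific here is an assumption for illustration: the function names, the 0.5 cosine threshold, the toy 2-D vectors standing in for ArcFace embeddings, and above all the aggregation of the checks into a single weight as a simple pass fraction. The paper's exact definition of \(W\) may differ.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def face_matches(source_emb, generated_embs, threshold=0.5):
    """True if any face found in the generated image matches the source identity."""
    return any(cosine_similarity(source_emb, g) >= threshold for g in generated_embs)

def contribution_weight(object_present, source_face_embs, generated_face_embs,
                        threshold=0.5):
    """Fraction of sources passing their check (an ASSUMED aggregation, not the paper's)."""
    checks = list(object_present)  # per-object yes/no verdicts, e.g. from a VLM
    checks += [face_matches(e, generated_face_embs, threshold)
               for e in source_face_embs]
    return sum(checks) / len(checks) if checks else 0.0

# Hypothetical case: two object sources (one missing) and two person sources,
# of which only the first appears among the faces detected in the output.
w = contribution_weight(
    object_present=[True, False],
    source_face_embs=[[1.0, 0.0], [0.0, 1.0]],
    generated_face_embs=[[0.9, 0.1]],
)
print(w)
```

Checking each source against the generated image separately keeps every judgment a two-image comparison, which is the property the metric relies on to avoid cross-image attention overload.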
Key Experimental Results
Dataset Scale
| Category | Sub-task | Representative Types | Quantity (Approx) |
|---|---|---|---|
| Object-centric | Object+Scene | 1O1S / 2O1S | 5,014 / 4,999 |
| Object-centric | Object+Object | 2O / 3O / 4O / 5O | ~10k each / 5k each |
| Person-centric | Person+Person | Various gender mixes | Total ~24k |
| Person-centric | Person+Scene | 1P1S / 2P1S | 4,986 / 4,994 |
| HOI | P+O / P+C / P+O+C | Various variants | ~20k–28k each |
| De&Re | De&Re | Adaptive | 11,677 |
Main Results (MICo-Bench, Overall Score, Excerpt from Table 2)
| Model | Base | w/o De&Re | real | synth |
|---|---|---|---|---|
| BLIP3-o | 2.2 | 42.2 | 43.2 | 43.0 |
| BAGEL | 33.3 | 42.6 | 44.3 | 44.1 |
| Qwen-MICo (Ours) | 38.5 | 56.4 | 58.2 | 58.1 |
| GPT-4o (Closed) | 59.6 | – | – | – |
| Nano-Banana (Closed) | 60.3 | – | – | – |
Note: Qwen-MICo approaches the performance of Qwen-Image-2509 but supports arbitrary multi-image inputs (the latter is limited to 3).
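To make the table easier to read, the short script below derives the per-model deltas from the numbers above. The column interpretation is an assumption: `wo_dere` is read as fine-tuning without the De&Re track, and `real` / `synth` as fine-tuning with De&Re real or synthetic targets. The key names and the `summarize` helper are illustrative only.

```python
# Overall scores copied from the table above.
scores = {
    "BLIP3-o": {"base": 2.2, "wo_dere": 42.2, "real": 43.2, "synth": 43.0},
    "BAGEL": {"base": 33.3, "wo_dere": 42.6, "real": 44.3, "synth": 44.1},
    "Qwen-MICo": {"base": 38.5, "wo_dere": 56.4, "real": 58.2, "synth": 58.1},
}
NANO_BANANA = 60.3  # strongest closed-source score in the table

def summarize(row):
    """Derive the deltas discussed in the findings from one table row."""
    return {
        "gain_from_sft": round(row["wo_dere"] - row["base"], 1),
        "gain_from_dere": round(row["real"] - row["wo_dere"], 1),
        "real_minus_synth": round(row["real"] - row["synth"], 1),
        "gap_to_nano_banana": round(NANO_BANANA - row["real"], 1),
    }

for model, row in scores.items():
    print(model, summarize(row))
```

For Qwen-MICo this gives a gain of 17.9 points from fine-tuning without De&Re, a further 1.8 from adding De&Re, a real-versus-synthetic difference of 0.1, and a remaining gap of 2.1 points to Nano-Banana.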
Key Findings
- Emergent MICo capabilities in strong I2I models: Models like BAGEL and Qwen-Image-Edit, though never trained on multi-image composition, show inherent MICo abilities when multiple source tokens are concatenated. SFT significantly amplifies this.
- Synthetic data can replace real data: Training with De&Re real targets versus synthetic targets yields nearly identical results (e.g., 58.2 vs. 58.1 for Qwen-MICo), reducing the cost of acquiring high-quality MICo data.
- Closed-source models have distinct strengths: Nano-Banana scores higher quantitatively, but GPT-4o shows better robustness with fewer artifacts like incomplete limbs or identity loss.
Highlights & Insights
- The recipe of "Strong closed-source model as a data factory + Retrieval for compatibility + Human-in-the-loop quality control" effectively solves source homogeneity and combination incompatibility.
- The De&Re dual-target design provides a natural framework for "Real vs. Synthetic" ablation, offering direct evidence for the utility of synthetic data.
- Weighted-Ref-VIEScore addresses the "cross-image attention overload" in VLM evaluation by using pairwise reference-based scoring, providing a blueprint for evaluating multi-input-to-single-output tasks.
Limitations & Future Work
- Reliance on closed-source models: Synthesis, prompt generation, and verification rely heavily on proprietary models (Nano-Banana, GPT-4o), leading to high reproduction costs and potential bias inheritance.
- Evaluation bias: While Weighted-Ref-VIEScore mitigates attention issues, it still relies on GPT-4o as a judge and Nano-Banana for reference images.
- Task Boundaries: The taxonomy focuses on people, objects, clothes, and scenes; coverage of abstract interactions or complex physical relationships remains limited.
Related Work & Insights
- Compared to segmentation-based methods (Subject Diffusion, MS-Diffusion) that yield incomplete sources via GroundingDINO+SAM, MICo-150K relies on cleaner source collection and retrieval-based pairing for better compatibility.
- Unlike UNO or OmniGen2, which suffer from content homogeneity due to weaker generative backbones, MICo-150K leverages the strongest closed-source models and human-in-the-loop refinement.
- Among high-quality non-segmentation datasets, MICo-150K significantly exceeds Echo-4o in source and prompt diversity.
Rating
- Novelty: ⭐⭐⭐⭐ Solid combination of task systematization, retrieval-based prompting, and De&Re.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive validation across 5 heterogeneous open-source models, including real/synthetic ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear presentation of pipeline and metrics.
- Value: ⭐⭐⭐⭐⭐ Fills a critical gap in open-source MICo training data and evaluation benchmarks.