Match-and-Fuse: Consistent Generation from Unstructured Image Sets¶
Conference: CVPR 2026 · arXiv: 2511.22287 · Area: Image Generation / Consistent Generation
Keywords: Set-to-set generation, cross-image consistency, diffusion models, feature fusion, correspondences, training-free, zero-shot
TL;DR¶
Match-and-Fuse is proposed as the first training-free consistent generation method for unstructured image sets. Images are treated as nodes and image pairs as edges to construct a pairwise consistency graph. Multi-view Feature Fusion (MFF) and feature guidance are employed to manipulate internal features during diffusion inference, achieving set-level cross-image consistency with a DINO-MatchSim of 0.80, substantially outperforming all baselines.
Background & Motivation¶
Background: Everyday visual experiences are organized as image collections (photo albums, product catalogs, property listings), yet generative AI primarily focuses on single images or videos, leaving set-level consistent generation largely unexplored.
Core Challenges: (a) Image collections lack the temporal continuity of video and thus provide no motion cues; (b) shared content may undergo large deformations; (c) shared elements must remain consistent while non-shared regions are allowed to vary freely.
Limitations of Prior Work: Edicho is limited to pairwise editing propagated from a single reference; IC-LoRA requires LoRA fine-tuning; FLUX Kontext lacks an explicit consistency mechanism; 3D/video editing methods rely on overly strong assumptions.
Key Findings: T2I diffusion models exhibit a grid prior — when multiple images are jointly generated on a concatenated canvas, consistency emerges spontaneously, yet it is incomplete and degrades rapidly as the number of images increases.
Core Idea: Model the image collection as a complete graph and exploit pairwise grid priors together with dense 2D correspondences to perform multi-view feature fusion and guidance at the feature level.
Method¶
Overall Architecture¶
The inputs are \(N\) images, \(\mathcal{P}^{shared}\) (a description of shared content), and \(\mathcal{P}^{theme}\) (style/theme). In preprocessing, dense matches \(M_{ij}\) (RoMA) are computed for all image pairs and a VLM generates per-image captions. During inference, joint denoising is performed over the pairwise consistency graph.
Key Designs¶
- Pairwise Consistency Graph:
- Graph \(G=(V,E)\): nodes represent images; edges connect all image pairs.
- Each edge corresponds to a two-image grid latent encoding \(z_{ij}^t = \text{concat}(z_i^t, z_j^t)\), paired with concatenated depth maps and a grid prompt.
- After each denoising step, each node gathers the copies of its latent from all adjacent edge grids and averages them.
- Scalability: the full graph is used for \(N \leq 5\); for larger sets, node degree is capped at 4 randomly chosen neighbors, reducing the edge count from \(O(N^2)\) to \(O(N)\).
- Multi-view Feature Fusion (MFF):
- Core Finding: The cosine similarity between features at matched locations is strongly correlated with visual consistency.
- Pairwise fusion: \(\mathbf{f}_i[\mathbf{c}] \leftarrow \frac{1}{2}(\mathbf{f}_i[\mathbf{c}] + \mathbf{f}_j[M_{ij}(\mathbf{c})])\) for all matched coordinates \(\mathbf{c} \in \mathcal{C}_i\).
- Extension to \(N\) images: features are first averaged across adjacent edges \(\bar{\mathbf{f}}_i = \frac{1}{|\delta(i)|}\sum_{e \in \delta(i)} \mathbf{f}_i^e\), then fused across all images.
- Applied to K,V feature maps at selected layers of the DiT.
- Feature Guidance:
- Matching feature distance objective: \(L_{guide} = \frac{1}{|E|}\sum_{\{i,j\}\in E}\frac{1}{|M_{ij}|}\sum_{\mathbf{c}\in M_{ij}}\|\mathbf{f}_i[\mathbf{c}] - \mathbf{f}_j[M_{ij}(\mathbf{c})]\|_2\)
- The gradient of \(L_{guide}\) with respect to \(z_i^{t-1}\) provides a light refinement step in latent space.
- MFF can be interpreted as the analytical minimizer of this objective; feature guidance corrects the residual inconsistencies.
- Gradient propagation through the model provides a wider receptive field, yielding robustness to sparse matches.
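The core per-step operations above can be expressed compactly. Below is a minimal NumPy sketch, under the assumption that features and latents are flattened to `(HW, C)` arrays and matches are index pairs; the function names are illustrative, not from the paper's code, and a real implementation would operate on DiT K/V maps inside the denoiser.

```python
import numpy as np

def mff_fuse(f_i, f_j, matches):
    """Pairwise Multi-view Feature Fusion: average features at matched
    locations. f_i, f_j: (HW, C) feature maps; matches: (M, 2) pairs of
    flattened coordinates (index in i, index in j)."""
    ci, cj = matches[:, 0], matches[:, 1]
    fused = 0.5 * (f_i[ci] + f_j[cj])
    f_i, f_j = f_i.copy(), f_j.copy()
    f_i[ci] = fused
    f_j[cj] = fused
    return f_i, f_j

def node_average(edge_latents):
    """After each denoising step, a node averages the copies of its
    latent extracted from all adjacent edge grids."""
    return np.mean(edge_latents, axis=0)

def guidance_loss(features, matches_by_edge):
    """L_guide: mean L2 distance between matched features, averaged over
    matches per edge and then over edges. features: dict node -> (HW, C);
    matches_by_edge: dict (i, j) -> (M, 2) index pairs."""
    per_edge = []
    for (i, j), m in matches_by_edge.items():
        d = np.linalg.norm(features[i][m[:, 0]] - features[j][m[:, 1]], axis=-1)
        per_edge.append(d.mean())
    return float(np.mean(per_edge))
```

Note that after `mff_fuse`, matched features are identical by construction, so `guidance_loss` on the fused maps is zero — which is why MFF can be read as the analytical minimizer of the guidance objective.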
Input Correspondences¶
Dense 2D matches are computed using RoMA; shared regions are identified automatically via confidence filtering, requiring no manual masks.
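The confidence-filtering step can be sketched as follows, assuming the matcher returns a per-match confidence score (as dense matchers like RoMA do); the threshold value here is illustrative, not the paper's setting.

```python
import numpy as np

def filter_matches(coords_i, coords_j, confidence, thresh=0.5):
    """Keep only correspondences whose confidence exceeds a threshold,
    so the surviving matches delimit the shared region automatically and
    no manual masks are needed. coords_*: (M, 2) pixel coordinates;
    confidence: (M,) matcher scores in [0, 1]."""
    keep = confidence > thresh
    return coords_i[keep], coords_j[keep]
```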
Key Experimental Results¶
Main Results: Consistency and Prompt Adherence¶
| Method | CLIP Score↑ | DreamSim↑ | DINO-MatchSim↑ |
|---|---|---|---|
| FLUX Kontext | 0.65 | 0.78 | 0.57 |
| IC-LoRA | 0.65 | 0.71 | 0.65 |
| FLUX | 0.67 | 0.76 | 0.66 |
| Edicho | 0.65 | 0.81 | 0.72 |
| Match-and-Fuse | 0.66 | 0.85 | 0.80 |
| w/o Guidance | 0.66 | 0.82 | 0.76 |
| w/o MFF | 0.66 | 0.83 | 0.78 |
| w/o Pairwise Graph | 0.66 | 0.82 | 0.75 |
User Study & VLM Evaluation (2AFC, Win Rate of Ours)¶
| Baseline | User Preference↑ | VLM Preference↑ |
|---|---|---|
| vs Kontext | 88% | 82% |
| vs IC-LoRA | 90% | 92% |
| vs FLUX | 92% | 94% |
| vs Edicho | 83% | 78% |
Metric Alignment with Human Judgment¶
| Metric | Agreement with Humans↑ |
|---|---|
| DreamSim | 84.3% |
| VLM | 84.9% |
| DINO-MatchSim | 91.4% |
Key Findings¶
- DINO-MatchSim of 0.80 substantially surpasses the strongest baseline, Edicho, at 0.72 (a relative gain of 11.1%).
- All three components (Graph, MFF, Guidance) are individually necessary.
- Match-and-Fuse with 9 images achieves higher consistency than baselines operating on only 2 images.
- Even when matches are as sparse as 10%, DINO-MatchSim remains above 0.76, demonstrating strong robustness.
- DINO-MatchSim achieves 91.4% alignment with human judgment, substantially exceeding DreamSim at 84.3%.
Highlights & Insights¶
- First set-to-set generation method: Extends generative AI to image collections as a fundamental visual unit.
- Elegant graph formulation: The pairwise consistency graph enables local pairwise operations with global information propagation, and \(O(N^2)\) complexity can be sparsified to \(O(N)\).
- Discovery and exploitation of the grid prior: The spontaneous consistency emerging from T2I models under grid layouts is a key insight.
- DINO-MatchSim metric: Matched points in source images are used to localize corresponding positions in generated images, enabling patch-level similarity measurement that is more discriminative than global metrics.
- Entirely training-free, zero-shot, and requires no manual masks.
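Based on the description of DINO-MatchSim above, the metric can be approximated as below: source-image correspondences are transferred to patch indices in the generated images, and patch-wise cosine similarity of DINO features is averaged. This is a sketch under stated assumptions — the patch indexing convention and feature shapes are illustrative, and `feat_a`/`feat_b` would come from an actual DINO backbone.

```python
import numpy as np

def dino_matchsim(feat_a, feat_b, matches):
    """Patch-level consistency score: mean cosine similarity between DINO
    patch features of two generated images, at positions given by matches
    transferred from the source pair. feat_*: (num_patches, C) features;
    matches: (M, 2) patch-index pairs."""
    a = feat_a[matches[:, 0]]
    b = feat_b[matches[:, 1]]
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=-1)))
```

Because similarity is measured only at corresponded patches, the score ignores non-shared regions, which is what makes it more discriminative than global image-level metrics.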
Limitations & Future Work¶
- Relies on the quality of dense correspondences; regions with few or no matches may remain inconsistent.
- Depends on the base model's adherence to depth-conditioned inputs.
- FlowEdit integration requires per-edit hyperparameter tuning.
- Edge-based computation still incurs overhead for large collections: the full graph has \(O(N^2)\) edges, and even the sparsified \(O(N)\) graph multiplies per-step denoising cost, since each edge requires denoising a two-image grid.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to define and address set-to-set consistent generation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Quantitative evaluation + user study + VLM assessment + novel metric + ablation + extended applications.
- Writing Quality: ⭐⭐⭐⭐⭐ Polished figures, elegant formulations, and clear problem definition.
- Value: ⭐⭐⭐⭐ Applicable to creative workflows such as product advertising, character design, and storyboarding.