
Match-and-Fuse: Consistent Generation from Unstructured Image Sets

Conference: CVPR 2026 · arXiv: 2511.22287 · Area: Image Generation / Consistent Generation
Keywords: Set-to-set generation, cross-image consistency, diffusion models, feature fusion, correspondences, training-free, zero-shot

TL;DR

Match-and-Fuse is proposed as the first training-free consistent generation method for unstructured image sets. Images are treated as nodes and image pairs as edges to construct a pairwise consistency graph. Multi-view Feature Fusion (MFF) and feature guidance are employed to manipulate internal features during diffusion inference, achieving set-level cross-image consistency with a DINO-MatchSim of 0.80, substantially outperforming all baselines.

Background & Motivation

Background: Everyday visual experiences are organized as image collections (photo albums, product catalogs, property listings), yet generative AI primarily focuses on single images or videos, leaving set-level consistent generation largely unexplored.

Core Challenges: (a) Image collections lack the temporal continuity of video and thus provide no motion cues; (b) shared content may undergo large deformations; (c) shared elements must remain consistent while non-shared regions are allowed to vary freely.

Limitations of Prior Work: Edicho is limited to pairwise editing propagated from a single reference; IC-LoRA requires LoRA fine-tuning; FLUX Kontext lacks an explicit consistency mechanism; 3D/video editing methods rely on overly strong assumptions.

Key Findings: T2I diffusion models exhibit a grid prior: when multiple images are jointly generated on a concatenated canvas, consistency emerges spontaneously, yet it is incomplete and degrades rapidly as the number of images increases.

Core Idea: Model the image collection as a complete graph and exploit pairwise grid priors together with dense 2D correspondences to perform multi-view feature fusion and guidance at the feature level.

Method

Overall Architecture

The inputs are \(N\) images, \(\mathcal{P}^{shared}\) (a description of shared content), and \(\mathcal{P}^{theme}\) (style/theme). In preprocessing, dense matches \(M_{ij}\) (RoMA) are computed for all image pairs and a VLM generates per-image captions. During inference, joint denoising is performed over the pairwise consistency graph.
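The joint denoising over the pairwise graph can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `denoise_pair` stands in for one DiT denoising step on a two-image grid latent, and all function names and shapes here are assumptions.

```python
import numpy as np

def denoise_pair(z_i, z_j, t):
    """Placeholder for one denoising step on a two-image grid latent."""
    z = np.concatenate([z_i, z_j], axis=-1)  # two-image grid latent z_ij
    z = z * 0.9                              # toy stand-in for the model update
    w = z.shape[-1] // 2
    return z[..., :w], z[..., w:]            # split back into per-image latents

def joint_denoise(latents, edges, steps=4):
    """Denoise every edge's grid latent, then average each node's versions."""
    n = len(latents)
    for t in range(steps, 0, -1):
        acc = [[] for _ in range(n)]
        for (i, j) in edges:
            zi, zj = denoise_pair(latents[i], latents[j], t)
            acc[i].append(zi)
            acc[j].append(zj)
        # Each node averages its latent versions from all adjacent edges.
        latents = [np.mean(v, axis=0) for v in acc]
    return latents

# Complete graph over 3 images (toy 4x4 latents).
edges = [(0, 1), (0, 2), (1, 2)]
latents = [np.ones((4, 4)) * k for k in (1.0, 2.0, 3.0)]
out = joint_denoise(latents, edges)
```

The averaging step is what propagates information globally even though every model call only ever sees a pair of images.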

Key Designs

  1. Pairwise Consistency Graph:

    • Graph \(G=(V,E)\): nodes represent images; edges connect all image pairs.
    • Each edge corresponds to a two-image grid latent encoding \(z_{ij}^t = \text{concat}(z_i^t, z_j^t)\), paired with concatenated depth maps and a grid prompt.
    • After each denoising step, each node extracts the versions of its own latent from all adjacent edges and averages them.
    • Scalability: the graph is fully connected for \(N \leq 5\); beyond that, node degree is capped at 4 (randomly sampled neighbors), reducing the edge count from quadratic to linear in \(N\).
  2. Multi-view Feature Fusion (MFF):

    • Core Finding: The cosine similarity between features at matched locations is strongly correlated with visual consistency.
    • Pairwise fusion: \(\mathbf{f}_i[\mathbf{c}] \leftarrow \frac{1}{2}(\mathbf{f}_i[\mathbf{c}] + \mathbf{f}_j[M_{ij}(\mathbf{c})])\) for all matched coordinates \(\mathbf{c} \in \mathcal{C}_i\).
    • Extension to \(N\) images: features are first averaged across adjacent edges \(\bar{\mathbf{f}}_i = \frac{1}{|\delta(i)|}\sum_{e \in \delta(i)} \mathbf{f}_i^e\), then fused across all images.
    • Applied to K,V feature maps at selected layers of the DiT.
  3. Feature Guidance:

    • Matching feature distance objective: \(L_{guide} = \frac{1}{|E|}\sum_{\{i,j\}\in E}\frac{1}{|M_{ij}|}\sum_{\mathbf{c}\in M_{ij}}\|\mathbf{f}_i[\mathbf{c}] - \mathbf{f}_j[M_{ij}(\mathbf{c})]\|_2\)
    • Gradients with respect to \(z_i^{t-1}\) are computed for light refinement in latent space.
    • MFF can be interpreted as the analytical solution to this objective; Guidance corrects residual inconsistencies.
    • Gradient propagation through the model provides a wider receptive field, yielding robustness to sparse matches.
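The pairwise MFF update and the guidance objective above can be sketched on toy features. The arrays below are stand-ins for K/V feature maps inside the DiT, and the coordinate arrays play the role of the match \(M_{ij}\); everything about shapes and indexing here is illustrative, not the actual implementation.

```python
import numpy as np

def mff_pair(f_i, f_j, coords_i, coords_j):
    """MFF for one edge: average image i's matched features with image j's."""
    fused = f_i.copy()
    fused[coords_i] = 0.5 * (f_i[coords_i] + f_j[coords_j])
    return fused

def guidance_loss(f_i, f_j, coords_i, coords_j):
    """L_guide restricted to one edge: mean L2 distance at matched locations."""
    diff = f_i[coords_i] - f_j[coords_j]
    return np.linalg.norm(diff, axis=-1).mean()

# Two 4-token feature maps of dim 3; tokens 0,1 of image i match tokens 2,3 of j.
f_i = np.arange(12, dtype=float).reshape(4, 3)
f_j = f_i[::-1].copy()
ci, cj = np.array([0, 1]), np.array([2, 3])

before = guidance_loss(f_i, f_j, ci, cj)
f_i2 = mff_pair(f_i, f_j, ci, cj)
f_j2 = mff_pair(f_j, f_i, cj, ci)
after = guidance_loss(f_i2, f_j2, ci, cj)
# Symmetric fusion makes matched features identical, so the matched-distance
# objective drops to zero: fusion is its analytic minimizer, as the paper notes.
```

This also makes the complementarity concrete: fusion zeroes the objective exactly at matched locations, while gradient-based guidance can still influence unmatched regions through the model.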

Input Correspondences

Dense 2D matches are computed using RoMA; shared regions are identified automatically via confidence filtering, requiring no manual masks.
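The confidence-based filtering can be sketched as below. The array layout (one `(x_i, y_i, x_j, y_j)` row per correspondence plus a per-match confidence) is a hypothetical simplification; RoMA's actual output format differs, and the threshold value is illustrative.

```python
import numpy as np

def filter_matches(matches, conf, threshold=0.8):
    """Keep only correspondences whose matcher confidence exceeds the threshold;
    the surviving matches implicitly delineate the shared regions."""
    keep = conf > threshold
    return matches[keep]

rng = np.random.default_rng(0)
matches = rng.integers(0, 64, size=(100, 4))  # (x_i, y_i, x_j, y_j) per match
conf = rng.random(100)                        # stand-in matcher confidences
shared = filter_matches(matches, conf, threshold=0.8)
```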

Key Experimental Results

Main Results: Consistency and Prompt Adherence

| Method | CLIP Score ↑ | DreamSim ↑ | DINO-MatchSim ↑ |
| --- | --- | --- | --- |
| FLUX Kontext | 0.65 | 0.78 | 0.57 |
| IC-LoRA | 0.65 | 0.71 | 0.65 |
| FLUX | 0.67 | 0.76 | 0.66 |
| Edicho | 0.65 | 0.81 | 0.72 |
| Match-and-Fuse | 0.66 | 0.85 | 0.80 |
| w/o Guidance | 0.66 | 0.82 | 0.76 |
| w/o MFF | 0.66 | 0.83 | 0.78 |
| w/o Pairwise Graph | 0.66 | 0.82 | 0.75 |

User Study & VLM Evaluation (2AFC, Win Rate of Ours)

| Baseline | User Preference ↑ | VLM Preference ↑ |
| --- | --- | --- |
| vs Kontext | 88% | 82% |
| vs IC-LoRA | 90% | 92% |
| vs FLUX | 92% | 94% |
| vs Edicho | 83% | 78% |

Metric Alignment with Human Judgment

| Metric | Agreement with Humans ↑ |
| --- | --- |
| DreamSim | 84.3% |
| VLM | 84.9% |
| DINO-MatchSim | 91.4% |

Key Findings

  • DINO-MatchSim of 0.80 substantially surpasses the strongest baseline Edicho at 0.72 (+11.1%).
  • All three components (Graph, MFF, Guidance) are individually necessary.
  • Match-and-Fuse with 9 images achieves higher consistency than baselines operating on only 2 images.
  • Even when matches are as sparse as 10%, DINO-MatchSim remains above 0.76, demonstrating strong robustness.
  • DINO-MatchSim achieves 91.4% alignment with human judgment, substantially exceeding DreamSim at 84.3%.

Highlights & Insights

  • First set-to-set generation method: Extends generative AI to image collections as a fundamental visual unit.
  • Elegant graph formulation: The pairwise consistency graph enables local pairwise operations with global information propagation, and \(O(N^2)\) complexity can be sparsified to \(O(N)\).
  • Discovery and exploitation of the grid prior: The spontaneous consistency emerging from T2I models under grid layouts is a key insight.
  • DINO-MatchSim metric: Matched points in source images are used to localize corresponding positions in generated images, enabling patch-level similarity measurement that is more discriminative than global metrics.
  • Entirely training-free, zero-shot, and requires no manual masks.
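The core of a DINO-MatchSim-style score can be sketched as follows: localize matched patches in the generated images via the source-image correspondences, then average cosine similarity between patch features. The random arrays stand in for DINO patch embeddings, and the indexing scheme is an assumption; the paper's exact extraction pipeline is not reproduced here.

```python
import numpy as np

def matchsim(feat_a, feat_b, coords_a, coords_b):
    """Mean cosine similarity between patch features at matched locations."""
    a = feat_a[coords_a]
    b = feat_b[coords_b]
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float((a * b).sum(axis=-1).mean())

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))  # 16 patches, feature dim 8 (toy sizes)
ca = cb = np.arange(4)                # matched patch indices
# Identical features at every matched patch yield the maximum score of 1.0.
score = matchsim(feats, feats.copy(), ca, cb)
```

Because the score is computed patch-by-patch at matched locations rather than over whole images, it penalizes local appearance drift that global metrics like CLIP Score average away.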

Limitations & Future Work

  • Relies on the quality of dense correspondences; regions with few or no matches may remain inconsistent.
  • Depends on the base model's adherence to depth-conditioned inputs.
  • FlowEdit integration requires per-edit hyperparameter tuning.
  • \(O(N^2)\) edge count still incurs overhead for large collections, despite sparsification.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to define and address set-to-set consistent generation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Quantitative evaluation + user study + VLM assessment + novel metric + ablation + extended applications.
  • Writing Quality: ⭐⭐⭐⭐⭐ Polished figures, elegant formulations, and clear problem definition.
  • Value: ⭐⭐⭐⭐ Applicable to creative workflows such as product advertising, character design, and storyboarding.