
GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion

Conference: NeurIPS 2025 arXiv: 2510.03110 Code: N/A Area: 3D Vision Keywords: image completion, diffusion models, geometry guidance, point cloud projection, reference image

TL;DR

This paper proposes GeoComplete, which injects projected point clouds as geometric conditions into a dual-branch diffusion model and employs a target-aware masking strategy, yielding geometrically consistent reference-driven image completion and a 17.1% relative PSNR improvement over RealFill.

Background & Motivation

Reference-driven image completion leverages images from other viewpoints of the same scene to restore missing regions in a target image. This task is particularly challenging when the target and reference viewpoints differ significantly.

Limitations of existing methods:

Traditional geometry-based methods (TransFill, GeoFill): adopt a sequential pipeline of pose estimation → depth reconstruction → 3D warping → patch blending → image harmonization, where early-stage errors cascade and amplify, leading to failures under occlusion, dynamic content, or ambiguous geometry.

Generative methods (RealFill): fine-tune diffusion models via LoRA to directly synthesize missing regions, but lack geometric cues (e.g., camera pose, depth), causing hallucinated structures or misaligned content when viewpoint differences are large.

Key Challenge: There is a fundamental tension between generative capacity (handling complex scenes) and geometric consistency (maintaining spatial alignment).

Method

Overall Architecture

GeoComplete consists of three core components:

  1. Point Cloud Generation Module: estimates camera parameters and depth maps from reference and target images, constructs a 3D point cloud, and projects it.
  2. Dual-Branch Diffusion Model: a target branch processes the masked image while a cloud branch processes the projected point cloud; the two branches are fused via joint self-attention.
  3. Target-aware Masking: guides the model to focus on regions in the reference images that are informative for the target viewpoint.

Key Designs

1. Point Cloud Generation

Dynamic Object Filtering: LangSAM (SAM 2.1-Large with text prompts) is used to segment and remove dynamic regions (e.g., pedestrians, vehicles). Text prompts can be user-provided or automatically generated by an LLM.
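As a concrete illustration, here is a minimal sketch of this filtering step using the open-source lang_sam package. The predict signature below follows the package's original interface and is an assumption: the SAM 2.1-Large wrapper the paper uses may expose a different API, and the prompt string and file name are placeholders.

```python
import numpy as np
from PIL import Image
from lang_sam import LangSAM  # pip install lang-sam

model = LangSAM()
image = Image.open("reference_view.jpg").convert("RGB")

# Text prompts name the dynamic categories to remove; they can be
# user-provided or, as in the paper, generated by an LLM.
masks, boxes, phrases, logits = model.predict(image, "pedestrian. vehicle.")

# Union of all instance masks marks pixels excluded from geometry estimation.
dynamic = np.zeros(image.size[::-1], dtype=bool)  # (H, W)
for m in masks:
    dynamic |= np.asarray(m, dtype=bool)
static_rgb = np.asarray(image) * ~dynamic[..., None]
```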

Geometry Estimation: VGGT (Visual Geometry Grounded Transformer) jointly predicts, in a single forward pass:

  • Camera parameters \(\{\mathbf{c}_i^{ref}\}\) and \(\mathbf{c}^{tar}\)
  • Depth maps \(\{\mathbf{d}_i^{ref}\}\) and \(\mathbf{d}^{tar}\)

VGGT eliminates the multi-stage error accumulation of traditional pipelines.

Point Cloud Projection: For each reference image, a point cloud is constructed from other views and projected:

\[\mathbf{p}_i^{ref} = \pi(\pi^{-1}(\{\mathbf{d}_j^{ref}, \mathbf{c}_j^{ref} | j \neq i\} \cup \{\mathbf{d}^{tar}, \mathbf{c}^{tar}\}), \mathbf{c}_i^{ref})\]

The projected point cloud for the target view: \(\mathbf{p}^{tar} = \pi(\pi^{-1}(\{\mathbf{d}_j^{ref}, \mathbf{c}_j^{ref} | \forall j\}), \mathbf{c}^{tar})\)
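To make the \(\pi\) / \(\pi^{-1}\) notation concrete, here is a minimal numpy sketch of pinhole unprojection and projection. The parameterization is ours (a 3×3 intrinsic matrix K plus 4×4 camera/world transforms); the summary does not specify VGGT's camera format, and the paper's projected point cloud additionally carries per-point colors, which this geometry-only sketch omits.

```python
import numpy as np

def unproject(depth, K, T_c2w):
    """pi^{-1}: lift a depth map (H, W) to world-space points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)   # camera space
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    return (T_c2w @ cam_h.T).T[:, :3]                           # world space

def project(points, K, T_w2c, H, W):
    """pi: splat world-space points into a view; returns a sparse depth map."""
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    cam = (T_w2c @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    valid = z > 1e-6                          # keep points in front of the camera
    uv = (K @ cam[valid].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    canvas = np.full((H, W), np.inf)
    # Nearest point wins: a crude z-buffer via np.minimum.at.
    np.minimum.at(canvas, (v[keep], u[keep]), z[valid][keep])
    return canvas
```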

2. Target-aware Masking (Core Innovation)

The target view is projected onto each reference view to identify informative regions (visible in the reference but missing in the target) and redundant regions (already visible in the target).

Conditional reference masking (guiding the model to learn from complementary content):

\[\hat{\mathbf{x}}_i^{ref} = \mathbf{x}_i^{ref} \odot ((1 - \mathbf{r}_i^{ref}) + \mathbf{r}_i^{ref} \odot \mathbf{m}_i^{rand})\]

Redundant regions are preserved while informative regions are randomly masked, driving the model to learn to complete complementary information.

Conditional cloud masking (guiding the model to leverage geometric cues):

\[\hat{\mathbf{p}}_i^{ref} = \mathbf{p}_i^{ref} \odot \mathbf{m}_i^{point} + v_{fill} \times (1 - \mathbf{m}_i^{point})\]

Geometric information in informative regions is preserved while redundant regions are randomly masked, causing the model to rely on geometric cues where visual information is absent.

The complementary design of the two masking strategies is elegant: the reference image masks informative regions → the model must learn from geometry; the point cloud retains informative regions → geometric guidance is provided.
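The two equations transcribe directly into code. In the numpy sketch below, shapes and names are ours: x_ref and p_ref are (H, W, C) image and point-cloud renderings, r_ref is an (H, W, 1) binary map of informative regions, and the random masks are sampled per pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_reference(x_ref, r_ref, p_keep=0.5):
    """Reference masking: keep redundant regions, randomly drop informative ones."""
    m_rand = (rng.random(r_ref.shape) < p_keep).astype(x_ref.dtype)
    return x_ref * ((1 - r_ref) + r_ref * m_rand)

def mask_cloud(p_ref, m_point, v_fill=0.0):
    """Cloud masking: keep geometry where m_point = 1 (informative regions
    plus a random subset of redundant ones), fill the rest with v_fill."""
    return p_ref * m_point + v_fill * (1 - m_point)
```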

3. Dual-Branch Diffusion Model

Built upon Stable Diffusion 2 Inpainting and fine-tuned via LoRA (rank=8):

  • Target Branch: encodes the masked target image and generates the missing region.
  • Cloud Branch: encodes the projected point cloud to provide geometric guidance.

Joint Self-Attention: the latent features of both branches are concatenated as \(\mathbf{h}_{cat} \in \mathbb{R}^{2L \times d}\), and a controlled attention mask is applied:

  1. Tokens within each branch can attend to one another.
  2. Each token in the target branch can attend to the spatially corresponding token in the cloud branch.
  3. All other cross-branch interactions are blocked.

This design ensures that masked tokens in the target branch (which lack meaningful visual information) can directly receive geometric guidance from the corresponding spatial positions.
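A small PyTorch sketch of what such a controlled mask could look like for the concatenated sequence of \(2L\) tokens; this is our construction, inferred from the three rules above, with True meaning attention is allowed.

```python
import torch

def joint_attention_mask(L: int) -> torch.Tensor:
    allow = torch.zeros(2 * L, 2 * L, dtype=torch.bool)
    allow[:L, :L] = True        # rule 1: target tokens attend within their branch
    allow[L:, L:] = True        # rule 1: cloud tokens attend within their branch
    idx = torch.arange(L)
    allow[idx, L + idx] = True  # rule 2: target_i -> cloud_i (same spatial position)
    return allow                # rule 3: all other cross-branch pairs stay False

# Usable as, e.g.:
# torch.nn.functional.scaled_dot_product_attention(q, k, v,
#     attn_mask=joint_attention_mask(L))
```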

Loss & Training

Diffusion loss:

\[\mathcal{L} = \frac{1}{B} \sum_{j=1}^{B} \mathbb{E}_{t,\epsilon}\left[\|\mathbf{w}_j \cdot (\epsilon - \epsilon_\theta(\mathbf{x}_j(t), t, \hat{\mathbf{p}}_j, \hat{\mathbf{x}}_j))\|_2^2\right]\]

\(\mathbf{w}_j\) is the valid-region weight; the loss is computed only over visible regions.
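In code, the objective amounts to a masked MSE over the predicted noise. A minimal PyTorch sketch, where the tensor shapes are our assumption:

```python
import torch

def masked_diffusion_loss(eps: torch.Tensor, eps_pred: torch.Tensor,
                          w: torch.Tensor) -> torch.Tensor:
    """eps, eps_pred: (B, C, H, W) true / predicted noise;
    w: (B, 1, H, W) validity weights (1 = visible region, 0 = masked)."""
    sq = (w * (eps - eps_pred)) ** 2      # zero out masked positions
    return sq.sum(dim=(1, 2, 3)).mean()   # ||.||_2^2 per sample, mean over batch
```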

  • Per-scene fine-tuning: 2000 iterations, batch size 16.
  • LoRA rank=8; training images resized to 512×512.
  • VGGT input: 518×518 (center-cropped).
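For reference, a hedged sketch of how such a LoRA setup could look with the diffusers and peft libraries. The rank matches the paper; lora_alpha and the target modules are our assumptions (the usual attention projections of SD UNets), as the summary does not specify them.

```python
from diffusers import StableDiffusionInpaintPipeline
from peft import LoraConfig

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting"
)
lora = LoraConfig(
    r=8,                      # rank used in the paper
    lora_alpha=8,             # assumption; the paper does not state alpha
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # common SD UNet choice
)
pipe.unet.add_adapter(lora)   # only the LoRA weights are trained per scene
```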

Key Experimental Results

Main Results

RealBench dataset (33 scenes):

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | DreamSim↓ | DINO↑ | CLIP↑ |
|---|---|---|---|---|---|---|
| SD Inpaint | 10.63 | 0.282 | 0.605 | 0.213 | 0.831 | 0.874 |
| Generative Fill | 10.92 | 0.311 | 0.598 | 0.212 | 0.851 | 0.898 |
| Paint-by-Example | 10.13 | 0.244 | 0.642 | 0.237 | 0.797 | 0.859 |
| TransFill | 13.28 | 0.404 | 0.542 | 0.192 | 0.860 | 0.866 |
| RealFill | 14.78 | 0.424 | 0.431 | 0.077 | 0.948 | 0.962 |
| GeoComplete | 17.32 | 0.578 | 0.197 | 0.036 | 0.986 | 0.987 |

User study (QualBench, 25 scenes, 1–5 scale): GeoComplete 4.61 vs. RealFill 3.98.

Ablation Study

| Dual-Branch | Joint Self-Attn | Target-aware | PSNR↑ | SSIM↑ | LPIPS↓ | DINO↑ |
|---|---|---|---|---|---|---|
|  |  |  | 14.78 | 0.424 | 0.431 | 0.948 |
| ✓ |  |  | 16.37 | 0.555 | 0.237 | 0.981 |
| ✓ | ✓ |  | 16.85 | 0.564 | 0.219 | 0.983 |
| ✓ | ✓ | ✓ | 17.32 | 0.578 | 0.197 | 0.986 |

Robustness test (point cloud noise / sparsity / LangSAM errors); the table below reports PSNR (dB) under increasing point-cloud noise, where CM = conditional masking and JSA = joint self-attention:

| Method | 0% Noise | 25% Noise | 50% Noise | 75% Noise |
|---|---|---|---|---|
| RealFill | 14.78 | 14.78 | 14.78 | 14.78 |
| Ours w/o CM & JSA | 16.37 | 14.60 | 14.51 | 14.35 |
| Ours | 17.32 | 17.14 | 17.03 | 16.90 |

Key Findings

  1. GeoComplete outperforms RealFill by 2.54 dB in PSNR (17.1% relative gain) and reduces LPIPS by 0.234.
  2. Each component contributes independently: dual-branch +1.59 PSNR, joint attention +0.48, target-aware masking +0.47.
  3. Conditional cloud masking and joint self-attention confer strong robustness to point cloud noise (still surpassing RealFill by 2.12 dB at 75% noise).
  4. Explicit 3D geometric priors, rather than purely generative capacity, are the key to maintaining spatial consistency.

Highlights & Insights

  • Elegant complementary masking design: reference images mask informative regions while point clouds retain them, forming a perfect complement.
  • Controlled attention mechanism: token-level cross-branch connections ensure precise transfer of geometric information to corresponding spatial positions.
  • Robustness by design: the conditional masking training strategy naturally endows the model with robustness to upstream estimation errors.
  • End-to-end geometry estimation: VGGT + LangSAM replaces the traditional multi-stage pipeline, eliminating cascading error accumulation.

Limitations & Future Work

  1. Per-scene fine-tuning (2000 iterations) is required, precluding zero-shot generalization.
  2. The method depends on the quality of VGGT's geometric estimation and may degrade in extreme dynamic scenes.
  3. VGGT input is limited to 518×518, requiring downsampling for high-resolution scenes.
  4. Only static geometry is handled; dynamic content is removed by LangSAM rather than reconstructed.
  5. Temporal consistency in video completion settings remains unexplored.

Related Work

  • RealFill: the primary baseline, performing reference-driven completion via LoRA fine-tuning of diffusion models but lacking geometric awareness.
  • VGGT: a Transformer that jointly predicts camera parameters, depth maps, and point clouds in a unified manner, replacing traditional multi-stage estimation.
  • TransFill: a representative traditional geometry-based method relying on a sequential pipeline.
  • Insight: injecting explicit 3D geometric priors into generative models is an effective path to reconciling generative capacity with spatial consistency.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of dual-branch diffusion, complementary masking, and geometry injection is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of quantitative, qualitative, ablation, and robustness evaluations.
  • Writing Quality: ⭐⭐⭐⭐ Technical descriptions are clear and mathematical derivations are complete.
  • Value: ⭐⭐⭐⭐ A 17.1% PSNR improvement represents a significant practical advance.