RI3D: Few-Shot Gaussian Splatting with Repair and Inpainting Diffusion Priors
Paper Information¶
- Conference: ICCV 2025
- arXiv: 2503.10860
- Authors: Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari
- Institutions: Texas A&M University, Meta Reality Labs, Max Planck Institute for Informatics
- Code: Project Page
- Area: 3D Vision / Sparse-View Synthesis
- Keywords: 3D Gaussian Splatting, diffusion model priors, sparse-view reconstruction, inpainting, few-shot
TL;DR¶
RI3D decomposes sparse-view synthesis into two sub-tasks — repairing visible regions and completing missing regions — and introduces two personalized diffusion models (repair + inpainting) combined with a two-stage optimization strategy to achieve high-quality 3DGS reconstruction under extremely sparse inputs.
Background & Motivation¶
Problem Definition¶
Novel view synthesis from sparse inputs (e.g., only 3 images) is an extremely challenging task. Existing methods face two core problems:
Overfitting on visible regions: Optimization constrained by only a few input images leads to severe artifacts in rendered novel views.
Failure to recover missing regions: Occluded or uncovered regions typically yield only blurry or dark results.
Limitations of Prior Work¶
- Regularization-based methods (DNGaussian, FSGS, CoR-GS, etc.): Apply depth supervision, densification strategies, and other regularization constraints, but still fail to hallucinate fine details in missing regions under extremely sparse settings.
- Diffusion model-based methods (ReconFusion, CAT3D): Train view-synthesis diffusion models to generate novel views, but the generated results lack 3D consistency, causing over-blurred optimization outcomes; additionally, these methods rely on NeRF representations with slow rendering speeds.
Core Idea¶
The key insight is to decouple the view synthesis process into two independent sub-tasks — repairing visible regions and completing missing regions — each handled by a dedicated diffusion model, thereby avoiding the difficulty of a single model simultaneously addressing both objectives.
Method¶
Overall Architecture¶
RI3D consists of three core components: 1. High-quality depth initialization: Fuses DUSt3R and monocular depth to obtain per-view depth maps. 2. Two personalized diffusion models: A repair model for correcting rendering artifacts, and an inpainting model for filling missing regions. 3. Two-stage optimization: Stage 1 reconstructs visible regions; Stage 2 completes missing regions.
1. 3D Gaussian Initialization¶
The core idea is to exploit the complementary strengths of DUSt3R depth (3D-consistent but smooth) and monocular depth (detail-rich but relative and inconsistent):
- First term: enforces depth consistency in high-confidence DUSt3R regions \(\mathbf{M}\).
- Second term: preserves monocular depth gradients (edge details) globally, with \(\lambda=10\).
- Analogous to Poisson Blending but with gradient constraints applied across all regions.
- Solved efficiently via a sparse matrix solver, followed by bilateral filtering to sharpen boundaries.
- Each pixel is assigned one Gaussian and projected into 3D space.
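Reading the two terms above together, the fusion amounts to solving a screened Poisson-style least-squares problem (notation ours; the paper's exact weighting may differ in detail):

$$\min_{\mathbf{D}} \; \|\mathbf{M} \odot (\mathbf{D} - \mathbf{D}_{\text{DUSt3R}})\|_2^2 \;+\; \lambda \, \|\nabla \mathbf{D} - \nabla \mathbf{D}_{\text{mono}}\|_2^2, \qquad \lambda = 10.$$

A minimal sketch of the sparse solve under that assumed objective (function names and grid construction are ours, not the paper's):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def fuse_depth(d_dust, d_mono, mask, lam=10.0):
    """Fuse DUSt3R depth (3D-consistent) with monocular depth (detailed).

    Solves  min_d  sum_p mask_p (d_p - d_dust_p)^2
                 + lam * ||grad(d) - grad(d_mono)||^2
    via the sparse normal equations (diag(mask) + lam * G^T G) d = rhs.
    """
    h, w = d_dust.shape
    n = h * w

    # Forward-difference gradient operator G (horizontal + vertical).
    rows, cols, vals, r = [], [], [], 0
    for i in range(h):
        for j in range(w):
            p = i * w + j
            if j + 1 < w:  # horizontal difference
                rows += [r, r]; cols += [p, p + 1]; vals += [-1.0, 1.0]; r += 1
            if i + 1 < h:  # vertical difference
                rows += [r, r]; cols += [p, p + w]; vals += [-1.0, 1.0]; r += 1
    G = sp.csr_matrix((vals, (rows, cols)), shape=(r, n))

    M = sp.diags(mask.ravel().astype(float))   # high-confidence DUSt3R mask
    A = M + lam * (G.T @ G)
    b = M @ d_dust.ravel() + lam * (G.T @ (G @ d_mono.ravel()))
    return spsolve(A.tocsc(), b).reshape(h, w)
```

The data term anchors absolute scale in high-confidence regions, while the gradient term copies monocular edges everywhere, which matches the "Poisson blending with global gradient constraints" description above. The paper additionally applies bilateral filtering afterward, which this sketch omits.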
2. Repair Diffusion Model¶
- Fine-tuned from a pretrained ControlNet.
- Training data generation: A leave-one-out strategy constructs \(N\) subsets, each omitting one input image, and trains \(N\) 3DGS models. Each model is first optimized on its subset for 6,000 steps; the excluded image is then reintroduced and optimization continues to 10,000 steps, yielding progressive "corrupted"–"clean" image pairs.
- Fine-tuned for 1,800 steps on the target scene for scene personalization.
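The leave-one-out schedule can be sketched as follows (a plan-building helper of our own; the actual 3DGS training calls are omitted):

```python
def leave_one_out_schedule(views, warmup_steps=6000, total_steps=10000):
    """Describe the N leave-one-out runs that produce repair-model
    training pairs.

    For each held-out view: train 3DGS on the remaining views for
    `warmup_steps` (renders of the held-out view are the 'corrupted'
    images), then add the held-out view back and continue to
    `total_steps` (its renders become progressively 'cleaner' targets).
    """
    runs = []
    for i, held_out in enumerate(views):
        subset = views[:i] + views[i + 1:]
        runs.append({
            "held_out": held_out,
            "subset": subset,
            "phase1": (0, warmup_steps),           # subset only -> corrupted renders
            "phase2": (warmup_steps, total_steps), # all views -> clean renders
        })
    return runs
```

Because renders are collected throughout phase 2, each held-out view yields a spectrum of degradation levels rather than a single pair, which is what makes the repair model robust to artifacts of varying severity.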
3. Inpainting Diffusion Model¶
- Based on the Stable Diffusion Inpainting model.
- Fine-tuned on input images with randomly generated masks to produce input–output pairs (similar to RealFill).
- Fine-tuned for 2,000 steps for scene personalization.
- Both models operate at \(512\times512\) resolution.
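A RealFill-style random mask generator might look like the following (rectangle counts and size fractions are illustrative guesses, not the paper's settings):

```python
import numpy as np

def random_mask(rng, size=512, n_rects=(1, 4), rect_frac=(0.1, 0.4)):
    """Generate a binary inpainting mask (1 = region to fill) by
    stamping a few random rectangles onto a blank canvas.

    Masked input images paired with the originals give the
    input-output pairs used to personalize the inpainting model.
    """
    mask = np.zeros((size, size), dtype=np.uint8)
    for _ in range(rng.integers(n_rects[0], n_rects[1] + 1)):
        h = int(rng.uniform(*rect_frac) * size)
        w = int(rng.uniform(*rect_frac) * size)
        y = rng.integers(0, size - h)
        x = rng.integers(0, size - w)
        mask[y:y + h, x:x + w] = 1
    return mask
```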
4. Two-Stage Optimization¶
Stage 1: Reconstructing Visible Regions

$$\mathcal{L}_{\text{stage1}} = \sum_{i=1}^{N} \mathcal{L}_{\text{rec}}(\hat{\mathbf{I}}_i^{\text{ref}}, \mathbf{I}_i^{\text{ref}}) + \sum_{j=1}^{M} \lambda_j \mathbf{M}_j^{\alpha} \mathcal{L}_{\text{rec}}(\hat{\mathbf{I}}_j^{\text{nov}}, \mathbf{G}_j^{\text{nov}}) + \sum_{j=1}^{M} \|\mathbf{A}_j \odot (1-\mathbf{M}_j^{\alpha}) \odot \mathbf{M}_j^{b}\|_1$$
- First term: reconstruction loss on input views (L1 + SSIM + LPIPS + depth correlation).
- Second term: supervision of \(M\) novel views using pseudo ground truth from the repair model, applied only in visible regions (\(\mathbf{M}^{\alpha}\)).
- Third term: encourages opacity in missing regions to approach zero, preventing visible Gaussians from being placed there.
- Repair results are refreshed every 400 steps; total optimization: 4,000 steps.
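The three terms can be sketched numerically as below, with a plain L1 loss standing in for the full L1 + SSIM + LPIPS + depth-correlation reconstruction loss (function signature and the role of \(\mathbf{M}^{b}\) as a per-view auxiliary mask are our reading of the formula):

```python
import numpy as np

def l1(a, b, mask=None):
    d = np.abs(a - b)
    return (d * mask).mean() if mask is not None else d.mean()

def stage1_loss(ref_renders, ref_gt, nov_renders, nov_repaired,
                vis_masks, aux_masks, alphas, lambdas):
    """Stage-1 objective sketch.

    - input views: reconstruction loss against ground truth
    - novel views: loss against repair-model pseudo GT, restricted
      to visible regions (vis_mask = M^alpha)
    - opacity term: pushes accumulated alpha A toward 0 inside the
      missing regions (1 - M^alpha) gated by M^b, so no visible
      Gaussians are placed there
    """
    loss = sum(l1(r, g) for r, g in zip(ref_renders, ref_gt))
    loss += sum(lam * l1(r, g, m)
                for lam, r, g, m in zip(lambdas, nov_renders,
                                        nov_repaired, vis_masks))
    loss += sum(np.abs(a * (1 - m) * b).mean()
                for a, m, b in zip(alphas, vis_masks, aux_masks))
    return loss
```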
Stage 2: Completing Missing Regions

$$\mathcal{L}_{\text{stage2}} = \sum_{i=1}^{N} \mathcal{L}_{\text{rec}}(\hat{\mathbf{I}}_i^{\text{ref}}, \mathbf{I}_i^{\text{ref}}) + \sum_{j=1}^{M} \lambda_j \mathcal{L}_{\text{rec}}(\hat{\mathbf{I}}_j^{\text{nov}}, \mathbf{G}_j^{\text{nov}}) + \sum_{k=1}^{K} (1-\mathbf{M}_k^{\alpha}) \odot \mathbf{M}_k^{b} \odot L_p(\hat{\mathbf{I}}_k^{\text{nov}}, \hat{\mathbf{L}}_k^{\text{nov}})$$
- \(K<M\) non-overlapping novel views are selected for inpainting to avoid inconsistencies from independently completing overlapping content.
- Completed regions are projected into 3D space via monocular depth (using Eq. 2 to fuse depth ranges).
- Third term constrains consistency between rendered results and inpainted results in missing regions.
- One round of inpainting + repair is performed every 200 steps, iterating until all missing regions are filled; total optimization: 4,000 steps.
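The idea of selecting \(K < M\) views with non-overlapping missing regions can be sketched greedily as below. This is our illustration only: the paper measures overlap of scene content across viewpoints, whereas this toy version assumes the missing-region masks have already been brought into a shared reference frame so overlap reduces to a pixelwise test.

```python
import numpy as np

def select_nonoverlapping(miss_masks, max_views=None):
    """Greedily pick views whose missing regions do not overlap,
    so each hole is inpainted exactly once.

    miss_masks: binary HxW arrays (1 = missing region), assumed
    aligned in a common reference frame. Largest holes are tried
    first; a view is kept only if its hole is disjoint from all
    holes already covered.
    """
    order = sorted(range(len(miss_masks)),
                   key=lambda i: -miss_masks[i].sum())
    covered = np.zeros_like(miss_masks[0])
    chosen = []
    for i in order:
        if (miss_masks[i] * covered).sum() == 0:  # disjoint from picks
            chosen.append(i)
            covered = np.maximum(covered, miss_masks[i])
            if max_views and len(chosen) == max_views:
                break
    return chosen
```

Selecting disjoint regions first, then alternating inpainting and repair every 200 steps, avoids the inconsistency that would arise from two diffusion samples independently hallucinating the same surface.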
Key Experimental Results¶
Main Results: Mip-NeRF 360 Dataset¶
| Method | 3-view PSNR↑ | 3-view SSIM↑ | 3-view LPIPS↓ | 9-view PSNR↑ | 9-view LPIPS↓ |
|---|---|---|---|---|---|
| DNGaussian | 12.02 | 0.226 | 0.665 | 12.97 | 0.637 |
| FSGS | 13.14 | 0.288 | 0.578 | 16.00 | 0.470 |
| CoR-GS | 13.51 | 0.314 | 0.633 | 15.48 | 0.574 |
| ReconFusion | 15.50 | 0.358 | 0.585 | 18.19 | 0.511 |
| CAT3D | 16.62 | 0.377 | 0.515 | 18.67 | 0.460 |
| RI3D (Ours) | 15.74 | 0.342 | 0.505 | 17.48 | 0.415 |
- RI3D achieves the best LPIPS scores across all settings, indicating the richest synthesized texture details.
- PSNR/SSIM are slightly lower than ReconFusion and CAT3D; those methods produce blurry outputs in missing regions, and pixel-level metrics reward such blur over sharp but imperfectly aligned detail.
Ablation Study: 3-view Mip-NeRF 360¶
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| w/o depth enhancement | 15.61 | 0.336 | 0.513 |
| Single-stage (repair only) | 14.92 | 0.309 | 0.527 |
| w/o fine-tuned inpainting | 15.64 | 0.341 | 0.508 |
| RI3D (Full) | 15.74 | 0.342 | 0.505 |
Key Findings¶
- Depth fusion is critical: Removing depth enhancement degrades LPIPS from 0.505 to 0.513; inaccurate DUSt3R depth in low-confidence regions causes floating artifacts.
- Two-stage optimization is necessary: Using only the repair model in a single stage lowers PSNR by ~0.8 dB and worsens LPIPS from 0.505 to 0.527, as the repair model is not designed to hallucinate high-quality textures in missing regions.
- Personalized fine-tuning improves consistency: Without fine-tuning the inpainting model, generated textures may be visually plausible but stylistically inconsistent with the surrounding scene (e.g., gray vegetation, leaf-textured floors).
- RI3D also achieves the best LPIPS on the CO3D dataset, particularly outperforming ReconFusion in the 9-view setting.
Highlights & Insights¶
- Task decomposition design: Decomposing the challenges of sparse-view synthesis into two independent sub-tasks, each handled by a specialized diffusion model, proves more effective than a single all-purpose model.
- Depth fusion strategy: The Poisson-blending-style fusion of DUSt3R and monocular depth is an elegant design that simultaneously ensures 3D consistency and preserves fine details.
- Personalized diffusion models: Fine-tuning diffusion models on the target scene (repair: 1,800 steps; inpainting: 2,000 steps) ensures that generated results are stylistically consistent with the scene.
- Fast rendering: Adopting 3DGS rather than NeRF as the 3D representation balances quality and efficiency.
Limitations & Future Work¶
- Strong dependence on DUSt3R depth quality; severely inaccurate depth estimates lead to ghosting artifacts.
- Cannot handle single-image inputs, as the leave-one-out strategy requires at least 2 images.
- Fine-tuning two diffusion models increases computational cost.
- Training requires multiple GPUs (Stage 2 requires two A5000 GPUs).
Related Work & Insights¶
- vs. ReconFusion/CAT3D: The key improvement of RI3D is decoupling "repair" from "completion," rather than using a single diffusion model for view synthesis.
- vs. GaussianObject: Borrows its ControlNet-based repair strategy but extends it from object-level to scene-level reconstruction.
- Depth fusion: Inspired by Poisson Image Editing; future work could replace DUSt3R with higher-quality MVS methods.
- Broader implication: The paradigm of task decomposition with specialized models has wide transfer value and is applicable to other 3D reconstruction problems with sparse inputs.
Rating ⭐⭐⭐⭐¶
The method is elegantly designed, with original contributions in depth fusion and two-stage optimization. RI3D achieves consistently superior LPIPS scores backed by thorough experiments. However, dependence on DUSt3R and multi-GPU requirements remain practical bottlenecks.