Decompositional Neural Scene Reconstruction with Generative Diffusion Prior¶

Conference: CVPR 2025
arXiv: 2503.14830
Code: Project Page
Area: 3D Vision / Scene Reconstruction
Keywords: Decompositional Scene Reconstruction, Diffusion Prior, SDS Loss, Visibility Guidance, Sparse-View Reconstruction

TL;DR¶

Proposes DP-Recon, which brings a generative diffusion prior into decompositional neural scene reconstruction through Score Distillation Sampling (SDS). Visibility guidance adjusts the SDS weight per pixel, which resolves the conflict between the reconstruction objective and the generative guidance and recovers complete object geometry and appearance under sparse views.

Background & Motivation¶

Decompositional 3D scene reconstruction aims to separate a scene into individual objects, which is crucial for embodied AI, robotics, and scene editing.
Existing methods (RICO, ObjectSDF++) perform poorly under sparse views and in heavily occluded regions, where geometry and appearance recovery degrade severely.
Semantic/geometric regularization methods (FreeNeRF, RegNeRF) cannot provide new information for under-constrained regions.
Core Argument: The key to solving this problem lies in supplementing missing information for unobserved regions, and generative priors from diffusion models are a natural source of that information.
Key Challenge: Direct introduction of SDS into the reconstruction pipeline causes conflicts in observed regions, necessitating a balance between reconstruction guidance and generation guidance.

Method¶

Overall Architecture¶

DP-Recon operates in three stages: (1) decompositional neural implicit surface reconstruction using reconstruction losses; (2) visibility-guided geometric SDS optimization applied to each object; (3) visibility-guided appearance SDS optimization after exporting meshes. Reconstruction and generation guidance are coordinated through a learnable visibility grid.

Key Designs¶

Visibility-Guided SDS Optimization:
- Function: Dynamically adjusts the SDS loss weight for each pixel, reducing generation guidance in high-visibility regions and enhancing it in low-visibility regions.
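- Mechanism: Introduces a learnable visibility grid \(G\), optimized against the accumulated transmittance \(T_i\) of each sample point \(p_i\) in volume rendering: \(\mathcal{L}_v = \sum_{i=0}^{n} \max(T_i - G(p_i), 0)\). Under novel views, it then renders a visibility map \(V(r) = \sum_{i} T_i \alpha_i v_i\) for each ray \(r\), where \(\alpha_i\) is the opacity and \(v_i\) the visibility value of sample \(p_i\).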
- Visibility Weighting Function: A piecewise linear function that assigns higher weights to SDS in low-visibility regions and suppresses SDS weights in high-visibility regions.
- Design Motivation: SDS exhibits artifacts such as over-saturation and over-smoothing. Observed regions with reconstruction guidance should rely primarily on reconstruction, whereas unobserved/occluded regions require generative priors to supplement information.
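The visibility terms above can be sketched on a single ray with NumPy. This is a minimal illustration, not the authors' implementation: it assumes \(v_i\) is read from the grid at \(p_i\), and the function names, the thresholds `lo` and `hi`, and the weight range are hypothetical choices, since the summary only says the weighting is piecewise linear.

```python
import numpy as np

def transmittance(alpha):
    """T_i = prod_{k<i} (1 - alpha_k) for the samples along one ray."""
    alpha = np.asarray(alpha, dtype=float)
    return np.concatenate(([1.0], np.cumprod(1.0 - alpha[:-1])))

def visibility_loss(T, grid_values):
    """L_v = sum_i max(T_i - G(p_i), 0): penalizes grid values below the transmittance."""
    return float(np.maximum(np.asarray(T) - np.asarray(grid_values), 0.0).sum())

def render_visibility(alpha, v):
    """V(r) = sum_i T_i * alpha_i * v_i (assumes v_i is the grid value at p_i)."""
    alpha = np.asarray(alpha, dtype=float)
    return float((transmittance(alpha) * alpha * np.asarray(v)).sum())

def sds_weight(V, lo=0.2, hi=0.8, w_max=1.0, w_min=0.0):
    """Hypothetical piecewise-linear weight: w_max below lo, w_min above hi."""
    s = np.clip((np.asarray(V, dtype=float) - lo) / (hi - lo), 0.0, 1.0)
    return w_max + s * (w_min - w_max)

# Toy ray with three samples; the last sample is fully opaque.
alpha = np.array([0.5, 0.5, 1.0])
T = transmittance(alpha)                          # [1.0, 0.5, 0.25]
loss = visibility_loss(T, [1.0, 0.25, 0.25])      # only the second sample contributes
V_seen = render_visibility(alpha, np.ones(3))     # fully visible ray
V_hidden = render_visibility(alpha, np.zeros(3))  # never observed ray
```

With these toy values, the fully visible ray gets an SDS weight of `w_min` and the unobserved ray gets `w_max`, which is the behavior the weighting function is meant to produce.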
Decompositional Prior-Guided Geometric Optimization:
- Function: Applies SDS independently to each object to improve geometry.
- Mechanism: Renders the normal map and mask of the \(j\)-th object to construct the Stable Diffusion input \(\tilde{n}_j\), with latent \(z\); the gradient is formulated as \(\nabla_\theta \mathcal{L}_{\text{SDS}}^{g-v} = \mathbb{E}_{t,\epsilon}\left[w^v(z)\,w(t)\,(\hat{\epsilon}_\phi(z_t;y,t) - \epsilon)\frac{\partial z}{\partial \tilde{n}_j}\frac{\partial \tilde{n}_j}{\partial \theta}\right]\), where \(w^v(z)\) is the visibility weight and \(w(t)\) the timestep weight.
- Utilizes OccGrid sampling to accelerate rendering, requiring only 0.01 seconds per \(128 \times 128\) rendering.
- Design Motivation: Unlike whole-scene SDS, object-wise SDS ensures 3D consistency across views and recovers objects behind occlusions.
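The bracketed term of the gradient above, before the chain rule through \(\partial z / \partial \tilde{n}_j\) and \(\partial \tilde{n}_j / \partial \theta\), can be sketched as follows. This is a toy illustration under stated assumptions: the noise prediction is a stand-in array rather than a call to Stable Diffusion, the visibility weight is assumed to be a per-pixel map at latent resolution, and the function name is hypothetical. In a real pipeline an autodiff framework would backpropagate this quantity to the scene parameters.

```python
import numpy as np

def visibility_weighted_sds_grad(eps_pred, eps, w_t, w_vis):
    """w^v(z) * w(t) * (eps_pred - eps), taken with respect to the latent z.

    eps_pred, eps: arrays of shape (C, H, W).
    w_t: scalar timestep weight.
    w_vis: per-pixel visibility weight of shape (H, W), broadcast over channels.
    """
    return w_t * w_vis[None, :, :] * (eps_pred - eps)

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 2, 2))   # sampled noise
eps_pred = eps + 1.0                   # stand-in for the model's noise prediction
w_vis = np.array([[0.0, 1.0],          # 0.0: well observed, SDS suppressed
                  [0.5, 1.0]])         # 1.0: unobserved, SDS at full strength
grad = visibility_weighted_sds_grad(eps_pred, eps, w_t=2.0, w_vis=w_vis)
```

Pixels with zero visibility weight receive no generative gradient, so the reconstruction losses alone determine them.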
Mesh-Level Appearance Optimization and Background Inpainting:
- Function: Optimizes UV textures using SDS after exporting individual object meshes.
- Mechanism: Employs NVDiffrast for differentiable rendering, utilizes a small network \(\psi\) to predict surface point colors, and performs joint optimization with appearance SDS and color rendering losses.
- For the background, depth-guided inpainting generates a panoramic color map that serves as supervision.
- Design Motivation: Optimizing appearance directly on meshes generates detailed UV mapping, which is compatible with lighting rendering and VFX editing in standard 3D software.

Loss & Training¶

Stage 1: Reconstruction loss \(\mathcal{L}_{recon}\) (color, depth, normal, SDF regularization, etc.)
Stage 2: \(\mathcal{L}_{recon} + \mathcal{L}_{\text{SDS}}^{g-v}\) (visibility-guided geometric SDS)
Stage 3: Color rendering loss + \(\mathcal{L}_{\text{SDS}}^{a-v}\) (visibility-guided appearance SDS)
Uses a pretrained Stable Diffusion (without fine-tuning), guided by text descriptions.
The visibility grid is optimized after Stage 1 and frozen during Stages 2 and 3.
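The stage schedule listed above can be written down as a small lookup. This is only a sketch of the bookkeeping: the loss names and helper functions are hypothetical labels for the terms in this section, and loss weights are omitted because the summary does not give them.

```python
# Hypothetical labels for the loss terms active in each training stage.
STAGE_LOSSES = {
    1: ["recon"],
    2: ["recon", "sds_geometry_visibility"],
    3: ["color_render", "sds_appearance_visibility"],
}

# Stages in which the visibility grid is kept frozen.
FROZEN_GRID_STAGES = {2, 3}

def active_losses(stage):
    """Return the loss terms that are summed in the given stage."""
    if stage not in STAGE_LOSSES:
        raise ValueError(f"unknown stage: {stage}")
    return STAGE_LOSSES[stage]

def grid_is_frozen(stage):
    """True when the visibility grid receives no updates in this stage."""
    return stage in FROZEN_GRID_STAGES
```

Keeping the schedule in one table makes it easy to see that the generative terms only enter from Stage 2 onward.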

Key Experimental Results¶

Main Results (Replica, 10 views)¶

Method	CD↓	F-Score↑	NC↑	PSNR↑	MUSIQ↑
MonoSDF	12.57	43.25	83.14	22.44	36.02
ObjectSDF++	8.57	50.11	85.44	24.66	41.42
Ours (geo)	7.91	50.99	89.36	25.08	43.33
Ours (full)	7.91	50.99	89.36	24.52	49.22

Decompositional Object Reconstruction (Replica)¶

Method	Object CD↓	Object F-Score↑	Object NC↑	mIoU↑
RICO	10.32	49.26	61.27	71.21
ObjectSDF++	7.49	56.69	64.75	71.72
Ours	5.54	67.71	73.50	88.21

Different Number of Views (Replica, Scene CD↓ / NC↑)¶

Method	5 views	10 views	15 views
ObjectSDF++	–	8.57 / 85.44	–
Ours	–	7.91 / 89.36	–

Key Findings¶

Reconstruction quality with 10 views outperforms baseline results obtained with 100 views (in heavily occluded scenes).
Object reconstruction mIoU increases by 16.5 pp (71.72 → 88.21), demonstrating that generative priors significantly improve object segmentation and completeness.
Object CD on the ScanNet++ real-world dataset decreases from 14.52 to 5.03, a 65% reduction.
The MUSIQ perceptual quality metric improves from 41.42 to 49.22, indicating that appearance SDS significantly enhances visual quality.
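The headline numbers above follow directly from the reported values, as this short check shows. The helper names are illustrative; the inputs are the figures quoted in this section.

```python
def absolute_gain(before, after):
    """Difference in percentage points (or raw metric units)."""
    return after - before

def relative_reduction(before, after):
    """Fractional decrease relative to the starting value."""
    return (before - after) / before

miou_gain = absolute_gain(71.72, 88.21)          # about 16.49 pp
cd_reduction = relative_reduction(14.52, 5.03)   # about 0.654, i.e. 65%
musiq_gain = absolute_gain(41.42, 49.22)         # about 7.80
```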

Highlights & Insights¶

Core Innovation: Integrates SDS into decompositional scene reconstruction for the first time, applying generative priors to each object independently rather than to the entire scene.
Elegant Visibility Guidance Design: Leverages the transmittance already computed during volume rendering, so it needs no external visibility prior and adds negligible computational cost.
Practical Value: The generated decoupled UV meshes can be directly imported into 3D software like Blender for VFX editing.
10 views > 100 views: In heavily occluded scenes, the 10-view results surpass baselines trained with 100 views, which demonstrates the potential of generative priors under extremely sparse views.

Limitations & Future Work¶

The inherent over-saturation and over-smoothing issues of SDS are alleviated by visibility guidance but not fully eliminated.
The background is processed using panoramic inpainting, which may suffer from quality degradation in complex outdoor environments.
Dependence on Stable Diffusion means out-of-domain objects (rare/unusual objects) may exhibit poor generation results.
The training process involves multiple stages, which makes the overall training time relatively high.
Evaluation is limited to indoor scenes (Replica, ScanNet++), and generalization to outdoor scenes remains to be verified.

Unlike DreamFusion's SDS, DP-Recon represents a hybrid paradigm of reconstruction and generation rather than pure generation.
The visibility guidance concept can be transferred to other SDS application scenarios (e.g., balancing known/unknown regions in single-image 3D generation).
Inspires future work to replace 2D Stable Diffusion with diffusion models that offer stronger multi-view consistency (e.g., 3D-aware or video diffusion models).

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Pioneering work integrating SDS with decompositional reconstruction, featuring an exquisitely designed visibility guidance strategy.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive evaluations on Replica and ScanNet++, comparisons with various baselines, experiments with different view counts, comprehensive ablation study, and demonstrations of multiple downstream applications.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition with a logical progression of the methodology.
Value: ⭐⭐⭐⭐⭐ Groundbreaking demonstration of the value of generative priors in sparse-view decompositional reconstruction; outperforming 100-view baselines with only 10 views in heavily occluded scenes holds landmark significance.