Murre: Multi-view Reconstruction via SfM-guided Monocular Depth Estimation

Conference: CVPR 2025
arXiv: 2503.14483
Code: https://zju3dv.github.io/murre/
Area: 3D Vision / Multi-view Reconstruction
Keywords: Multi-view Reconstruction, SfM Guidance, Monocular Depth Estimation, Diffusion Models, Depth Completion

TL;DR

Proposes Murre, a multi-view 3D reconstruction framework that injects sparse SfM point clouds into a diffusion model to guide monocular depth estimation. This avoids the dense multi-view matching step of traditional MVS, and the paper reports better results than prior state-of-the-art methods on diverse real-world scenes (indoor, street, aerial).

Background & Motivation

Background: Learning-based MVS methods perform poorly in low-texture areas and sparse views, and 3D cost volumes consume a significant amount of GPU memory.

Limitations of Prior Work: MVS relies on multi-view matching, which fails under sparse views and in low-texture regions; monocular depth estimation needs no matching but lacks multi-view consistency and metric scale.

Key Challenge: Multi-view consistency requires matching, but matching is unreliable in challenging scenes; monocular prediction does not require matching but lacks consistency.

Core Idea: Use SfM point clouds as an explicit intermediate representation to inject multi-view information into a monocular depth diffusion model, combining the advantages of both.

Method

Overall Architecture

Given multi-view images: (1) SfM reconstructs a sparse point cloud; (2) the point cloud is projected onto each view to form sparse depth maps, which are then densified; (3) the densified depth maps and RGB images are used as conditional inputs to a diffusion model to predict metric depth; (4) TSDF fusion is applied to obtain the final geometry.
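Step (2) of the pipeline above can be sketched as a pinhole projection of the SfM points into one view. This is a minimal illustration in plain Python, not the authors' code: the function name, the `(fx, fy, cx, cy)` intrinsics tuple, the world-to-camera pose convention, and the use of `0.0` to mark empty pixels are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def project_to_sparse_depth(
    points_world: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],            # 3x3 world-to-camera rotation (assumed convention)
    translation: Sequence[float],                   # world-to-camera translation
    intrinsics: Tuple[float, float, float, float],  # (fx, fy, cx, cy), hypothetical layout
    width: int,
    height: int,
) -> List[List[float]]:
    """Return an H x W depth map; 0.0 marks pixels with no SfM point."""
    fx, fy, cx, cy = intrinsics
    depth = [[0.0] * width for _ in range(height)]
    for p in points_world:
        # World -> camera coordinates: x_cam = R @ p + t
        x, y, z = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0.0:
            continue  # point is behind the camera
        # Pinhole projection to the nearest pixel
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the closest point when several land on the same pixel
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth
```

Only pixels hit by an SfM point receive a depth, which is why the resulting map is sparse and needs densification before it is used as a condition.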

Key Designs

SfM Prior Injection into Diffusion Model:
- Function: Provides multi-view consistent metric information for monocular depth estimation.
- Mechanism: Densifies SfM sparse depth maps using KNN interpolation and computes a distance map (measuring the distance of each pixel to the nearest valid point) as a confidence indicator. The densified depth map and the distance map are fed together as conditions into a depth diffusion model based on Stable Diffusion V2.
- Design Motivation: SfM point clouds are a condensed representation of multi-view information, naturally providing metric scale and prominent scene structures.
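The densification and distance map described above can be sketched as follows. The summary only says "KNN interpolation", so the inverse-distance weighting, the default `k=3`, and the brute-force neighbour search are assumptions chosen to keep the example short; a practical version would use a spatial index.

```python
import math
from typing import List, Tuple


def densify_knn(
    sparse: List[List[float]], k: int = 3
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (dense_depth, distance_map) for a sparse map where 0.0 means 'no point'."""
    height, width = len(sparse), len(sparse[0])
    valid = [
        (r, c, sparse[r][c])
        for r in range(height)
        for c in range(width)
        if sparse[r][c] > 0.0
    ]
    if not valid:
        raise ValueError("sparse depth map has no valid points")

    dense = [[0.0] * width for _ in range(height)]
    dist_map = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # k nearest valid pixels, as (pixel distance, depth) pairs
            nearest = sorted(
                (math.hypot(r - vr, c - vc), d) for vr, vc, d in valid
            )[:k]
            # Distance to the closest valid point acts as a confidence cue
            dist_map[r][c] = nearest[0][0]
            if nearest[0][0] == 0.0:
                dense[r][c] = nearest[0][1]  # pixel already has an SfM depth
            else:
                # Inverse-distance weighting (an assumption, see lead-in)
                weights = [1.0 / dd for dd, _ in nearest]
                dense[r][c] = sum(
                    wt * d for wt, (_, d) in zip(weights, nearest)
                ) / sum(weights)
    return dense, dist_map
```

Pixels far from any SfM point get a large value in the distance map, which lets the network learn to trust the interpolated depth less there.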
Depth Normalization and Scale Alignment:
- Function: Handles depth-range discrepancies across different scenes and viewpoints.
- Mechanism: First filters out the top and bottom 2% outliers of SfM depths, then expands the range to \(0.8 \times \min\) and \(1.2 \times \max\), and uses this range to normalize ground-truth depths for training. During inference, RANSAC linear regression is used to align the predicted depths with SfM depths.
- Design Motivation: SfM depths contain outliers and cover only a fraction of pixels, requiring a robust normalization strategy.
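A rough sketch of both steps above, in plain Python. The 2% trimming and the 0.8 / 1.2 factors come from the description; mapping the range to [0, 1], fitting a scale and shift directly in depth space, and the `iters`, `threshold`, and `seed` parameters are assumptions made for illustration.

```python
import random
from typing import Sequence, Tuple


def sfm_depth_range(
    sfm_depths: Sequence[float],
    trim: float = 0.02,
    low_scale: float = 0.8,
    high_scale: float = 1.2,
) -> Tuple[float, float]:
    """Drop the top/bottom `trim` fraction of SfM depths, then widen the range."""
    valid = sorted(d for d in sfm_depths if d > 0.0)
    if not valid:
        raise ValueError("no valid SfM depths")
    cut = int(len(valid) * trim)
    kept = valid[cut:len(valid) - cut] or valid
    return low_scale * kept[0], high_scale * kept[-1]


def normalize_depth(depth: float, d_min: float, d_max: float) -> float:
    """Map a depth into [0, 1] using the SfM-derived range (target range is assumed)."""
    return (depth - d_min) / (d_max - d_min)


def ransac_scale_shift(
    pred: Sequence[float],
    sfm: Sequence[float],
    iters: int = 200,
    threshold: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Fit sfm ~= scale * pred + shift with RANSAC, then refit on the inliers."""
    pairs = list(zip(pred, sfm))
    if len(pairs) < 2:
        raise ValueError("need at least two correspondences")
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(pairs, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot define a line
        scale = (y2 - y1) / (x2 - x1)
        shift = y1 - scale * x1
        inliers = [(x, y) for x, y in pairs if abs(scale * x + shift - y) <= threshold]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("RANSAC found no usable model")
    # Least-squares refit on the best inlier set
    n = len(best)
    mean_x = sum(x for x, _ in best) / n
    mean_y = sum(y for _, y in best) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in best)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in best)
    scale = cov_xy / var_x
    return scale, mean_y - scale * mean_x
```

At inference time the fitted `scale` and `shift` would be applied to every predicted pixel, so that the dense prediction agrees with the SfM depths at the pixels where SfM points exist.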
Stable Diffusion-Based Depth Estimation:
- Function: Leverages the strong priors of 2D foundation models to achieve generalization.
- Mechanism: Initialized from SD V2, with the VAE frozen and only the UNet fine-tuned. The depth map is replicated into three channels and encoded into the latent space by the VAE encoder, where the forward diffusion (noising) and denoising steps take place.
- Design Motivation: Fine-tuning on a small amount of synthetic data enables generalization to diverse real-world scenes.

Loss & Training

Training uses the standard diffusion denoising loss. Detector-free SfM is used so that low-texture regions still yield points. The training data includes synthetic scenes.
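For reference, one common form of the diffusion denoising loss is the noise-prediction objective below. The summary does not state which parameterization Murre uses, so this form and all symbols are assumptions: \(\mathbf{z}_0\) is the latent of the normalized ground-truth depth, \(\mathbf{c}\) stands for the conditioning latents (RGB image, densified SfM depth, distance map), \(\bar{\alpha}_t\) is the cumulative noise schedule, and \(\boldsymbol{\epsilon}_\theta\) is the fine-tuned UNet.

\[
\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},
\qquad
\mathcal{L} = \mathbb{E}_{\mathbf{z}_0,\ \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\ t}
\left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \mathbf{c}, t) \right\|_2^2 \right]
\]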

Key Experimental Results

Main Results

| Dataset | Murre vs. prior SOTA |
| --- | --- |
| DTU | Outperforms existing MVS and neural implicit methods |
| ScanNet | Competitive performance |
| Waymo | Outperforms monocular methods |
| UrbanScene3D | Outperforms MVS |

Key Findings

- SfM guidance significantly improves depth consistency and metric accuracy.
- KNN densification combined with the distance map is more effective than directly using sparse depths.
- Training on synthetic data is sufficient to generalize to various real-world scenarios.

Highlights & Insights

- Injecting SfM as a "condensed" representation of multi-view information into a monocular model is an elegant idea.
- Bypassing the 3D cost volume effectively resolves the twin bottlenecks of GPU memory and sparse viewpoints.
- The generalization ability of 2D foundation models is effectively unleashed through SfM guidance.

Limitations & Future Work

- Depends on SfM quality; reconstruction degrades when SfM fails.
- Multi-step diffusion inference adds computational overhead.
- Dynamic scenes require extra handling.

Rating

Novelty: 8/10. The combination of SfM and diffusion-based depth is novel.
Technical Depth: 8/10. The normalization and alignment strategies are detailed and well-designed.
Experimental Thoroughness: 9/10. Validated on the four datasets listed above, covering object-level, indoor, street, and aerial scenes.
Writing Quality: 8/10. The methodology description is clear.