MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo¶

Conference: ECCV 2024
arXiv: 2405.12218
Code: https://github.com/TQTQliu/MVSGaussian
Area: 3D Vision / Novel View Synthesis
Keywords: Gaussian Splatting, MVS, Generalizable Reconstruction, Hybrid Rendering, Per-scene Optimization

TL;DR¶

This work integrates cost volume-based depth estimation from MVS with 3D Gaussian Splatting, enhancing generalization through hybrid rendering (splatting + volume rendering). It proposes a geometric-consistency-based point cloud aggregation strategy that allows per-scene optimization to surpass the performance of a 10-minute 3D-GS optimization in just 45 seconds.

Background & Motivation¶

Background¶

Background: Although 3D-GS achieves real-time, high-quality rendering, it relies on time-consuming per-scene optimization. Existing feed-forward generalizable methods (such as PixelSplat) are limited to image-pair inputs, incur heavy computational overhead, or apply only to object-level reconstruction.
Key Challenge: The explicit, parameterized representation of 3D-GS varies drastically from scene to scene, so it is unclear what kind of generalizable representation can be adapted to it. Moreover, the many-to-many mapping between Gaussians and pixels in splatting is considerably harder to generalize than the volume rendering used in NeRF.

Proposed Solution¶

Goal: Build an efficient, generalizable Gaussian Splatting framework that produces reasonable results through feed-forward inference on unseen scenes, and that can be fine-tuned in about 45 seconds to match or exceed long-running 3D-GS optimization.

Method¶

Overall Architecture¶

Multi-view inputs → FPN multi-scale feature extraction → Differentiable homography warping for cost volume construction → 3D CNN regularization → Depth estimation → Pixel-aligned 3D point back-projection → Multi-view feature aggregation + 2D UNet spatial enhancement → Decode Gaussian parameters → Hybrid rendering (average of splatting & depth-aware volume rendering) → Cascade coarse-to-fine structure.
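
The depth-estimation stage of this pipeline can be illustrated for a single pixel. The sketch below is a simplified illustration, not the authors' code: it assumes a variance-based matching cost (a common choice in MVSNet-style pipelines), skips the 3D CNN regularization entirely, and converts costs into a depth with a soft-argmin. The function names, the `temperature` parameter, and the toy feature values are all hypothetical.

```python
import math

def variance_cost(warped_features):
    """Mean per-channel variance across views for one depth hypothesis.

    warped_features: list of per-view feature vectors, sampled where the
    pixel lands in each view under that depth hypothesis.
    """
    n_views = len(warped_features)
    n_channels = len(warped_features[0])
    total = 0.0
    for c in range(n_channels):
        values = [f[c] for f in warped_features]
        mean = sum(values) / n_views
        total += sum((v - mean) ** 2 for v in values) / n_views
    return total / n_channels

def regress_depth(costs, depth_hypotheses, temperature=1.0):
    """Soft-argmin: softmax over negative costs, then the expected depth."""
    logits = [-c / temperature for c in costs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    depth = sum(p * d for p, d in zip(probs, depth_hypotheses))
    return depth, probs

# Toy example: three depth hypotheses, three views, two feature channels.
depth_hypotheses = [1.0, 2.0, 3.0]
features_per_hypothesis = [
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],  # views disagree
    [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4]],  # views agree, cost near zero
    [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]],  # views disagree
]
costs = [variance_cost(f) for f in features_per_hypothesis]
depth, probs = regress_depth(costs, depth_hypotheses, temperature=0.01)
```

In this toy setup the views agree only at the middle hypothesis, so the probability mass concentrates there and the regressed depth lands close to 2.0. In the real pipeline the probabilities come from a learned 3D CNN rather than directly from raw variances.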

Key Designs¶

MVS-driven Pixel-aligned Gaussian Representation: Depth is estimated with an MVS cost-volume pipeline, and each pixel is back-projected into 3D space to serve as a Gaussian center. Multi-view features are aggregated with a pooling network, followed by a 2D UNet that adds spatial awareness (needed because each Gaussian in splatting affects a region of several pixels).
Hybrid Gaussian Rendering: Pure splatting involves a many-to-many mapping (one Gaussian contributes to multiple pixels, and one pixel is affected by multiple Gaussians), which is difficult to generalize. A simple depth-aware volume rendering (VR) branch that samples one point per ray is added as a complement, establishing a one-to-one mapping. The two rendering outputs are averaged. Ablations show that this combination outperforms both pure splatting and pure VR.
Geometrically Consistent Point Cloud Aggregation: Multi-view Gaussian point clouds generated by the generalizable model serve as the initialization for per-scene optimization. However, direct concatenation introduces substantial noise, whereas voxel down-sampling loses valid points. By computing depth reprojection errors across different views, geometric consistency is dynamically verified to retain only the points that are consistent across multiple views.
Direct RGB Regression over SH: It is observed that learning spherical harmonics (SH) coefficients degrades performance in generalizable settings. Directly decoding RGB via an MLP yields superior results.
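
The hybrid rendering idea from the list above can be sketched for a single pixel. This is a minimal illustration under stated assumptions, not the paper's formulation: the splatting branch is modeled as standard front-to-back alpha compositing over several Gaussians, and the single-sample volume rendering branch is modeled as the same compositing with exactly one sample. The function names and the sample values are hypothetical.

```python
def composite_front_to_back(samples):
    """Alpha-composite (rgb, alpha) samples sorted from near to far."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for rgb, alpha in samples:
        weight = transmittance * alpha
        color = [c + weight * x for c, x in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color

def hybrid_pixel(splat_samples, vr_sample):
    """Average the splatting result with a one-sample volume rendering result."""
    splat_rgb = composite_front_to_back(splat_samples)   # many Gaussians -> one pixel
    vr_rgb = composite_front_to_back([vr_sample])        # one point -> one pixel
    return [0.5 * (s + v) for s, v in zip(splat_rgb, vr_rgb)]

# Two Gaussians overlap this pixel in the splatting branch; the VR branch
# contributes the single pixel-aligned point.
splat_samples = [([1.0, 0.0, 0.0], 0.5), ([0.0, 0.0, 1.0], 1.0)]
vr_sample = ([1.0, 1.0, 1.0], 1.0)
pixel = hybrid_pixel(splat_samples, vr_sample)
```

The point of the sketch is the structure: the splatting branch mixes contributions from several Gaussians, while the volume rendering branch ties each pixel to exactly one point, and the final color is the plain average of the two.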

Loss & Training¶

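Generalizable Model: \(L = L_{\text{mse}} + \lambda_s L_{\text{ssim}} + \lambda_p L_{\text{perc}}\), where \(L_{\text{ssim}}\) and \(L_{\text{perc}}\) are the SSIM and perceptual losses; the total loss is a weighted sum of \(L\) over the cascade stages.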
Per-scene Optimization: \(L1\) + D-SSIM (consistent with original 3D-GS)
View Sampling: During training, 2, 3, or 4 source views are sampled at random (with probabilities 0.1, 0.8, and 0.1, respectively) to improve generalization across varying numbers of views.
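
The training recipe above can be sketched as follows. This is a rough illustration rather than the released training code: the individual loss terms are passed in as precomputed scalars, and the values of `lambda_s`, `lambda_p`, and the per-stage weights are placeholders chosen for the example, not values taken from the paper.

```python
import random

def sample_num_source_views(rng):
    """Draw 2, 3, or 4 source views with probabilities 0.1 / 0.8 / 0.1."""
    return rng.choices([2, 3, 4], weights=[0.1, 0.8, 0.1], k=1)[0]

def stage_loss(mse, ssim_loss, perceptual, lambda_s, lambda_p):
    """Loss of one cascade stage: MSE plus weighted SSIM and perceptual terms."""
    return mse + lambda_s * ssim_loss + lambda_p * perceptual

def cascade_loss(stage_terms, stage_weights, lambda_s, lambda_p):
    """Weighted sum of the per-stage losses across the cascade."""
    return sum(
        w * stage_loss(*terms, lambda_s, lambda_p)
        for terms, w in zip(stage_terms, stage_weights)
    )

rng = random.Random(0)
draws = [sample_num_source_views(rng) for _ in range(10000)]
fraction_three = draws.count(3) / len(draws)

# Placeholder numbers: (mse, ssim_loss, perceptual) for a coarse and a fine stage.
total = cascade_loss(
    stage_terms=[(0.02, 0.1, 0.3), (0.01, 0.05, 0.2)],
    stage_weights=[0.5, 1.0],   # hypothetical stage weights
    lambda_s=0.1,               # hypothetical
    lambda_p=0.05,              # hypothetical
)
```

With these placeholder inputs the coarse stage contributes 0.5 × 0.045 and the fine stage 1.0 × 0.025, and roughly 80% of the sampled training iterations use three source views.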

Key Experimental Results¶

Setting	Metric	MVSGaussian	3D-GS	ENeRF	Gain (vs. 3D-GS; vs. ENeRF where 3D-GS is absent)
DTU (Generalizable)	PSNR	28.21	-	27.61	+0.60
LLFF (Generalizable)	PSNR	24.07	-	23.63	+0.44
LLFF (45s Opt.)	PSNR	26.98	23.92 (10 min)	24.89 (1 h)	+3.06
NeRF Syn (50s Opt.)	PSNR	32.20	31.87 (7 min)	27.57 (1 h)	+0.33
Rendering Speed	FPS	350+	350	14	Real-time
Optimization Time	Seconds	45	600	3600	13.3x
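
The LLFF gain and speedup figures can be reproduced directly from the table entries. The short check below also converts the PSNR gain into an error ratio, using only the standard definition PSNR = 10·log10(MAX² / MSE); the helper name is hypothetical.

```python
def psnr_gain_to_mse_ratio(gain_db):
    """A PSNR gain of g dB corresponds to an MSE that is 10**(g/10) times lower."""
    return 10 ** (gain_db / 10)

# LLFF, per-scene optimization: MVSGaussian (45 s) vs. 3D-GS (10 min).
llff_gain_db = round(26.98 - 23.92, 2)   # PSNR difference in dB
speedup = round(600 / 45, 1)             # optimization-time ratio
mse_ratio = psnr_gain_to_mse_ratio(llff_gain_db)
```

Here `llff_gain_db` matches the +3.06 entry (measured against 3D-GS, not ENeRF), `speedup` matches the 13.3x entry, and the gain corresponds to roughly half the mean squared error.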

Ablation Study¶

Cascade Structure: Removing it drops the DTU performance by 1.5 dB.
Hybrid Rendering is Key: GS + VR (28.21) > Pure GS (27.48) > Pure VR (27.39).
RGB vs. SH: Directly regressing RGB outperforms SH by over 2 dB on NeRF Synthetic.
Consistency Check Aggregation: Consistency-check aggregation (PSNR 26.98) > voxel down-sampling (26.72) > direct concatenation (26.18); it also achieves faster optimization (45s vs. 90s).
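
The consistency-check aggregation compared in the last item can be sketched with a forward-backward reprojection test. This is a simplified illustration, not the authors' implementation: it assumes pinhole cameras given as `(K, R, t)` with `X_cam = R @ X_world + t`, uses nearest-pixel depth lookup, and applies fixed thresholds (`pix_thresh`, `rel_depth_thresh`, `min_views`) that are hypothetical, whereas the method described above verifies consistency dynamically.

```python
import math

def backproject(u, v, depth, K, R, t):
    """Pixel (u, v) with depth -> 3D world point."""
    fx, fy, cx, cy = K
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    d = [x_cam[i] - t[i] for i in range(3)]
    return [sum(R[r][c] * d[r] for r in range(3)) for c in range(3)]  # R^T d

def project(X, K, R, t):
    """3D world point -> (u, v, depth) in the given camera."""
    fx, fy, cx, cy = K
    x_cam = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    z = x_cam[2]
    return fx * x_cam[0] / z + cx, fy * x_cam[1] / z + cy, z

def is_consistent(u, v, ref_depth, ref_cam, src_cam, src_depth_map,
                  pix_thresh=1.0, rel_depth_thresh=0.01):
    """Project ref pixel into the source view, then back, and compare."""
    X = backproject(u, v, ref_depth, *ref_cam)
    us, vs, zs = project(X, *src_cam)
    ui, vi = round(us), round(vs)
    if zs <= 0 or not (0 <= vi < len(src_depth_map)
                       and 0 <= ui < len(src_depth_map[0])):
        return False
    X_back = backproject(ui, vi, src_depth_map[vi][ui], *src_cam)
    ur, vr, zr = project(X_back, *ref_cam)
    pix_err = math.hypot(ur - u, vr - v)
    depth_err = abs(zr - ref_depth) / ref_depth
    return pix_err < pix_thresh and depth_err < rel_depth_thresh

def filter_points(ref_pixels, ref_cam, sources, min_views=1):
    """Keep only points that agree with at least `min_views` source views."""
    kept = []
    for u, v, depth in ref_pixels:
        votes = sum(is_consistent(u, v, depth, ref_cam, cam, dmap)
                    for cam, dmap in sources)
        if votes >= min_views:
            kept.append(backproject(u, v, depth, *ref_cam))
    return kept

# Toy scene: a fronto-parallel plane at depth 2 seen by two cameras.
K = (100.0, 100.0, 4.0, 4.0)                      # fx, fy, cx, cy
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ref_cam = (K, I3, [0.0, 0.0, 0.0])
src_cam = (K, I3, [-0.02, 0.0, 0.0])              # shifted along x
src_depth = [[2.0] * 8 for _ in range(8)]
candidates = [(4, 4, 2.0), (4, 4, 3.0)]           # correct vs. noisy depth
kept = filter_points(candidates, ref_cam, [(src_cam, src_depth)])
```

In the toy scene, the candidate with the correct depth survives the check, while the candidate with a wrong depth of 3.0 reprojects to a point whose depth disagrees with the source view and is dropped. This is the behavior that lets the aggregated point cloud discard noisy points without discarding valid ones the way uniform voxel down-sampling can.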

Highlights & Insights¶

Insight on Hybrid Rendering: The many-to-many mapping in splatting acts as a bottleneck for generalization; complementing it with a simple one-to-one volume rendering leads to excellent synergy.
Outperforming 3D-GS in 45s: High-quality initialization combined with geometric consistency filtering is the key factor, supporting the view that "a good initialization is worth a thousand optimization steps."
View-agnostic Design: Mixed training with 2/3/4 source views lets the model accept varying numbers of views at inference.

Limitations & Future Work¶

Inherits standard MVS limitations: Inaccurate depth estimation in weakly-textured and highly reflective areas.
The generalizable model is trained only on DTU, which limits scene diversity.
Future work could explore scaling the hybrid rendering insights to larger-scale scenes (e.g., city-scale).

Related Work & Insights¶

vs. MVSplat: Both use MVS cost volumes for depth estimation. However, MVSGaussian additionally incorporates hybrid rendering to improve generalization and supports per-scene optimization; MVSplat is lighter but lacks support for per-scene optimization.
vs. PixelSplat: MVSGaussian supports multi-view inputs (not limited to dual-view), incurs lower computational overhead, and achieves a 14 dB improvement in PSNR on DTU.
vs. 3D-GS: MVSGaussian provides high-quality initialization through feed-forward generalization, allowing a 45-second optimization to outperform the 10-minute results of 3D-GS.

The hybrid rendering (splatting + VR) concept can be migrated to other 3D tasks that require generalization.
The geometrically consistent point cloud filtering strategy offers valuable insights for any point cloud-based 3D reconstruction.
MVSGaussian and MVSplat both appeared at ECCV 2024 as "MVS + 3DGS" methods, which points to the value and promise of this research direction.

Rating¶

Novelty: ⭐⭐⭐⭐ The hybrid rendering design is clever, though the overall framework is a combination of MVS and 3D-GS.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across four datasets with detailed ablations and per-scene breakdowns.
Writing Quality: ⭐⭐⭐⭐ Well-structured with thorough ablation analysis.
Value: ⭐⭐⭐⭐⭐ Highly practical; the 45-second rapid optimization is extremely valuable for practical applications.