MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Conference: ECCV 2024
arXiv: 2405.12218
Code: https://github.com/TQTQliu/MVSGaussian
Area: 3D Vision / Novel View Synthesis
Keywords: Gaussian Splatting, MVS, Generalizable Reconstruction, Hybrid Rendering, Per-scene Optimization
TL;DR
This work integrates cost volume-based depth estimation from MVS with 3D Gaussian Splatting, enhancing generalization through hybrid rendering (splatting + volume rendering). It proposes a geometric-consistency-based point cloud aggregation strategy that allows per-scene optimization to surpass the performance of a 10-minute 3D-GS optimization in just 45 seconds.
Background & Motivation
Background
Background: Although 3D-GS achieves real-time, high-quality rendering, it relies on time-consuming per-scene optimization. Existing feed-forward generalizable methods (such as PixelSplat) are limited to image-pair inputs, incur heavy computational overhead, or apply only to object-level reconstruction.

Key Challenge: The explicit, parameterized representation of 3D-GS varies drastically from scene to scene, so it is unclear how to design a generalizable representation that fits it. Moreover, the many-to-many mapping between Gaussians and pixels in splatting is significantly harder to generalize than the volume rendering used in NeRF.
Proposed Solution
Goal: Build an efficient, generalizable Gaussian Splatting framework that produces reasonable results on unseen scenes through feed-forward inference alone, and that can be fine-tuned within 45 seconds to match or exceed the results of much longer 3D-GS optimization.
Method
Overall Architecture
Multi-view inputs → FPN multi-scale feature extraction → Differentiable homography warping for cost volume construction → 3D CNN regularization → Depth estimation → Pixel-aligned 3D point back-projection → Multi-view feature aggregation + 2D UNet spatial enhancement → Decode Gaussian parameters → Hybrid rendering (average of splatting & depth-aware volume rendering) → Cascade coarse-to-fine structure.
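To make the "pixel-aligned 3D point back-projection" step concrete, here is a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `K`, depth measured along the camera z-axis, and a camera-to-world pose; the function name and argument conventions are illustrative assumptions, not taken from the MVSGaussian code.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift every pixel of a depth map to a 3D point in world space.

    depth:        (H, W) depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world transform
    returns:      (H, W, 3) world-space points, one per pixel
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Unproject: x_cam = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth[..., None]
    # Move the points from camera space to world space.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t
```

Each returned point would then serve as one Gaussian center, which is why the representation has exactly one Gaussian per source-view pixel.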
Key Designs
- MVS-driven Pixel-aligned Gaussian Representation: Depth is estimated with a standard learning-based MVS cost-volume pipeline, and each pixel is back-projected into 3D space to serve as a Gaussian center. Multi-view features are aggregated with a pooling network, followed by a 2D UNet that adds spatial awareness (needed because each Gaussian in splatting affects a region of pixels rather than a single one).
- Hybrid Gaussian Rendering: The many-to-many mapping of pure splatting (one Gaussian contributes to multiple pixels, and one pixel is affected by multiple Gaussians) is difficult to generalize. A simple depth-aware volume rendering branch that samples a single point per ray is added as a complement, since it establishes a one-to-one mapping between points and pixels. The two rendered outputs are averaged. Ablations show that this combination outperforms both pure splatting and pure volume rendering (VR).
- Geometrically Consistent Point Cloud Aggregation: Multi-view Gaussian point clouds generated by the generalizable model serve as the initialization for per-scene optimization. However, direct concatenation introduces substantial noise, whereas voxel down-sampling loses valid points. By computing depth reprojection errors across different views, geometric consistency is dynamically verified to retain only the points that are consistent across multiple views.
- Direct RGB Regression over SH: It is observed that learning spherical harmonics (SH) coefficients degrades performance in generalizable settings. Directly decoding RGB via an MLP yields superior results.
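The geometric consistency check described above can be sketched with a round-trip reprojection test. The version below is a simplified illustration with fixed thresholds (`pix_thresh`, `depth_thresh`, `min_views` are assumed values), whereas the text describes a dynamic check; all function names are hypothetical, and all depth maps are assumed to share one resolution.

```python
import numpy as np

def reprojection_errors(depth_ref, K_ref, T_ref, depth_src, K_src, T_src):
    """Round-trip errors of a reference depth map against one source view.

    T_ref / T_src are (4, 4) world-to-camera matrices. Returns
    (pixel_error, relative_depth_error), both (H, W); pixels that land
    outside the source image get infinite error.
    """
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    def lift(pix_h, depth, K):       # pixels + depth -> camera-space points
        return (pix_h @ np.linalg.inv(K).T) * depth[..., None]

    def move(pts, T_from, T_to):     # camera A -> world -> camera B
        rel = T_to @ np.linalg.inv(T_from)
        return pts @ rel[:3, :3].T + rel[:3, 3]

    def project(pts, K):             # camera-space points -> (pixels, depth)
        z = pts[..., 2]
        safe_z = np.where(z > 1e-8, z, 1.0)
        return (pts @ K.T)[..., :2] / safe_z[..., None], z

    # 1) Reference pixels -> source view.
    uv_src, z_src = project(move(lift(pix, depth_ref, K_ref), T_ref, T_src), K_src)
    us = np.rint(uv_src[..., 0]).astype(int)
    vs = np.rint(uv_src[..., 1]).astype(int)
    inside = (us >= 0) & (us < W) & (vs >= 0) & (vs < H) & (z_src > 1e-8)
    us_c, vs_c = np.clip(us, 0, W - 1), np.clip(vs, 0, H - 1)

    # 2) Read the source depth there (nearest neighbour) and project back.
    d_src = depth_src[vs_c, us_c]
    pix_src = np.stack([us_c, vs_c, np.ones_like(us_c)], axis=-1).astype(np.float64)
    uv_back, z_back = project(move(lift(pix_src, d_src, K_src), T_src, T_ref), K_ref)

    # 3) Compare the round trip with the starting pixel and depth.
    pix_err = np.linalg.norm(uv_back - pix[..., :2], axis=-1)
    depth_err = np.abs(z_back - depth_ref) / np.maximum(depth_ref, 1e-8)
    return np.where(inside, pix_err, np.inf), np.where(inside, depth_err, np.inf)

def consistency_mask(ref, sources, pix_thresh=1.0, depth_thresh=0.01, min_views=2):
    """ref and each source are (depth, K, T) tuples; returns an (H, W) bool mask."""
    votes = np.zeros(ref[0].shape, dtype=int)
    for src in sources:
        pix_err, depth_err = reprojection_errors(*ref, *src)
        votes += (pix_err < pix_thresh) & (depth_err < depth_thresh)
    return votes >= min_views
```

Only the Gaussians whose pixels pass the mask would be kept when merging the per-view point clouds. This removes inconsistent points without discarding valid ones the way uniform voxel down-sampling can.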
Loss & Training
- Generalizable Model: \(\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda_s \mathcal{L}_{\text{SSIM}} + \lambda_p \mathcal{L}_{\text{perc}}\), computed at each cascade stage and combined as a weighted sum across stages.
- Per-scene Optimization: \(\mathcal{L}_1\) + D-SSIM (consistent with the original 3D-GS).
- Randomly sample 2, 3, or 4 source views during training (probability of 0.1/0.8/0.1) to enhance generalization across varying numbers of views.
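The training recipe above can be sketched in a few lines of Python. The sampling probabilities come from the text; the default `lambda_s` and `lambda_p` values and the stage weights are placeholders chosen for illustration, and the individual loss terms are assumed to be computed elsewhere.

```python
import random

VIEW_COUNTS = (2, 3, 4)
VIEW_PROBS = (0.1, 0.8, 0.1)  # from the text: mostly 3 views, sometimes 2 or 4

def sample_num_source_views(rng=random):
    """Draw the number of source views for one training iteration."""
    return rng.choices(VIEW_COUNTS, weights=VIEW_PROBS, k=1)[0]

def stage_loss(mse, ssim_loss, perceptual, lambda_s=0.1, lambda_p=0.05):
    """MSE + lambda_s * SSIM term + lambda_p * perceptual term for one stage.

    The default lambdas are placeholders, not values taken from the text.
    """
    return mse + lambda_s * ssim_loss + lambda_p * perceptual

def total_loss(per_stage_terms, stage_weights):
    """Weighted sum of the per-stage losses across the cascade."""
    return sum(w * stage_loss(*terms)
               for w, terms in zip(stage_weights, per_stage_terms))
```

For example, `total_loss([(1.0, 2.0, 4.0), (0.5, 1.0, 2.0)], [0.5, 1.0])` combines a coarse and a fine stage using the placeholder weights.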
Key Experimental Results
| Dataset | Metric | MVSGaussian | 3D-GS | ENeRF | Gain |
|---|---|---|---|---|---|
| DTU (Generalizable) | PSNR | 28.21 | - | 27.61 | +0.60 |
| LLFF (Generalizable) | PSNR | 24.07 | - | 23.63 | +0.44 |
| LLFF (45s Opt.) | PSNR | 26.98 | 23.92 (10 min) | 24.89 (1 h) | +3.06 |
| NeRF Syn (50s Opt.) | PSNR | 32.20 | 31.87 (7 min) | 27.57 (1 h) | +0.33 |
| Rendering Speed | FPS | 350+ | 350 | 14 | Real-time |
| Optimization Time | Seconds | 45 | 600 | 3600 | 13.3x |
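
The Gain column appears to be measured against ENeRF in the generalizable rows and against 3D-GS in the per-scene optimization rows. The short check below recomputes it from the numbers in the table; the dictionary keys are labels added here for readability.

```python
# (MVSGaussian, baseline) PSNR pairs copied from the table.
rows = {
    "DTU, generalizable (vs. ENeRF)": (28.21, 27.61),
    "LLFF, generalizable (vs. ENeRF)": (24.07, 23.63),
    "LLFF, 45s optimization (vs. 3D-GS)": (26.98, 23.92),
    "NeRF Synthetic, 50s optimization (vs. 3D-GS)": (32.20, 31.87),
}

# PSNR gain in dB for each row.
gains = {name: round(ours - base, 2) for name, (ours, base) in rows.items()}

# Optimization-time speedup over 3D-GS: 600 s versus 45 s.
speedup = round(600 / 45, 1)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB")
print(f"speedup: {speedup}x")
```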
Ablation Study
- Cascade Structure: Removing it lowers PSNR on DTU by 1.5 dB.
- Hybrid Rendering is Key: GS + VR (28.21) > Pure GS (27.48) > Pure VR (27.39).
- RGB vs. SH: Directly regressing RGB outperforms SH by over 2 dB on NeRF Synthetic.
- Consistency Check Aggregation: The consistency check reaches 26.98 PSNR, ahead of voxel down-sampling (26.72) and direct concatenation (26.18), and it also shortens optimization (45s vs. 90s).
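The hybrid rendering compared in the ablation averages a splatted image with a volume-rendered one. The sketch below illustrates only that combination step: the splatted image is taken as given, and the single-sample volume rendering formula (opacity `1 - exp(-sigma)` with a unit step length) is a simplifying assumption made here, not a formula quoted from the paper.

```python
import numpy as np

def single_sample_volume_render(rgb, sigma):
    """Volume rendering with one sample per ray.

    rgb:   (H, W, 3) colour decoded for the single point on each ray
    sigma: (H, W) non-negative density of that point
    With one sample and an assumed unit step, the ray colour reduces to
    alpha * rgb, where alpha = 1 - exp(-sigma).
    """
    alpha = 1.0 - np.exp(-sigma)
    return alpha[..., None] * rgb

def hybrid_render(splat_image, vr_image):
    """Average the splatting output and the volume rendering output."""
    return 0.5 * (splat_image + vr_image)
```

Because each pixel in the volume rendering branch depends on exactly one 3D point, that branch gives every Gaussian a direct, one-to-one supervision signal, which is the stated motivation for combining it with splatting.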
Highlights & Insights
- Insight on Hybrid Rendering: The many-to-many mapping in splatting acts as a bottleneck for generalization; complementing it with a simple one-to-one volume rendering branch yields a clear gain over either branch alone.
- Outperforming 3D-GS in 45s: High-quality initialization combined with geometric consistency filtering is the key factor, which suggests that "a good initialization is worth a thousand optimization steps."
- View-agnostic Design: Mixed training with 2/3/4 source views lets the model handle varying view counts at inference.
Limitations & Future Work
- Inherits standard MVS limitations: depth estimation is inaccurate in weakly textured and highly reflective areas.
- The generalizable model is trained only on DTU, which limits the diversity of training scenes.
- Future work could explore scaling the hybrid rendering insights to larger-scale scenes (e.g., city-scale).
Related Work & Insights
- vs. MVSplat: Both use MVS cost volumes for depth estimation. However, MVSGaussian additionally incorporates hybrid rendering to improve generalization and supports per-scene optimization; MVSplat is lighter but lacks support for per-scene optimization.
- vs. PixelSplat: MVSGaussian supports multi-view inputs (not only image pairs), incurs lower computational overhead, and reports roughly 14 dB higher PSNR on DTU.
- vs. 3D-GS: MVSGaussian provides high-quality initialization through feed-forward generalization, allowing a 45-second optimization to outperform the 10-minute results of 3D-GS.
- The hybrid rendering (splatting + VR) concept can be migrated to other 3D tasks that require generalization.
- The geometrically consistent point cloud filtering strategy offers valuable insights for any point cloud-based 3D reconstruction.
- MVSGaussian and MVSplat both appeared at ECCV 2024 as "MVS + 3DGS" methods, which suggests that this research direction is promising.
Rating
- Novelty: ⭐⭐⭐⭐ The hybrid rendering design is clever, though the overall framework is a combination of MVS and 3D-GS.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across four datasets with detailed ablations and per-scene breakdowns.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with thorough ablation analysis.
- Value: ⭐⭐⭐⭐⭐ Highly practical; the 45-second per-scene optimization makes the method usable where minutes-long 3D-GS training is too slow.