SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction

Conference: AAAI 2026
arXiv: 2511.14633
Code: Project Page
Area: 3D Vision
Keywords: sparse-view, surface reconstruction, 3D Gaussian splatting, stereo matching prior, pseudo-view consistency

TL;DR

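SparseSurf targets both accurate surface reconstruction and high-quality novel view synthesis from sparse input views. It combines two modules: Stereo Geometry-Texture Alignment (SGTA), which supplies metric-level depth and normal priors from stereo matching, and Pseudo-Feature Enhanced Geometry Consistency (PFEGC), which regularizes geometry through feature consistency across training views and pseudo-views.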

Background & Motivation

Reconstructing accurate 3D surface geometry from sparse-view images is a long-standing challenge. While 3D Gaussian Splatting has achieved strong results in dense-view surface reconstruction, the optimization process tends to overfit when input views are sparse, leading to severe quality degradation.

Core Motivation

Existing sparse-view methods face the following key contradictions:

Contradiction between flattened Gaussians and overfitting: To better conform to surface geometry, methods such as FatesGS flatten Gaussians into 2D planar primitives. However, the resulting high anisotropy under sparse views exacerbates overfitting risk — the reconstruction may appear normal from training views but degrades severely from novel viewpoints.

Scale ambiguity in monocular depth priors: Existing methods commonly employ monocular depth estimators for geometric constraints, but monocular depth suffers from scale ambiguity and lacks confidence estimation, which introduces noise under sparse views and leads to multi-view inconsistency.

Disconnect between rendering quality and geometric accuracy: Existing NVS methods prioritize rendering quality with loose geometric constraints, while surface reconstruction-focused methods often sacrifice rendering quality.

Core Insights

  • Stereo geometry priors can provide metric-level supervision signals that are more reliable than monocular depth priors.
  • As rendering quality improves during training, stereo rendering quality improves as well, yielding more accurate priors and forming a virtuous cycle.
  • Combining feature consistency across training views and pseudo-views can effectively mitigate the overfitting caused by flattened Gaussians.

Method

Overall Architecture

SparseSurf consists of two core modules:

  1. Stereo Geometry-Texture Alignment (SGTA): Generates metric-level depth and normal priors via stereo matching.
  2. Pseudo-Feature Enhanced Geometry Consistency (PFEGC): Mitigates overfitting through multi-view feature consistency across training views and pseudo-views.

Key Designs

1. Stereo Geometry-Texture Alignment (SGTA)

Function: Obtains metric-level depth and normal priors from rendered stereo image pairs to supervise the geometric structure of Gaussians.

Mechanism:

  • For each training viewpoint \(\mathbf{P}_i\), a stereo viewpoint is generated by offsetting the camera along a horizontal baseline \(b\).
  • The stereo viewpoint image is rendered and paired with the original view image as input to a pretrained stereo matching network (FoundationStereo).
  • The predicted disparity map is converted to depth \(\mathcal{D}^*\), and normals \(\mathcal{N}^*\) are computed from the resulting depth.
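
The two geometric steps above (placing the stereo camera and turning disparity into depth) can be sketched in a few lines. This is a minimal illustration using the standard rectified-stereo relation depth = f · b / disparity, not the authors' implementation; the function names, the 500 px focal length, and the 0.1 baseline are all hypothetical values chosen for the example.

```python
def stereo_camera_center(center, right_axis, baseline):
    """Shift a camera center along its unit right axis by the baseline b."""
    return tuple(c + baseline * r for c, r in zip(center, right_axis))


def disparity_to_depth(disparity, focal_px, baseline, eps=1e-6):
    """Convert a disparity map (in pixels) to depth via depth = f * b / d.

    Pixels with (near-)zero disparity have no valid depth and are set to 0.0.
    """
    return [
        [focal_px * baseline / d if d > eps else 0.0 for d in row]
        for row in disparity
    ]


# Hypothetical setup: camera at the origin looking down +z, right axis = +x.
center_right = stereo_camera_center((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.1)
depth = disparity_to_depth(
    [[50.0, 25.0], [0.0, 10.0]], focal_px=500.0, baseline=0.1
)
```

Because the baseline is known exactly (the stereo view is rendered, not captured), the resulting depth carries the scene's scale, which is what lets it avoid the scale ambiguity of monocular depth.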

Key Formulas:

The stereo geometry supervision loss consists of four components:

\[\mathcal{L}_{stereo} = (\lambda_d \mathcal{L}_{depth} + \lambda_n \mathcal{L}_{normal} + \lambda_{nd} \mathcal{L}_{nd}) \mathcal{M}^* + \lambda_s \mathcal{L}_{smooth}\]
  • \(\mathcal{L}_{depth} = \mathcal{L}_1(D, \mathcal{D}^*)\): depth L1 loss
  • \(\mathcal{L}_{normal} = 1 - \text{Cosine}(N, \mathcal{N}^*)\): cosine alignment between rendered and stereo normals
  • \(\mathcal{L}_{nd} = 1 - \text{Cosine}(N_d, \mathcal{N}^*)\): alignment between depth-derived normals and stereo normals
  • \(\mathcal{L}_{smooth}\): edge-aware Laplacian smoothness loss
  • \(\mathcal{M}^*\): reliability mask generated via stereo view consistency checking
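
To make the structure of \(\mathcal{L}_{stereo}\) concrete, the sketch below evaluates it on flat per-pixel lists. It is an illustration only: the unit default weights are placeholders (the summary does not give the \(\lambda\) values), the smoothness term is passed in as a precomputed scalar, and averaging the masked terms over reliable pixels is an assumption about normalization.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors given as sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def stereo_loss(D, N, N_d, D_star, N_star, mask, l_smooth,
                lam_d=1.0, lam_n=1.0, lam_nd=1.0, lam_s=1.0):
    """Masked depth/normal terms plus an unmasked smoothness term.

    D, D_star: per-pixel depths; N, N_d, N_star: per-pixel 3-vectors;
    mask: 0/1 reliability flags (M*). Weights are placeholder values.
    """
    total, count = 0.0, 0
    for d, n, nd, ds, ns, m in zip(D, N, N_d, D_star, N_star, mask):
        if not m:
            continue  # unreliable pixel: no stereo supervision
        l_depth = abs(d - ds)
        l_normal = 1.0 - cosine(n, ns)
        l_nd = 1.0 - cosine(nd, ns)
        total += lam_d * l_depth + lam_n * l_normal + lam_nd * l_nd
        count += 1
    masked = total / count if count else 0.0  # assumed normalization
    return masked + lam_s * l_smooth


# Two pixels; the second has a badly wrong prior but is masked out.
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
loss = stereo_loss(
    D=[1.0, 2.0], N=normals, N_d=normals,
    D_star=[1.5, 9.0], N_star=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    mask=[1, 0], l_smooth=0.0,
)
```

In the usage example only the first pixel contributes, so the loss reduces to its depth error; the masked pixel's inconsistent prior has no effect.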

Design Motivation:

  • Stereo matching produces metric-level depth, avoiding the scale ambiguity of monocular depth.
  • The consistency mask filters unreliable pixels to prevent erroneous supervision from rendering noise.
  • Stereo priors are updated periodically (every 300 iterations) during training, establishing a rendering→prior virtuous cycle.
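
The summary does not spell out how the consistency mask \(\mathcal{M}^*\) is computed. One common way to check stereo consistency is a left-right disparity test, sketched below on a single scanline. This is a swapped-in, generic technique for illustration, not necessarily the check SparseSurf uses; the function name and the 1-pixel threshold are assumptions.

```python
def left_right_mask(disp_left, disp_right, threshold=1.0):
    """Flag left-view pixels whose disparity agrees with the right view.

    disp_left and disp_right are single scanlines of disparities in pixels.
    A left pixel at column x corresponds to right column x - d. It is kept
    only if that column exists and the right view's disparity there differs
    from d by less than `threshold` (an assumed tolerance).
    """
    width = len(disp_left)
    mask = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))
        ok = 0 <= xr < width and abs(d - disp_right[xr]) < threshold
        mask.append(1 if ok else 0)
    return mask


mask = left_right_mask([0.0, 1.0, 1.0, 3.0], [1.0, 1.0, 5.0, 0.0])
```

Pixels that fail the check (for example in occluded regions or where the rendered stereo view is noisy) simply receive no stereo supervision.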

2. Pseudo-Feature Enhanced Geometry Consistency (PFEGC)

Function: Alleviates overfitting of flattened Gaussians under sparse views through multi-view feature consistency constraints from pseudo-views.

Mechanism: Comprises two sub-modules — pseudo-view feature consistency and training-view feature alignment.

Pseudo-view Feature Consistency:

  1. A frozen feature extractor (Vis-MVSNet) extracts features \(\mathcal{F}^*\) from GT images.
  2. Each Gaussian is augmented with 8-dimensional feature attributes, learned via a feature distillation loss: \(\mathcal{L}_f = 1 - \text{Cosine}(F, \mathcal{F}^*)\)
  3. Feature maps are rendered at random pseudo-viewpoints \(\mathcal{V}_p\), and feature consistency is verified via bidirectional warping.
  4. Patch-level cosine similarity (rather than pixel-level) is adopted to prevent low-fidelity pseudo-view regions from contaminating training-view features:

\[\mathcal{L}_{pseudo} = \sum_{i,j} \mathcal{M}_{feat}^{(i,j)} [1 - \text{Cosine}(\bar{\mathcal{F}}_{p2t}^{(i,j)}, \bar{\mathcal{F}}_r^{(i,j)})]\]
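
The patch-level loss can be illustrated with a small pure-Python sketch. It assumes the barred features \(\bar{\mathcal{F}}\) are means over non-overlapping square patches and that \(\mathcal{M}_{feat}\) holds one 0/1 entry per patch; both are assumptions made for the example, as are the function names and the `[H][W][C]` nested-list layout.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors given as sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def patch_mean(feat, top, left, size):
    """Average the C-dim feature vectors inside one size x size patch."""
    channels = len(feat[0][0])
    acc = [0.0] * channels
    for i in range(top, top + size):
        for j in range(left, left + size):
            for c in range(channels):
                acc[c] += feat[i][j][c]
    return [a / (size * size) for a in acc]


def pseudo_loss(feat_warped, feat_rendered, patch_mask, size):
    """Sum of (1 - cosine) over patches whose mask entry is 1."""
    loss = 0.0
    for pi, mask_row in enumerate(patch_mask):
        for pj, m in enumerate(mask_row):
            if not m:
                continue  # low-confidence patch is ignored
            a = patch_mean(feat_warped, pi * size, pj * size, size)
            b = patch_mean(feat_rendered, pi * size, pj * size, size)
            loss += 1.0 - cosine(a, b)
    return loss


# One 2x2 patch with 2-channel features.
same = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
other = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
loss_match = pseudo_loss(same, same, [[1]], size=2)
loss_mismatch = pseudo_loss(same, other, [[1]], size=2)
loss_masked = pseudo_loss(same, other, [[0]], size=2)
```

Averaging before comparing means a few noisy pixels inside a patch shift the mean only slightly, which is the robustness argument for patch-level over pixel-level comparison in pseudo-views.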

Training-view Feature Alignment: Pixel-level feature consistency is enforced between ground-truth training views: \(\mathcal{L}_{train} = 1 - \text{Cosine}(\mathcal{F}_{s2t}, \mathcal{F}_s)\)

Design Motivation:

  • Pseudo-views supplement the insufficient coverage of sparse training views, though their rendering quality may be lower.
  • Patch-level rather than pixel-level consistency provides robustness against rendering noise in pseudo-views.
  • Binary confidence mask \(\mathcal{M}_{feat}\) further filters unreliable regions.
  • Pixel-level constraints between training views provide stronger geometric consistency guarantees.

Loss & Training

Overall loss function (7,000 iterations, single RTX 3090):

\[\mathcal{L} = \mathcal{L}_c + \mathcal{L}_{stereo} + \lambda_1 \mathcal{L}_f + \lambda_2 \mathcal{L}_{pseudo} + \lambda_3 \mathcal{L}_{train} + \lambda_4 \mathcal{L}_s + \lambda_5 \mathcal{L}_{dn}\]

Different losses are activated at different stages:

  • \(\mathcal{L}_c, \mathcal{L}_s, \mathcal{L}_f\): activated from iteration 0
  • \(\mathcal{L}_{stereo}\): activated from iteration 500 (after rendering quality improves)
  • \(\mathcal{L}_{pseudo}, \mathcal{L}_{dn}\): activated from iteration 3,000
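
The staged activation can be written as a small scheduling helper. This is a sketch of the schedule as stated here, with two caveats: \(\mathcal{L}_{train}\) is left out because its activation point is not given, and counting the 300-iteration stereo-prior refresh from iteration 500 is an assumption. Function names are hypothetical.

```python
def active_losses(iteration):
    """Return the names of the loss terms active at a given iteration."""
    names = ["L_c", "L_s", "L_f"]          # active from iteration 0
    if iteration >= 500:
        names.append("L_stereo")           # after rendering quality improves
    if iteration >= 3000:
        names += ["L_pseudo", "L_dn"]
    return names


def should_refresh_stereo_prior(iteration, start=500, interval=300):
    """True when the stereo priors are due for re-estimation.

    Assumes the refresh interval is counted from the iteration at which
    L_stereo is first activated.
    """
    return iteration >= start and (iteration - start) % interval == 0


refresh_points = [i for i in range(0, 1500) if should_refresh_stereo_prior(i)]
```

Delaying \(\mathcal{L}_{stereo}\) matters because its priors come from rendered stereo pairs: until the renders are reasonable, the stereo network would be matching noise.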

Key Experimental Results

Main Results

DTU Surface Reconstruction (Chamfer Distance↓, little-overlap setting, 3 views):

| Method | Type | Mean CD↓ |
| --- | --- | --- |
| COLMAP | Traditional MVS | 2.61 |
| NeuSurf | Neural Implicit | 1.35 |
| FatesGS | GS Surface Recon. | 1.37 |
| UFORecon | Generalizable Implicit | 1.40 |
| SparseSurf | Ours | 1.05 |

DTU Surface Reconstruction (large-overlap setting, 3 views):

| Method | Mean CD↓ |
| --- | --- |
| FatesGS | 0.92 |
| NeuSurf | 0.99 |
| UFORecon | 0.99 |
| SparseSurf | 0.89 |

DTU Sparse-View Novel View Synthesis (NVS):

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | AVGE↓ |
| --- | --- | --- | --- | --- |
| CoR-GS | 19.21 | 0.853 | 0.119 | 0.082 |
| Binocular3DGS | 20.71 | 0.862 | 0.111 | - |
| NexusGS | 20.21 | 0.869 | 0.102 | 0.071 |
| SparseSurf | 21.31 | 0.886 | 0.089 | 0.067 |
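
The AVGE column is not defined in this summary. In the sparse-view NVS literature it usually denotes the "average error" introduced with mip-NeRF: the geometric mean of MSE (recovered from PSNR), \(\sqrt{1 - \text{SSIM}}\), and LPIPS. Assuming the table follows that convention, the metric can be computed as below; the function name is hypothetical.

```python
def avge(psnr, ssim, lpips):
    """Geometric mean of MSE = 10^(-PSNR/10), sqrt(1 - SSIM), and LPIPS."""
    mse = 10.0 ** (-psnr / 10.0)
    return (mse * (1.0 - ssim) ** 0.5 * lpips) ** (1.0 / 3.0)


# Rows taken from the table above.
cor_gs = avge(19.21, 0.853, 0.119)
nexus_gs = avge(20.21, 0.869, 0.102)
```

For the CoR-GS and NexusGS rows this lands within about 0.001 of the tabulated AVGE. Agreement is not guaranteed for every row, since a value averaged per scene generally differs from one computed from dataset-averaged PSNR, SSIM, and LPIPS.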

Ablation Study

Ablation on DTU (large-overlap setting):

| Configuration | Accuracy↓ | Completion↓ | Average↓ | Note |
| --- | --- | --- | --- | --- |
| Baseline (no extra losses) | 1.318 | 2.302 | 1.810 | RGB supervision only |
| + \(\mathcal{L}_{stereo}\) | 0.822 | 1.612 | 1.217 | Stereo depth yields large improvement |
| + \(\mathcal{L}_{pseudo}\) | 0.610 | 1.327 | 0.969 | Pseudo-views further reduce overfitting |
| + \(\mathcal{L}_{train}\) (full) | 0.533 | 1.239 | 0.886 | Training-view alignment enhances robustness |

Key Findings

  1. Stereo depth contributes the most: the Average column (the mean of accuracy and completion, i.e. the Chamfer distance) drops from 1.810 to 1.217, a 33% reduction.
  2. Progressive module stacking yields consistent gains: Each module demonstrates clear independent contribution.
  3. Robustness to different stereo matching networks: Both FoundationStereo and Stereo Anywhere are effective.
  4. Insensitivity to baseline selection: Baselines of 3%, 7%, and 10% of scene radius all perform well.
  5. Simultaneous SOTA on reconstruction and rendering: Unlike prior methods that trade off between the two objectives.
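
The per-module gains quoted above follow directly from the Average column of the ablation table. The short script below recomputes the relative reduction contributed by each added loss; only the four tabulated numbers are used, and the helper name is hypothetical.

```python
# Average (mean of accuracy and completion) from the ablation table.
averages = [
    ("Baseline", 1.810),
    ("+ L_stereo", 1.217),
    ("+ L_pseudo", 0.969),
    ("+ L_train (full)", 0.886),
]


def relative_reduction(before, after):
    """Percentage decrease going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Reduction of each configuration relative to the previous one.
steps = [
    (name, relative_reduction(prev, cur))
    for (_, prev), (name, cur) in zip(averages, averages[1:])
]
total = relative_reduction(averages[0][1], averages[-1][1])
```

Each step is measured against the previous configuration: roughly 33% for \(\mathcal{L}_{stereo}\), 20% for \(\mathcal{L}_{pseudo}\), and 9% for \(\mathcal{L}_{train}\), for a total reduction of about 51% from the baseline.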

Highlights & Insights

  1. The idea of replacing monocular depth with stereo depth is insightful — leveraging the rendering capability of 3DGS to generate stereo pairs for metric-level priors.
  2. Virtuous cycle design: improved rendering quality → more accurate stereo priors → better geometric optimization → further improved rendering quality.
  3. The choice of patch-level vs. pixel-level feature consistency reflects a deep understanding of noise in pseudo-view rendering.
  4. Both reconstruction and rendering objectives are addressed simultaneously, unlike prior methods that focus on only one.

Limitations & Future Work

  1. Reliance on pretrained stereo matching networks and feature extractors adds computational overhead, since both must be run during per-scene optimization (the stereo priors are refreshed every 300 iterations).
  2. Performance under extreme sparsity (e.g., 2 views) remains to be validated.
  3. The pseudo-view generation strategy (interpolated from training camera positions) may lack flexibility.
  4. TSDF fusion for mesh extraction introduces an additional post-processing step.

Related Work

  • FatesGS (AAAI25): Sparse-view surface reconstruction using flattened Gaussians and monocular depth; the most direct baseline for comparison.
  • GS2Mesh: The most closely related prior work, which also uses stereo matching for mesh extraction but fails under sparse views due to rendering quality degradation.
  • CoR-GS (ECCV24): A representative method that alleviates sparse-view issues through training process optimization.
  • 2DGS: Foundational method for flat Gaussian surface reconstruction.
  • NeuSurf: State-of-the-art neural implicit surface reconstruction under sparse views.

Rating

  • Novelty: ⭐⭐⭐⭐ — The bootstrapped stereo prior update mechanism and hierarchical feature consistency design represent clear innovations.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, two sparsity settings, comprehensive ablation and supplementary experiments.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation analysis is thorough and method exposition is clear.
  • Value: ⭐⭐⭐⭐⭐ — Achieving simultaneous SOTA on both reconstruction and rendering under sparse views demonstrates strong practical applicability.