VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment¶
Conference: NeurIPS 2025 arXiv: 2510.11473 Code: GitHub Area: 3D Vision Keywords: 3D Gaussian Splatting, surface reconstruction, multi-view alignment, normal consistency, geometric representation
TL;DR¶
By introducing four view-alignment strategies (edge-aware image supervision, visibility-aware multi-view photometric alignment, normal consistency constraints, and deep image feature alignment), VA-GS significantly improves the geometric accuracy of 3D Gaussian Splatting, achieving state-of-the-art performance in both surface reconstruction and novel view synthesis.
Background & Motivation¶
3D Gaussian Splatting (3DGS) has rapidly become a mainstream scene representation method due to its real-time rendering and high-quality view synthesis capabilities. However, 3DGS still exhibits significant limitations in accurate surface reconstruction. The root cause lies in the discrete and unstructured nature of Gaussian primitives: optimizing solely with RGB rendering loss leads to insufficient geometric accuracy, particularly in two scenarios:
Illumination-induced artifacts: Shadows and specular highlights distort the photometric loss, causing geometric drift in the reconstruction.
Boundary ambiguity: The directional uncertainty of Gaussian primitives at object boundaries results in poorly defined surface edges.
Existing methods such as SuGaR extract meshes via density fields but are computationally expensive; 2DGS uses 2D surface elements but fails in unbounded scenes; GS-Pull introduces SDFs but produces overly smooth surfaces; PGSR adopts a local planar assumption but cannot handle illumination issues.
The core idea of this paper is to reformulate surface reconstruction as a view alignment problem — by enforcing edge and normal alignment within single views, and photometric and feature alignment across multiple views, the geometric properties of Gaussian primitives are constrained from multiple complementary perspectives, achieving high-fidelity surface reconstruction without introducing any additional implicit representation.
Method¶
Overall Architecture¶
Given a set of posed RGB images, the goal is to learn a collection of 3D Gaussian primitives and their attributes (color, opacity, position, shape) to represent the 3D scene geometry. The method augments standard 3DGS with five loss functions: \(\mathcal{L}_I\) (edge-aware image reconstruction), \(\mathcal{L}_{nc}\) (normal consistency), \(\mathcal{L}_{ns}\) (normal smoothness), \(\mathcal{L}_p\) (multi-view photometric alignment), and \(\mathcal{L}_f\) (multi-view feature alignment).
Key Designs¶
- Edge-Aware Image Reconstruction (\(\mathcal{L}_I\)): Built on the standard L1 + D-SSIM loss, an image-gradient term \(\beta_2\,\|\nabla\tilde{I} - \nabla I\|_1\) is added to supervise edge information. The motivation is that the plain color loss over-smooths high-frequency regions; the gradient constraint preserves sharp structures and boundary details.
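A minimal NumPy sketch of this term (the D-SSIM component of the full loss is omitted, and the weight `beta2` is a placeholder, not the paper's tuned value):

```python
import numpy as np

def image_gradient(img):
    """Forward-difference gradient magnitude of a grayscale image (H, W)."""
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    return gx + gy

def edge_aware_image_loss(rendered, target, beta2=0.1):
    """L1 color term plus a gradient L1 term that supervises edges.
    beta2 is a placeholder weight; the paper's value may differ."""
    color_l1 = np.abs(rendered - target).mean()
    grad_l1 = np.abs(image_gradient(rendered) - image_gradient(target)).mean()
    return color_l1 + beta2 * grad_l1
```

The gradient term contributes nothing when rendered and target edges agree, and grows where the rendering blurs sharp structure.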
- Edge-Aware Normal Consistency (\(\mathcal{L}_{nc}\)): Aligns the Gaussian primitive normals \(\tilde{N}\) with the depth-gradient normals \(\hat{N}\), using an edge weight \(\delta = (1-\nabla I)^2\) to reduce the loss contribution in edge regions. The motivation is that Gaussian normal directions are ambiguous at edges, so forcing alignment there would inject erroneous supervision; the loss is therefore down-weighted at edges.
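A minimal NumPy sketch of the edge-weighted normal alignment, assuming unit normal maps and a grayscale image in [0, 1] (the exact gradient operator is an assumption):

```python
import numpy as np

def edge_weight(img):
    """delta = (1 - |grad I|)^2: close to 1 in flat regions, small at edges."""
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    grad = np.clip(gx + gy, 0.0, 1.0)
    return (1.0 - grad) ** 2

def normal_consistency_loss(n_gauss, n_depth, img):
    """Edge-weighted (1 - cosine) between rendered Gaussian normals and
    normals derived from depth gradients. Inputs: (H, W, 3) unit normals,
    (H, W) grayscale image."""
    cos = np.sum(n_gauss * n_depth, axis=-1)  # per-pixel cosine similarity
    return np.mean(edge_weight(img) * (1.0 - cos))
```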
- Normal Smoothness Constraint (\(\mathcal{L}_{ns}\)): Penalizes normal differences between neighboring pixels, using a threshold \(\tau\) and a ReLU gating mechanism to distinguish true geometric edges from noise. This addresses normal noise in texture-less regions and spurious edges introduced by illumination variation.
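One plausible reading of this constraint, sketched in NumPy: neighboring normals are pulled together only where the image gradient stays below \(\tau\) (the exact ReLU gating in the paper may differ from this form):

```python
import numpy as np

def normal_smoothness_loss(normals, img, tau=0.1):
    """Penalize normal differences between horizontal neighbors, gated off
    where the image gradient exceeds tau (a likely true geometric edge).
    tau and the gating shape are assumptions, not the paper's exact form."""
    dn = np.abs(normals[:, 1:] - normals[:, :-1]).sum(axis=-1)  # neighbor normal difference
    di = np.abs(img[:, 1:] - img[:, :-1])                       # neighbor image difference
    gate = np.maximum(tau - di, 0.0) / tau                      # ReLU gate: 1 in flat regions, 0 past tau
    return np.mean(gate * dn)
```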
- Visibility-Aware Multi-View Photometric Alignment (\(\mathcal{L}_p\)): Following classical MVS, reference-view pixels are projected onto source views via the homography \(H_{rs}\), and NCC photometric consistency is computed. Key innovations:
  - Visibility term \(\upsilon_{rs}\): checks whether the projected point falls within the source view's field of view.
  - Occlusion weight \(\omega\): filters occluded or geometrically erroneous pixels based on the reprojection error \(\varphi\), with \(\omega = \exp(-\varphi)\) for \(\varphi < 1\).
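A minimal NumPy sketch of the NCC score and occlusion weighting; the homography warp via \(H_{rs}\) is assumed to happen upstream, and the cutoff-to-zero for \(\varphi \ge 1\) is an assumption consistent with the filtering described above:

```python
import numpy as np

def ncc(p, q, eps=1e-8):
    """Normalized cross-correlation between two patches; 1 = perfectly consistent."""
    p = p - p.mean()
    q = q - q.mean()
    return float((p * q).sum() / (np.sqrt((p ** 2).sum() * (q ** 2).sum()) + eps))

def occlusion_weight(phi):
    """omega = exp(-phi) for reprojection error phi < 1; larger errors are
    treated as occlusion and the pixel is discarded (assumed cutoff)."""
    return np.exp(-phi) if phi < 1.0 else 0.0

def photometric_alignment_loss(ref_patch, warped_src_patch, phi):
    """1 - NCC on the homography-warped source patch, occlusion-weighted."""
    return occlusion_weight(phi) * (1.0 - ncc(ref_patch, warped_src_patch))
```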
- Multi-View Feature Alignment (\(\mathcal{L}_f\)): Extracts deep image features with a pretrained network and computes the cosine similarity between features at corresponding locations in the reference and source views. The motivation is that image-level losses are sensitive to noise, blur, and illumination changes, whereas high-dimensional feature spaces are more robust.
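A minimal NumPy sketch of the feature term, assuming (H, W, C) feature maps already extracted and warped into correspondence (the paper's choice of pretrained backbone is not reproduced here):

```python
import numpy as np

def feature_alignment_loss(f_ref, f_src_warped, eps=1e-8):
    """Mean (1 - cosine similarity) between feature maps (H, W, C) at
    corresponding locations. Any per-pixel descriptor works for this sketch."""
    num = np.sum(f_ref * f_src_warped, axis=-1)
    den = np.linalg.norm(f_ref, axis=-1) * np.linalg.norm(f_src_warped, axis=-1) + eps
    return float(np.mean(1.0 - num / den))
```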
Loss & Training¶
The final loss is: \(\mathcal{L} = \mathcal{L}_I + \lambda_1\mathcal{L}_{nc} + \lambda_2\mathcal{L}_{ns} + \lambda_3\mathcal{L}_p + \lambda_4\mathcal{L}_f\)
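The objective above can be combined in a one-liner; note the \(\lambda\) values below are placeholders, not the paper's tuned weights:

```python
def total_loss(L_I, L_nc, L_ns, L_p, L_f, lambdas=(0.05, 0.05, 0.2, 0.1)):
    """Weighted sum of the five terms from the final objective.
    The lambda defaults are placeholders, not the paper's values."""
    l1, l2, l3, l4 = lambdas
    return L_I + l1 * L_nc + l2 * L_ns + l3 * L_p + l4 * L_f
```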
Training proceeds in stages:
- First 7,000 steps: pretraining with the color loss only to obtain a coarse geometric initialization.
- The edge terms and normal alignment are then added.
- Multi-view photometric alignment is added (8,000 steps).
- Multi-view feature alignment is added (5,000 steps).
- Final retraining for novel view synthesis (10,000 steps).
Key Experimental Results¶
Main Results: DTU Surface Reconstruction (Chamfer Distance ↓)¶
| Method | Mean CD | Training Time |
|---|---|---|
| 3DGS | 1.96 | 3.4m |
| 2DGS | 0.80 | 5.8m |
| GS-Pull | 0.75 | 5.6m |
| PGSR | 0.53 | 15m |
| GausSurf | 0.52 | - |
| VA-GS (Ours) | 0.49 | 15.5m |
TNT Dataset Reconstruction (F1-score ↑)¶
| Method | Barn | Truck | Mean (all TNT scenes) |
|---|---|---|---|
| 3DGS | 0.13 | 0.19 | 0.09 |
| PGSR | 0.66 | 0.66 | 0.52 |
| GausSurf | 0.50 | 0.65 | 0.50 |
| VA-GS (Ours) | 0.71 | 0.64 | 0.54 |
Ablation Study (TNT F1-score)¶
| Configuration | Precision | Recall | F1 | Note |
|---|---|---|---|---|
| \(\mathcal{L}_I\) only | 0.09 | 0.23 | 0.13 | No geometric constraint, worst |
| w/o \(\mathcal{L}_{nc}+\mathcal{L}_{ns}\) | 0.40 | 0.57 | 0.46 | Normal constraints are critical |
| w/o \(\mathcal{L}_p+\mathcal{L}_f\) | 0.33 | 0.40 | 0.36 | Multi-view alignment is indispensable |
| + scale loss (planarization) | 0.51 | 0.60 | 0.54 | No additional gain |
| Full model | 0.51 | 0.60 | 0.54 | Modules are complementary |
Key Findings¶
- Flattening 3D Gaussians into planar discs (scale loss) yields no benefit for this method, and even degrades rendering quality on Mip-NeRF 360 — indicating that preserving full 3D Gaussian representation is preferable.
- Using \(N=3\) source views for multi-view alignment is the optimal trade-off; \(N=4\) provides no additional gain but increases computational cost.
- Removing either photometric or feature alignment individually has limited impact, but removing both causes F1 to drop sharply from 0.54 to 0.36.
Highlights & Insights¶
- No external geometric representation (e.g., SDF, mesh) is required; high-quality reconstruction is achieved purely by constraining the Gaussian primitives themselves.
- The visibility and occlusion weighting design is simple yet effective; using the reprojection error as an occlusion criterion is concise and elegant.
- The strategy of down-weighting normal loss at edges (\(\delta=(1-\nabla I)^2\)) directly and effectively resolves boundary normal ambiguity.
Limitations & Future Work¶
- Training is relatively slow (15.5m vs. 3.4m for 3DGS) due to the computational overhead of multi-view alignment.
- The normal smoothness constraint may over-smooth regions of high curvature.
- The deep image feature alignment depends on the quality of the pretrained feature extractor.
Related Work & Insights¶
- Multi-view photometric consistency is a classical approach for addressing the insufficiency of single-view supervision; the key challenge lies in handling occlusion and visibility.
- The robustness of feature-level alignment to illumination variation is worth borrowing in other tasks such as SLAM and SfM.
Rating¶
- Novelty: ⭐⭐⭐⭐ Each module is well-designed; classical MVS ideas are organically integrated with 3DGS.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage of DTU/TNT/Mip-NeRF360 with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Formulations are clearly derived, though notation is occasionally dense.
- Value: ⭐⭐⭐⭐ Practically valuable for surface reconstruction with 3DGS.