Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction
Conference: AAAI 2026 | arXiv: 2511.22704 | Code: Project Page | Area: 3D Vision | Keywords: feed-forward Gaussian splatting, human-centered scene, scale-awareness, point map reconstruction, free-viewpoint rendering
TL;DR
This paper proposes Splat-SAP, a feed-forward method that reconstructs scale-aware point maps from wide-baseline stereo camera inputs and renders free-viewpoint video of human-centered scenes via a Gaussian Plane, requiring neither per-scene optimization nor 3D geometric supervision.
Background & Motivation
Feed-forward free-viewpoint video synthesis is critical for applications such as telepresence and sports broadcasting. Existing feed-forward Gaussian splatting methods suffer from the following challenges:
Challenge 1: Geometric failure under large-baseline inputs

- Methods such as MVSplat and MVSGaussian rely on multi-view stereo matching to establish geometric priors.
- These methods require substantial overlap between the input views.
- When the two input cameras are widely separated (a large baseline), reliable geometric priors cannot be obtained.
Challenge 2: Scale ambiguity in DUSt3R-based methods

- DUSt3R and MASt3R introduce point map representations capable of predicting reasonable geometry under large baselines.
- However, they normalize point maps into a scale-invariant canonical space.
- During per-frame inference, inconsistent scale normalization causes severe temporal jitter in the reconstructions.
- Depth variations induced by human motion produce large discontinuities in canonical space.
Challenge 3: Difficulty of acquiring 3D supervision data

- Training scale-aware geometric foundation models typically requires large quantities of 3D data.
- Acquiring 3D geometric data is time-consuming and cumbersome.
The core contribution of Splat-SAP is to learn a scale-aware point map transformation in a self-supervised manner, mapping canonical-space point maps to metric space without any 3D geometric supervision.
Method
Overall Architecture
A coarse-to-fine, two-stage pipeline:

- Stage 1 (2D coarse stage): Starting from MASt3R-initialized point maps, learns an affine transformation (scaling + translation) that maps them from canonical space to metric space.
- Stage 2 (3D fine stage): Projects the transformed point maps to the target viewpoint, performs stereo refinement via a 3D cost volume, constructs a Gaussian Plane, and renders high-quality novel views.
Key Designs
1. Scale-Aware Geometry Reconstruction: Self-supervised affine transformation learning
Point map initialization: MASt3R is used to predict point maps \(X^i\) (in canonical space) from low-resolution (512×288) stereo inputs.
Scale factor learning:

- The camera's intrinsic focal length \(f\) and the baseline distance \(d\) are embedded via positional encoding.
- Global information is extracted from ViT features via self-attention and cross-attention.
- An MLP predicts a 3D scaling factor \(S\) that handles distortions in the original point maps.
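As a concrete illustration, here is a minimal PyTorch sketch of such a scale head. The module name `ScaleHead`, the token dimensions, the single camera token, and the `exp()` used to keep the scale positive are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ScaleHead(nn.Module):
    """Hypothetical sketch: (f, d) are positionally encoded into a camera
    token, fused with ViT features via self-/cross-attention, and an MLP
    regresses a 3D scaling factor S. Names and sizes are illustrative."""

    def __init__(self, dim=256, n_freq=8):
        super().__init__()
        self.n_freq = n_freq
        self.embed = nn.Linear(2 * 2 * n_freq, dim)          # PE of (f, d) -> camera token
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def positional_encoding(self, x):
        # x: (B, 2) holding focal length f and baseline distance d.
        freqs = 2.0 ** torch.arange(self.n_freq, device=x.device) * torch.pi
        angles = x[..., None] * freqs                         # (B, 2, n_freq)
        return torch.cat([angles.sin(), angles.cos()], -1).flatten(1)

    def forward(self, vit_tokens, f, d):
        # vit_tokens: (B, N, dim) image features; f, d: (B,) scalars.
        cam = self.embed(self.positional_encoding(torch.stack([f, d], -1)))
        tokens, _ = self.self_attn(vit_tokens, vit_tokens, vit_tokens)
        q = cam.unsqueeze(1)                                  # camera token queries image context
        fused, _ = self.cross_attn(q, tokens, tokens)
        return self.mlp(fused.squeeze(1)).exp()               # positive 3D scale S, shape (B, 3)
```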
Per-pixel translation learning:

- Scaling alone cannot eliminate per-pixel offsets between the two point maps.
- Inspired by view-consistency checks in MVS, features from one view are projected into the other to obtain correspondences.
- A GRU iteratively estimates a per-pixel translation \(T^i\).
The final metric-space point positions are: \(X_t^i = SX^i + T^i\)
Design motivation: Scaling (via intrinsic embedding) combined with translation (via extrinsic projection) constitutes an affine transformation from canonical space to metric space.
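Below is a minimal PyTorch sketch of the iterative translation estimate. It simplifies the paper's design to a per-pixel `GRUCell`, and `corr_feat` stands in for the projected cross-view correspondence features; all names and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class TranslationGRU(nn.Module):
    """Hypothetical sketch: a per-pixel GRU iteratively refines a 3-channel
    translation field T from correspondence features."""

    def __init__(self, feat_dim=64, hidden=64, iters=4):
        super().__init__()
        self.iters = iters
        self.gru = nn.GRUCell(feat_dim + 3, hidden)
        self.head = nn.Linear(hidden, 3)                      # per-pixel translation update

    def forward(self, corr_feat):
        # corr_feat: (B, H, W, feat_dim) cross-view correspondence features.
        B, H, W, C = corr_feat.shape
        x = corr_feat.reshape(B * H * W, C)
        h = x.new_zeros(B * H * W, self.gru.hidden_size)
        T = x.new_zeros(B * H * W, 3)
        for _ in range(self.iters):                           # iterative refinement
            h = self.gru(torch.cat([x, T], -1), h)
            T = T + self.head(h)
        return T.reshape(B, H, W, 3)

# Metric-space point map, as in the paper: X_t = S * X + T, where
# X: (B, H, W, 3) canonical point map and S broadcasts as (B, 1, 1, 3).
```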
2. Gaussian Plane Rendering: Efficient and complete rendering
3D refinement:

- The transformed point set is projected to the target viewpoint via α-blending to obtain an initial depth map \(\mathcal{D}^k\).
- Multiple depth candidates are sampled along the camera ray near the initial depth.
- Source-view features are warped to the target view to construct a 3D cost volume.
- The refined depth is obtained via 3D convolution and depth probability regression: \(\bar{d} = \sum_n w_n d_n\).
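The regression step is the standard soft-argmin: a softmax over the cost volume yields per-hypothesis weights \(w_n\), and the refined depth is their expectation. A small runnable sketch, with illustrative shapes and sampling range:

```python
import torch

def regress_depth(cost_volume, depth_candidates):
    """Soft-argmin depth regression: d_bar = sum_n w_n * d_n.

    cost_volume:      (B, N, H, W) matching scores for N depth hypotheses
                      (higher = better match, e.g. after 3D convolution).
    depth_candidates: (B, N, H, W) depths sampled around the initial depth map.
    """
    w = torch.softmax(cost_volume, dim=1)        # per-pixel probability over hypotheses
    return (w * depth_candidates).sum(dim=1)     # (B, H, W) refined depth

# Example: 8 hypotheses around an initial depth D0 (all numbers illustrative).
B, H, W, N = 1, 72, 128, 8
D0 = torch.full((B, H, W), 2.0)
offsets = torch.linspace(-0.2, 0.2, N).view(1, N, 1, 1)
cands = D0.unsqueeze(1) + offsets
scores = torch.randn(B, N, H, W)                 # stands in for the real cost volume
depth = regress_depth(scores, cands)             # (1, 72, 128)
```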
Gaussian Plane construction:

- Gaussian primitives are anchored on the target-view image plane rather than using source-view point maps as Gaussian positions, substantially reducing redundancy in overlapping regions.
- Color initialization uses weighted colors warped from the source views: \(C^k = \sum_i w_c^i C^{i \rightarrow k}\).
- The remaining attributes (rotation, scale, opacity) are predicted by convolutional heads from aggregated features.
- Residual color learning: \(\mathcal{P}_c = \alpha C + (1-\alpha) \Delta C\).
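A short sketch of the color path under these definitions; the tensor shapes, the per-view weights, and the function name are illustrative assumptions:

```python
import torch

def gaussian_plane_color(C_src_warped, w_c, delta_C, alpha):
    """Hypothetical sketch of the Gaussian Plane color path.

    C_src_warped: (B, V, 3, H, W) source-view colors warped to the target view.
    w_c:          (B, V, 1, H, W) per-view blend weights (e.g. softmax over V).
    delta_C:      (B, 3, H, W)    residual color from a conv head.
    alpha:        (B, 1, H, W)    learned blending weight in [0, 1].
    """
    C = (w_c * C_src_warped).sum(dim=1)            # C^k = sum_i w_c^i * C^{i->k}
    return alpha * C + (1.0 - alpha) * delta_C     # P_c = alpha*C + (1-alpha)*dC
```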
Final rendering is performed at 1024×576 resolution and splatted to 1280×720 output.
3. Self-supervised Training Strategy: No 3D geometric supervision required
Stage 1 loss: \(\mathcal{L}_{stage1} = \mathcal{L}_{render} + \gamma \mathcal{L}_{CD}\)
where \(\mathcal{L}_{CD}\) is a Chamfer distance regularization between the two 6D point sets (XYZ+RGB), encouraging both point maps to converge to a consistent geometry. MASt3R weights are frozen during training.
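A minimal sketch of such a 6D Chamfer regularizer, assuming plain L2 distances and mean reduction (the paper's exact weighting may differ):

```python
import torch

def chamfer_6d(P1, P2):
    """Symmetric Chamfer distance on 6D points (XYZ + RGB), a sketch of the
    L_CD regularizer between the two transformed point maps.

    P1: (N, 6), P2: (M, 6) point sets with concatenated position and color.
    """
    d = torch.cdist(P1, P2)                       # (N, M) pairwise L2 distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Usage: per view, points = torch.cat([xyz, rgb], dim=-1).reshape(-1, 6).
```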
Stage 2 loss: \(\mathcal{L}_{stage2} = \lambda_1 \mathcal{L}_{render}(\hat{I}_f, I_f^{gt}) + \lambda_2 \mathcal{L}_{render}(\hat{I}_h, I_h^{gt})\)
Both stages require no 3D geometric supervision and are trained entirely with rendering losses.
Loss & Training
- Rendering loss: \(\mathcal{L}_{render} = 0.8 \mathcal{L}_1 + 0.2 \mathcal{L}_{ssim}\) (see the sketch after this list)
- Stage 1: 100k iterations for the affine learning module (using all training data)
- Stage 2: 60k iterations per camera type for the rendering module
- Trainable on a single RTX 3090 (24 GB)
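A runnable sketch of this rendering loss, assuming a simplified single-scale SSIM with an average-pooling window and \(\mathcal{L}_{ssim} = 1 - \mathrm{SSIM}\):

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01**2, C2=0.03**2):
    """Simplified single-scale SSIM over (B, C, H, W) images with an 11x11
    average window (the paper likely uses a standard Gaussian-window SSIM)."""
    mu_x = F.avg_pool2d(x, 11, 1, 5)
    mu_y = F.avg_pool2d(y, 11, 1, 5)
    sx = F.avg_pool2d(x * x, 11, 1, 5) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 11, 1, 5) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 11, 1, 5) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sxy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sx + sy + C2)
    return (num / den).mean()

def render_loss(pred, gt):
    """L_render = 0.8 * L1 + 0.2 * L_ssim, with L_ssim taken as 1 - SSIM."""
    return 0.8 * (pred - gt).abs().mean() + 0.2 * (1.0 - ssim(pred, gt))
```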
Key Experimental Results
Main Results (Rendering Quality)
| Method | Camera PSNR↑ | Camera SSIM↑ | GoPro PSNR↑ | GoPro SSIM↑ | Mobile PSNR↑ | Mobile SSIM↑ |
|---|---|---|---|---|---|---|
| NoPoSplat | 25.035 | 0.866 | 26.128 | 0.889 | 21.594 | 0.591 |
| 4D-GS | 27.814 | 0.906 | 27.244 | 0.907 | 25.655 | 0.825 |
| MVSplat | 27.899 | 0.902 | 29.942 | 0.934 | 26.545 | 0.805 |
| MVSGaussian | 29.326 | 0.957 | 27.413 | 0.926 | 19.927 | 0.683 |
| ENeRF | 28.272 | 0.943 | 29.906 | 0.943 | 20.579 | 0.640 |
| Splat-SAP | 32.220 | 0.957 | 31.640 | 0.955 | 25.721 | 0.827 |
Splat-SAP achieves substantial PSNR gains over the strongest baseline on the Camera and GoPro datasets (+2.9 dB and +1.7 dB, respectively).
Geometry Reconstruction Quality
| Method | Pred→GT CD↓ | GT→Pred CD↓ | Notes |
|---|---|---|---|
| DUSt3R | 0.305 | 0.160 | Significant foreground–background misalignment |
| VGGT | 0.288 | 0.129 | Difficulty in two-view alignment |
| Pow3R | 0.281 | 0.134 | Insufficient even with camera calibration |
| MASt3R | 0.212 | 0.069 | Baseline geometry |
| Prompt-DA | 0.205 | 0.063 | Adds uncertainty estimation |
| Ours w/o Translation | 0.191 | 0.046 | Scaling only |
| Ours Full | 0.172 | 0.027 | Scaling + translation |
Ablation Study
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | Notes |
|---|---|---|---|---|
| Stage 1 rendering | 24.844 | 0.794 | 0.296 | Auxiliary layer rendering at coarse stage only |
| Stage 2 initial color | 27.308 | 0.856 | 0.169 | Warped color after geometric refinement |
| Stage 2 full splatting | 28.703 | 0.889 | 0.169 | Complete pipeline |
Key Findings
- Per-pixel translation learning is critical for eliminating point map alignment errors (Pred→GT CD reduced from 0.191 to 0.172).
- The 3D refinement module corrects holes and artifacts from Stage 1.
- Residual color learning and the splatting mechanism further improve rendering quality.
- The method remains competitive on Mobile data (alternating zoom scenarios).
- Fully self-supervised training without 3D ground truth still outperforms methods that rely on 3D supervision, such as DUSt3R.
Highlights & Insights
- Self-supervised scale recovery: Camera intrinsic embedding and extrinsic projection are elegantly exploited to learn the canonical-to-metric affine transformation without any 3D supervision.
- Gaussian Plane design: Anchoring Gaussians on the target-view image plane avoids redundancy from dual source-view point maps.
- Coarse-to-fine geometric strategy: 2D affine coarse alignment followed by 3D cost-volume refinement progressively improves geometric accuracy.
- Chamfer distance regularization: CD computed in 6D space (position + color) simultaneously constrains geometric and appearance consistency.
- Practical multi-camera support: A single affine module is shared across camera types; only one rendering module per camera type needs to be trained.
Limitations & Future Work
- Foreground–background boundary floaters: MASt3R may produce floaters at human silhouette boundaries, which the refinement module cannot correct since these regions are observed by only one view.
- Only stereo input is supported; scenarios with more than two views are not explored.
- The method has a strong dependency on the pretrained MASt3R model.
- The smaller performance gap relative to MVSplat on Mobile data indicates room for improvement in zoom-variant scenarios.
- Camera calibration information is required, limiting applicability in uncalibrated settings.
Related Work & Insights
- DUSt3R/MASt3R: Pioneering works on point map representations; Splat-SAP builds upon them to resolve the scale ambiguity.
- GPS-Gaussian/GPS-Gaussian+: Precursor feed-forward stereo Gaussian methods, but require dense view overlap.
- NoPoSplat/Splatt3R: Leverage point maps for static-scene rendering but lack stereo constraints.
- ENeRF: A feed-forward method combining cost volumes with NeRF; Splat-SAP adopts its depth probability regression strategy.
- Insight: The combination of point maps and stereo matching appears to be a promising paradigm for sparse-view human rendering.
Rating
- Novelty: ⭐⭐⭐⭐ — Self-supervised scale recovery and the Gaussian Plane design are original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple camera types with dual evaluation of rendering and geometry.
- Writing Quality: ⭐⭐⭐⭐ — The two-stage structure is clearly presented, though some details require consulting the supplementary material.
- Value: ⭐⭐⭐⭐⭐ — Direct practical value for real-time applications such as telepresence and sports broadcasting.