# SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Conference: CVPR 2026 | arXiv: 2603.27516 | Code: https://github.com/GrumpySloths/SGS_Intrinsic.github.io | Area: 3D Vision | Keywords: inverse rendering, sparse-view, Gaussian splatting, material decomposition, indoor scenes
## TL;DR
SGS-Intrinsic proposes a two-stage indoor inverse rendering framework. Stage I constructs a geometrically consistent dense Gaussian field guided by semantic and geometric priors. Stage II performs material–illumination decomposition via a hybrid lighting model and material priors, with a dedicated de-shadowing module to prevent shadow baking into albedo.
## Background & Motivation
Sparse-view indoor inverse rendering is a severely ill-posed problem: supervision signals are scarce, indoor illumination is complex (near-field and high-frequency), and material and lighting are strongly coupled. Existing methods either focus solely on geometric reconstruction without material decomposition, assume distant light sources (unsuitable for indoor scenes), or fail under sparse-view conditions.
Three core challenges: (1) unreliable Gaussian geometry reconstruction under sparse views; (2) difficulty modeling near-field high-frequency indoor illumination; (3) cast shadows being incorrectly baked into material appearance.
## Method

### Overall Architecture
Two stages: Stage I initializes a dense point cloud via VGGT and builds a high-quality Gaussian geometry field supervised by normal and semantic priors; Stage II performs inverse rendering on this foundation using a hybrid lighting model (environment map + spherical Gaussian mixture), diffusion-based material priors, and a de-shadowing module.
### Key Designs
- Prior-Guided Dense Geometry Reconstruction:
- Function: Establish reliable Gaussian geometric foundations under sparse views.
- Mechanism: VGGT replaces traditional SfM to obtain dense scene layout point clouds. StableNormal provides normal supervision \(\mathcal{L}_{normal} = 1 - \hat{n}^T n_m\), and LSEG provides semantic supervision. A semantic consistency constraint between training views and virtual views is additionally introduced to prevent overfitting.
- Design Motivation: Traditional SfM yields only sparse point clouds under sparse views, insufficient to support Gaussian optimization. Dense priors from pretrained models compensate for the lack of supervision.
- Hybrid Lighting Model + Material Priors:
- Function: Accurately model complex indoor illumination and enable effective material–lighting decomposition.
- Mechanism: An environment map captures distant ambient light, while a spherical Gaussian mixture (SGM) models near-field high-frequency illumination. Diffusion model-based material priors enforce cross-view and cross-illumination material consistency, yielding illumination- and view-invariant material reconstruction.
- Design Motivation: A single lighting model lacks flexibility; the hybrid scheme handles different frequency components separately. Material priors help resolve the inherent material–lighting ambiguity.
- Lightweight De-shadowing Module:
- Function: Prevent cast shadows from being incorrectly baked into albedo.
- Mechanism: A lightweight de-shadowing model explicitly models visibility, attributing dark regions to occlusion rather than material properties. Combined with illumination-invariant material consistency constraints, this ensures the same material produces consistent albedo under varying lighting conditions.
- Design Motivation: Indoor scenes contain complex shadows; without explicit handling, the optimizer tends to absorb shadow effects into material color.
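The hybrid lighting design above can be sketched concretely. The following is a minimal illustration only, assuming a standard spherical Gaussian lobe \(\mu\,e^{\lambda(d\cdot\xi - 1)}\) and a nearest-neighbor equirectangular environment-map lookup; all function names and the parameterization are assumptions, not the paper's implementation.

```python
# Illustrative sketch of the hybrid lighting model: an environment map
# for distant ambient light plus a spherical Gaussian mixture (SGM)
# for near-field high-frequency illumination. Names are hypothetical.
import numpy as np

def eval_sg(direction, axis, sharpness, amplitude):
    """Standard spherical Gaussian lobe: mu * exp(lambda * (d . xi - 1))."""
    return amplitude * np.exp(sharpness * (np.dot(direction, axis) - 1.0))

def hybrid_radiance(direction, env_map, sg_lobes):
    """Incoming radiance = distant env-map term + sum of SG lobes."""
    # Nearest-neighbor equirectangular lookup for the distant component.
    theta = np.arccos(np.clip(direction[2], -1.0, 1.0))        # polar angle
    phi = np.arctan2(direction[1], direction[0]) % (2 * np.pi)  # azimuth
    h, w, _ = env_map.shape
    row = min(int(theta / np.pi * h), h - 1)
    col = min(int(phi / (2 * np.pi) * w), w - 1)
    radiance = env_map[row, col].astype(np.float64)
    # Near-field, high-frequency component from the SG mixture.
    for axis, sharpness, amplitude in sg_lobes:
        radiance += eval_sg(direction, axis, sharpness, amplitude)
    return radiance

# Toy usage: a constant gray environment plus one lobe aligned with +z.
env = np.full((8, 16, 3), 0.1)
lobes = [(np.array([0.0, 0.0, 1.0]), 20.0, np.array([1.0, 0.8, 0.6]))]
L = hybrid_radiance(np.array([0.0, 0.0, 1.0]), env, lobes)  # brightest toward +z
```

The split mirrors the design motivation: the low-frequency environment term stays cheap and smooth, while the compact SG lobes can place sharp, localized light anywhere in the scene.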
### Loss & Training
Stage I: RGB reconstruction loss + normal loss + semantic consistency loss. Stage II: PBR rendering loss + material consistency loss + de-shadowing regularization.
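A minimal sketch of how the Stage I terms might be combined. Only the normal term's form, \(\mathcal{L}_{normal} = 1 - \hat{n}^T n_m\), is given in the summary above; the loss weights, the L1 photometric term, and the MSE semantic-feature term are illustrative assumptions.

```python
# Hypothetical Stage I loss combination; weights and the photometric /
# semantic terms are assumptions, only the normal term follows the paper.
import numpy as np

def normal_loss(pred_normals, prior_normals):
    """L_normal = mean(1 - n_hat . n_m) over pixels, per the summary."""
    dots = np.sum(pred_normals * prior_normals, axis=-1)
    return np.mean(1.0 - dots)

def stage1_loss(rgb_pred, rgb_gt, n_pred, n_prior, sem_pred, sem_prior,
                w_rgb=1.0, w_normal=0.1, w_sem=0.1):
    """Stage I: RGB reconstruction + normal + semantic consistency."""
    l_rgb = np.mean(np.abs(rgb_pred - rgb_gt))    # L1 photometric (assumed)
    l_norm = normal_loss(n_pred, n_prior)          # StableNormal supervision
    l_sem = np.mean((sem_pred - sem_prior) ** 2)  # feature MSE (assumed)
    return w_rgb * l_rgb + w_normal * l_norm + w_sem * l_sem
```

Stage II would follow the same pattern, swapping in the PBR rendering loss, the diffusion-based material consistency term, and the de-shadowing regularizer.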
## Key Experimental Results

### Main Results
| Method | InteriorVerse NVS PSNR | Albedo Accuracy | Note |
|---|---|---|---|
| GeoSplat | Low | Low | Insufficient geometry |
| IRGS | Medium | Medium | Limited lighting model |
| SGS-Intrinsic | Best | Best | Outperforms across all metrics |
The proposed method achieves state-of-the-art performance on novel view synthesis and inverse rendering metrics across benchmark datasets.
### Ablation Study
| Configuration | NVS Quality | Material Decomposition | Note |
|---|---|---|---|
| w/o prior guidance | Significant drop | Poor | Unreliable geometry degrades downstream |
| w/o hybrid lighting | — | Degraded | Insufficient near-field modeling |
| w/o de-shadowing | — | Shadow baking | Albedo contaminated by shadows |
| Full model | Best | Best | All components necessary |
### Key Findings
- Dense initialization via VGGT is the critical foundation for success under sparse views—high-quality geometry is a prerequisite for high-quality inverse rendering.
- The de-shadowing module yields substantial improvement in albedo estimation quality; shadow baking is the dominant source of material estimation error in indoor scenes.
- Semantic consistency constraints effectively prevent overfitting under sparse-view conditions.
## Highlights & Insights
- Rationale for two-stage decoupling: Geometry and material decomposition have a clear dependency—stabilizing geometry before decomposing materials is more robust than end-to-end joint optimization.
- De-shadowing as an explicit module: Modeling shadows explicitly rather than leaving them for the optimizer to handle implicitly is a simple yet critical design choice.
- Pretrained models as prior sources: The combined use of StableNormal, LSEG, VGGT, and diffusion models demonstrates an effective strategy for compensating data scarcity with rich priors in sparse-view settings.
## Limitations & Future Work
- Reliance on multiple pretrained models (VGGT, StableNormal, LSEG, diffusion model) results in high system complexity.
- Limited capacity to handle non-Lambertian materials such as specular or transparent surfaces.
- Two-stage training is less efficient than end-to-end approaches.
- Future work may explore reducing dependence on prior models or unifying them into a single model.
## Related Work & Insights
- vs. GeoSplat/IRGS: Both are 3DGS-based inverse rendering methods; SGS-Intrinsic achieves superior results through stronger priors and explicit de-shadowing.
- vs. NeRF-based inverse rendering: The explicit representation of 3DGS enables more straightforward disentanglement of PBR attributes.
- vs. single-image inverse rendering: Multi-view methods inherently provide 3D consistency, though sparse views introduce additional challenges.
## Rating
- Novelty: ⭐⭐⭐⭐ Each module is well-designed and the de-shadowing idea is valuable, though the overall contribution is largely a combination of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive benchmark comparisons and clear ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Method description is systematic and clear.
- Value: ⭐⭐⭐⭐ Direct practical value for indoor AR/VR applications.