# SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Conference: CVPR 2026 | arXiv: 2603.27516 | Code: https://github.com/GrumpySloths/SGS_Intrinsic.github.io | Area: 3D Vision | Keywords: inverse rendering, sparse-view, Gaussian splatting, material decomposition, indoor scenes
## TL;DR
SGS-Intrinsic proposes a two-stage indoor inverse rendering framework. Stage I constructs a geometrically consistent dense Gaussian field guided by semantic and geometric priors. Stage II performs material–illumination decomposition via a hybrid lighting model and material priors, with a dedicated de-shadowing module to prevent shadow baking into albedo.
## Background & Motivation
Sparse-view indoor inverse rendering is a severely ill-posed problem: supervision signals are scarce, indoor illumination is complex (near-field and high-frequency), and material and lighting are strongly coupled. Existing methods either focus solely on geometric reconstruction without material decomposition, assume distant light sources (unsuitable for indoor scenes), or fail under sparse-view conditions.
Three core challenges: (1) unreliable Gaussian geometry reconstruction under sparse views; (2) difficulty modeling near-field high-frequency indoor illumination; (3) cast shadows being incorrectly baked into material appearance.
## Method

### Overall Architecture
Two stages: Stage I initializes a dense point cloud via VGGT and builds a high-quality Gaussian geometry field supervised by normal and semantic priors; Stage II performs inverse rendering on this foundation using a hybrid lighting model (environment map + spherical Gaussian mixture), diffusion-based material priors, and a de-shadowing module.
### Key Designs
- Prior-Guided Dense Geometry Reconstruction:
- Function: Establish reliable Gaussian geometric foundations under sparse views.
- Mechanism: VGGT replaces traditional SfM to obtain dense scene layout point clouds. StableNormal provides normal supervision \(\mathcal{L}_{normal} = 1 - \hat{n}^T n_m\), and LSEG provides semantic supervision. A semantic consistency constraint between training views and virtual views is additionally introduced to prevent overfitting.
- Design Motivation: Traditional SfM yields only sparse point clouds under sparse views, insufficient to support Gaussian optimization. Dense priors from pretrained models compensate for the lack of supervision.
- Hybrid Lighting Model + Material Priors:
- Function: Accurately model complex indoor illumination and enable effective material–lighting decomposition.
- Mechanism: An environment map captures distant ambient light, while a spherical Gaussian mixture (SGM) models near-field high-frequency illumination. Diffusion model-based material priors enforce cross-view and cross-illumination material consistency, yielding illumination- and view-invariant material reconstruction.
- Design Motivation: A single lighting model lacks flexibility; the hybrid scheme handles different frequency components separately. Material priors help resolve the inherent material–lighting ambiguity.
- Lightweight De-shadowing Module:
- Function: Prevent cast shadows from being incorrectly baked into albedo.
- Mechanism: A lightweight de-shadowing model explicitly models visibility, attributing dark regions to occlusion rather than material properties. Combined with illumination-invariant material consistency constraints, this ensures the same material produces consistent albedo under varying lighting conditions.
- Design Motivation: Indoor scenes contain complex shadows; without explicit handling, the optimizer tends to absorb shadow effects into material color.
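The hybrid lighting design above can be sketched concretely. The following is a minimal illustration only, assuming a standard spherical Gaussian lobe \(\mu\,e^{\lambda(d\cdot\xi - 1)}\) and a nearest-neighbor equirectangular environment-map lookup; all function names and the parameterization are assumptions, not the paper's implementation.

```python
# Illustrative sketch of the hybrid lighting model: an environment map
# for distant ambient light plus a spherical Gaussian mixture (SGM)
# for near-field high-frequency illumination. Names are hypothetical.
import numpy as np

def eval_sg(direction, axis, sharpness, amplitude):
    """Standard spherical Gaussian lobe: mu * exp(lambda * (d . xi - 1))."""
    return amplitude * np.exp(sharpness * (np.dot(direction, axis) - 1.0))

def hybrid_radiance(direction, env_map, sg_lobes):
    """Incoming radiance = distant env-map term + sum of SG lobes."""
    # Nearest-neighbor equirectangular lookup for the distant component.
    theta = np.arccos(np.clip(direction[2], -1.0, 1.0))        # polar angle
    phi = np.arctan2(direction[1], direction[0]) % (2 * np.pi)  # azimuth
    h, w, _ = env_map.shape
    row = min(int(theta / np.pi * h), h - 1)
    col = min(int(phi / (2 * np.pi) * w), w - 1)
    radiance = env_map[row, col].astype(np.float64)
    # Near-field, high-frequency component from the SG mixture.
    for axis, sharpness, amplitude in sg_lobes:
        radiance += eval_sg(direction, axis, sharpness, amplitude)
    return radiance

# Toy usage: a constant gray environment plus one lobe aligned with +z.
env = np.full((8, 16, 3), 0.1)
lobes = [(np.array([0.0, 0.0, 1.0]), 20.0, np.array([1.0, 0.8, 0.6]))]
L = hybrid_radiance(np.array([0.0, 0.0, 1.0]), env, lobes)  # brightest toward +z
```

The split mirrors the design motivation: the low-frequency environment term stays cheap and smooth, while the compact SG lobes can place sharp, localized light anywhere in the scene.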
### Loss & Training
Stage I: RGB reconstruction loss + normal loss + semantic consistency loss. Stage II: PBR rendering loss + material consistency loss + de-shadowing regularization.
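A minimal sketch of how the Stage I terms might be combined. Only the normal term's form, \(\mathcal{L}_{normal} = 1 - \hat{n}^T n_m\), is given in the summary above; the loss weights, the L1 photometric term, and the MSE semantic-feature term are illustrative assumptions.

```python
# Hypothetical Stage I loss combination; weights and the photometric /
# semantic terms are assumptions, only the normal term follows the paper.
import numpy as np

def normal_loss(pred_normals, prior_normals):
    """L_normal = mean(1 - n_hat . n_m) over pixels, per the summary."""
    dots = np.sum(pred_normals * prior_normals, axis=-1)
    return np.mean(1.0 - dots)

def stage1_loss(rgb_pred, rgb_gt, n_pred, n_prior, sem_pred, sem_prior,
                w_rgb=1.0, w_normal=0.1, w_sem=0.1):
    """Stage I: RGB reconstruction + normal + semantic consistency."""
    l_rgb = np.mean(np.abs(rgb_pred - rgb_gt))    # L1 photometric (assumed)
    l_norm = normal_loss(n_pred, n_prior)          # StableNormal supervision
    l_sem = np.mean((sem_pred - sem_prior) ** 2)  # feature MSE (assumed)
    return w_rgb * l_rgb + w_normal * l_norm + w_sem * l_sem
```

Stage II would follow the same pattern, swapping in the PBR rendering loss, the diffusion-based material consistency term, and the de-shadowing regularizer.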
## Key Experimental Results

### Main Results
| Method | InteriorVerse NVS PSNR | Albedo Accuracy | Note |
|---|---|---|---|
| GeoSplat | Low | Low | Insufficient geometry |
| IRGS | Medium | Medium | Limited lighting model |
| SGS-Intrinsic | Best | Best | Outperforms across all metrics |
The proposed method achieves state-of-the-art performance on novel view synthesis and inverse rendering metrics across benchmark datasets.
### Ablation Study
| Configuration | NVS Quality | Material Decomposition | Note |
|---|---|---|---|
| w/o prior guidance | Significant drop | Poor | Unreliable geometry degrades downstream |
| w/o hybrid lighting | — | Degraded | Insufficient near-field modeling |
| w/o de-shadowing | — | Shadow baking | Albedo contaminated by shadows |
| Full model | Best | Best | All components necessary |
### Key Findings
- Dense initialization via VGGT is the critical foundation for success under sparse views—high-quality geometry is a prerequisite for high-quality inverse rendering.
- The de-shadowing module yields substantial improvement in albedo estimation quality; shadow baking is the dominant source of material estimation error in indoor scenes.
- Semantic consistency constraints effectively prevent overfitting under sparse-view conditions.
## Highlights & Insights
- Rationale for two-stage decoupling: Geometry and material decomposition have a clear dependency—stabilizing geometry before decomposing materials is more robust than end-to-end joint optimization.
- De-shadowing as an explicit module: Modeling shadows explicitly rather than leaving them for the optimizer to handle implicitly is a simple yet critical design choice.
- Pretrained models as prior sources: The combined use of StableNormal, LSEG, VGGT, and diffusion models demonstrates an effective strategy for compensating data scarcity with rich priors in sparse-view settings.
## Limitations & Future Work
- Reliance on multiple pretrained models (VGGT, StableNormal, LSEG, diffusion model) results in high system complexity.
- Limited capacity to handle non-Lambertian materials such as specular or transparent surfaces.
- Two-stage training is less efficient than end-to-end approaches.
- Future work may explore reducing dependence on prior models or unifying them into a single model.
## Related Work & Insights
- vs. GeoSplat/IRGS: Both are 3DGS-based inverse rendering methods; SGS-Intrinsic achieves superior results through stronger priors and explicit de-shadowing.
- vs. NeRF-based inverse rendering: The explicit representation of 3DGS enables more straightforward disentanglement of PBR attributes.
- vs. single-image inverse rendering: Multi-view methods inherently provide 3D consistency, though sparse views introduce additional challenges.
## Rating
- Novelty: ⭐⭐⭐⭐ Each module is well-designed and the de-shadowing idea is valuable, though the overall contribution is largely a combination of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive benchmark comparisons and clear ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Method description is systematic and clear.
- Value: ⭐⭐⭐⭐ Direct practical value for indoor AR/VR applications.