TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update¶
Conference: ICCV 2025 arXiv: 2507.11069 Code: Project Page Area: 3D Vision Keywords: 2D Gaussian Splatting, transparent object depth reconstruction, sparse-view, physics simulation, scene update
TL;DR¶
This paper proposes TRAN-D, a 2D Gaussian Splatting-based method for sparse-view transparent object depth reconstruction. It employs segmentation-guided object-aware losses to optimize Gaussian distributions in occluded regions, and leverages physics simulation (MPM) to enable dynamic scene updates after object removal, requiring only a single image for scene refresh.
Background & Motivation¶
- Transparent object depth reconstruction is a longstanding challenge in computer vision: Due to physical properties such as reflection and refraction, accurate depth estimation of transparent objects is difficult for both traditional ToF sensors and neural rendering methods.
- Limitations of prior work:
- NeRF-based methods (Dex-NeRF, NFL, Residual-NeRF) require large numbers of training images and long training times; Residual-NeRF additionally depends on background images.
- GS-based methods (TranSplat, TransparentGS) achieve high quality but still require dense multi-view inputs.
- Sparse-view NVS methods (InstantSplat, FSGS) rely on 3D foundation models that exhibit generalization bias toward transparent objects, failing to properly distinguish them from the background.
- Dynamic scene update problem remains unsolved: When objects are moved or removed, existing methods require re-scanning the entire scene, which is time-consuming.
- Key insight: Separating transparent objects from the background and performing focused optimization on the corresponding Gaussians is critical.
Core Problem¶
- How to accurately reconstruct the depth of transparent objects from sparse views (only 6 images)?
- How to handle occluded regions (surfaces invisible from any viewpoint) and prevent Gaussian overfitting?
- After object removal, how to efficiently update the scene representation (without re-scanning) and handle cascading motions caused by the removal?
Method¶
Overall Architecture¶
TRAN-D consists of three modules:

1. Transparent object segmentation module: instance segmentation of transparent objects based on a fine-tuned Grounded SAM.
2. Object-aware 2DGS module: joint optimization of 2D Gaussians using segmentation masks and object-index one-hot vectors, combined with object-aware 3D losses to handle occluded regions.
3. Scene update module: MPM (Material Point Method) physics simulation to predict cascading motions after object removal, requiring only a single bird's-eye-view image for scene refresh.
Key Designs¶
- Transparent Object Segmentation (Fine-tuned Grounded SAM):
- A non-lexical text prompt "786dvpteg" is used to represent "transparent object," avoiding mis-segmentation caused by generic terms such as "glass" or "transparent."
- Only the image backbone (Grounding DINO) is fine-tuned; the text backbone (BERT) is frozen. Training runs for 1 epoch on the synthetic TRansPose dataset.
- All transparent objects are unified into a single category and assigned a unique identifier as the category-specific prompt.
- Consistent instance segmentation masks across multiple views are ensured.
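To make the segmentation step above concrete, here is a minimal sketch assuming the public Grounded-SAM API (groundingdino + segment-anything); the fine-tuned checkpoint name is hypothetical, and the authors' actual inference code may differ.

```python
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

PROMPT = "786dvpteg"  # non-lexical token standing in for "transparent object"

# "dino_transpose_ft.pth" is a hypothetical name for the fine-tuned weights
dino = load_model("GroundingDINO_SwinT_OGC.py", "dino_transpose_ft.pth")
image_source, image = load_image("view_0.png")
boxes, logits, phrases = predict(
    model=dino, image=image, caption=PROMPT,
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; SAM expects pixel xyxy.
h, w, _ = image_source.shape
xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), "cxcywh", "xyxy").numpy()

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks = [predictor.predict(box=b, multimask_output=False)[0] for b in xyxy]  # one instance mask per box
```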
- Segmentation Mask Rendering and Object Index One-Hot Rendering:
- Each Gaussian \(\mathcal{G}_i\) is assigned a color vector \(\mathbf{m}_i \in \mathbb{R}^3\) representing its associated object.
- The rendering equation follows the same form as color rendering: \(\mathbf{m}(x) = \sum_i \mathbf{m}_i \alpha_i \hat{\mathcal{G}}_i(u(x)) \prod_{j=1}^{i-1}\bigl(1-\alpha_j \hat{\mathcal{G}}_j(u(x))\bigr)\)
- An object index one-hot vector \(\mathbf{o}_i \in \mathbb{R}^{N+1}\) (N objects + 1 background) is also maintained; after softmax normalization, it is optimized using dice loss.
- Joint optimization of segmentation masks prevents the opacity of Gaussians corresponding to transparent objects from collapsing to zero during training.
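For reference, a per-ray version of this compositing rule (shared by color, mask, and one-hot rendering) might look like the following minimal PyTorch sketch; the actual renderer is a batched CUDA rasterizer.

```python
import torch

def composite_along_ray(values: torch.Tensor, alphas: torch.Tensor, gauss: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of per-Gaussian attributes on one ray.

    values: (K, C) per-Gaussian attribute, sorted near-to-far: RGB color,
            the 3-d mask color m_i, or the softmaxed (N+1)-d one-hot o_i.
    alphas: (K,) per-Gaussian opacity alpha_i.
    gauss:  (K,) the 2D Gaussian weight G_i(u(x)) evaluated at the pixel.
    """
    a = alphas * gauss                                        # effective alpha of each splat
    trans = torch.cumprod(1.0 - a, dim=0)                     # prod_{j<=i} (1 - a_j)
    trans = torch.cat([torch.ones_like(a[:1]), trans[:-1]])   # shift so the product runs over j < i
    return (a * trans).unsqueeze(-1).mul(values).sum(dim=0)   # sum_i a_i * T_i * v_i
```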
- Object-aware 3D Loss (Core Innovation):
- Problem: Sparse views combined with occlusion result in extremely weak gradients in certain regions, making it impossible to optimize the corresponding Gaussians using only view-space positional gradients.
- Solution: A hierarchical loss based on 3D distance.
- The \(n_g\) mutually farthest Gaussians are selected as group centers; each group comprises its center's \(n_n\) nearest-neighbor Gaussians.
- Distance variance loss \(\mathcal{L}_d = \text{Var}(d_1, ..., d_{n_g})\): encourages uniform spacing between group centers, pulling the high-variance distances in occluded regions toward the stable distances observed in visible regions.
- Local density loss \(\mathcal{L}_S = \text{Var}(S_1, ..., S_{n_g})\): encourages consistent intra-group density, attracting Gaussians toward sparse regions.
- Three-level hierarchical grouping strategy: \((n_g, n_n) = (16,16), (32,16), (64,32)\), adapting to the variation in Gaussian count across different optimization stages.
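A minimal sketch of the two losses for one object's Gaussians, assuming farthest point sampling for the group centers and using nearest-center distance and mean kNN distance as proxies for \(d_i\) and \(S_i\) (the paper's exact definitions may differ):

```python
import torch

def object_aware_loss(xyz: torch.Tensor, n_g: int, n_n: int):
    """xyz: (M, 3) centers of the Gaussians belonging to one object."""
    # greedy farthest point sampling for the n_g group centers (assumed selection scheme)
    centers = [xyz[0]]
    d = torch.cdist(xyz, xyz[0:1]).squeeze(1)
    for _ in range(n_g - 1):
        i = int(torch.argmax(d))
        centers.append(xyz[i])
        d = torch.minimum(d, torch.cdist(xyz, xyz[i:i + 1]).squeeze(1))
    centers = torch.stack(centers)                      # (n_g, 3)

    # L_d: variance of nearest-center distances d_1..d_{n_g}
    cc = torch.cdist(centers, centers)
    cc.fill_diagonal_(float("inf"))
    loss_d = cc.min(dim=1).values.var()

    # L_S: variance of intra-group spread S_1..S_{n_g}
    # (mean distance to the n_n nearest Gaussians, as a density proxy)
    knn = torch.cdist(centers, xyz).topk(n_n, largest=False).values  # (n_g, n_n)
    loss_S = knn.mean(dim=1).var()
    return loss_d, loss_S
```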
- Physics Simulation-based Scene Update (MPM):
- After object removal, the corresponding Gaussians are identified and deleted via the object index one-hot vectors.
- A mesh is generated from the depth map rendered by 2D Gaussians, and used as input for physics simulation.
- MPM implemented in Taichi simulates cascading motions following object removal (100 time steps).
- Material parameters: Young's modulus \(5 \times 10^4\) Pa, Poisson's ratio 0.4.
- After simulation, 100 iterations of Gaussian re-optimization are performed (omitting the object-aware loss), requiring only a single bird's-eye-view image.
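Putting the update pipeline above together, a hedged sketch follows; `render_depth`, `depth_to_mesh`, `run_mpm`, `apply_motion`, and `optimize` are hypothetical stand-ins for the corresponding stages, not the authors' API.

```python
def update_scene(gaussians, removed_obj_id, bev_image, bev_camera):
    # 1. delete the removed object's Gaussians via the one-hot channel
    obj_id = gaussians.one_hot.softmax(dim=-1).argmax(dim=-1)       # (M,)
    gaussians = gaussians[obj_id != removed_obj_id]

    # 2. mesh the remaining objects from a rendered depth map
    depth = render_depth(gaussians, bev_camera)                     # hypothetical
    mesh = depth_to_mesh(depth, bev_camera)                         # hypothetical

    # 3. MPM (Taichi) predicts cascading motion over 100 time steps,
    #    with E = 5e4 Pa and nu = 0.4 as in the paper
    moved = run_mpm(mesh, youngs_modulus=5e4, poisson_ratio=0.4, steps=100)

    # 4. move the Gaussians with their simulated particles, then run
    #    100 re-optimization iterations on the single bird's-eye-view
    #    image, with the object-aware loss omitted
    gaussians = apply_motion(gaussians, moved)                      # hypothetical
    return optimize(gaussians, [bev_image], iters=100, object_aware=False)
```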
Loss & Training¶
The total loss function is \(\mathcal{L} = a_\text{color}\mathcal{L}_c + a_\text{mask}\mathcal{L}_m + a_\text{one-hot}\mathcal{L}_\text{one-hot} + \mathcal{L}_\text{obj}\), where:
- \(\mathcal{L}_c\): RGB reconstruction loss (L1 + D-SSIM), \(a_\text{color} = 0.5\)
- \(\mathcal{L}_m\): segmentation mask loss (L1 + D-SSIM), \(a_\text{mask} = 0.5\)
- \(\mathcal{L}_\text{one-hot}\): Dice loss, \(a_\text{one-hot} = 1.0\)
- \(\mathcal{L}_\text{obj}\): object-aware 3D loss, aggregating \(\mathcal{L}_S\) (\(a_S=10000/3\)) and \(\mathcal{L}_d\) (\(a_d=1/3\)) across all levels and objects
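Assembled, the total loss might be computed as below. This is a sketch: `l1`, `dssim`, and `dice` are hypothetical helpers, `object_aware_loss` is the sketch from the previous section, and the 0.8/0.2 split inside the L1 + D-SSIM terms is an assumption borrowed from the standard 3DGS recipe.

```python
def total_loss(render, gt, mask_render, mask_gt, onehot_render, onehot_gt, obj_xyzs):
    L_c = 0.8 * l1(render, gt) + 0.2 * dssim(render, gt)             # split is an assumption
    L_m = 0.8 * l1(mask_render, mask_gt) + 0.2 * dssim(mask_render, mask_gt)
    L_onehot = dice(onehot_render.softmax(dim=-1), onehot_gt)
    L_obj = 0.0
    for xyz in obj_xyzs:                                             # per-object Gaussian centers
        for n_g, n_n in [(16, 16), (32, 16), (64, 32)]:              # three hierarchy levels
            loss_d, loss_S = object_aware_loss(xyz, n_g, n_n)
            L_obj = L_obj + (1 / 3) * loss_d + (10000 / 3) * loss_S
    return 0.5 * L_c + 0.5 * L_m + 1.0 * L_onehot + L_obj
```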
Training details:

- Initialized from random points (no dependency on SfM or 3D foundation models).
- One-hot learning rate decays from 0.1 to 0.0025 over 1,000 iterations.
- GPU: NVIDIA RTX 2080 Ti.
- Scene update requires only 100 iterations of optimization.
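The paper gives only the endpoints of the one-hot learning-rate schedule; assuming an exponential decay (the shape used by the standard 3DGS position schedule), it could be implemented as:

```python
def onehot_lr(step: int, lr0: float = 0.1, lr1: float = 0.0025, max_steps: int = 1000) -> float:
    """Exponential interpolation from lr0 to lr1; the decay shape is an assumption."""
    t = min(step / max_steps, 1.0)
    return lr0 * (lr1 / lr0) ** t
```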
Key Experimental Results¶
| Dataset | Metric | TRAN-D | Prev. SOTA (TranSplat) | Gain |
|---|---|---|---|---|
| TRansPose (t=0) | MAE | 0.0380 | 0.0632 | 39.9%↓ |
| TRansPose (t=0) | δ<2.5cm | 69.11% | 43.01% | +26.1 pp |
| TRansPose (t=1) | δ<2.5cm | 48.46% | 31.62% | 1.53× |
| ClearPose (t=0) | MAE | 0.0461 | 0.0905 | 49.1%↓ |
| ClearPose (t=0) | δ<2.5cm | 54.38% | 31.95% | +22.4 pp |
Efficiency comparison (average over 19 scenes):
| Method | t=0 Total Training Time | t=1 Total Training Time | # Gaussians (t=0) |
|---|---|---|---|
| TRAN-D | 54.1s | 13.8s | 33.5k |
| InstantSplat | 78.8s | 95.5s | 850.1k |
| TranSplat | 596.0s | 612.7s | 297.8k |
| 2DGS | 440.9s | 447.6s | 227.8k |
Ablation Study¶
- Object-aware Loss: Removing this loss increases MAE from 0.0419 to 0.0447 (t=0) and RMSE from 0.1059 to 0.1136, and the Gaussian count rises slightly (35,983 vs. 33,482), indicating that the loss reduces overfitting while keeping the representation compact.
- Physics Simulation: Removing physics simulation increases t=1 MAE from 0.0886 to 0.0891; without simulation, objects remain at their original Z-axis positions, leading to overfitting to training images and loss of object shape.
- Number of Views: MAE at 3/6/12 views is 0.0405/0.0419/0.0448 respectively (marginal differences), demonstrating strong robustness to sparse views; accuracy does not improve with more views.
- TRAN-D with 3 views (MAE 0.0405) outperforms InstantSplat with 12 views (MAE 0.2062) by a large margin.
Highlights & Insights¶
- Introducing physics simulation into transparent object scene reconstruction: Elegantly resolves the cascading motion problem after object removal, avoiding re-scanning.
- Object-aware 3D Loss is elegantly designed: Without relying on any additional network, variance constraints on 3D distances enable Gaussians to spontaneously cover occluded regions — simple yet effective.
- Initialization from random points: Entirely eliminates dependency on SfM or 3D foundation models, yielding better performance due to the generalization bias of such models toward transparent objects.
- Exceptional efficiency: t=0 requires only 54 seconds; t=1 scene update requires only 13.8 seconds (including physics simulation); Gaussian count is merely 33k (vs. 850k for InstantSplat).
- Scene update with a single image: Combined with physics simulation, a single bird's-eye-view image suffices to update the scene, achieving roughly 1.5× the accuracy (δ<2.5cm) of a 6-image baseline.
- Using the non-lexical word "786dvpteg" as the text prompt for transparent objects is a clever engineering trick.
Limitations & Future Work¶
- Heavy dependence on segmentation quality: Segmentation failures (tracking failure, strong illumination, blurry boundaries) directly cause reconstruction and physics simulation failures.
- Limited to partial object removal or minor motions: More complex dynamic scenarios (e.g., arbitrary object addition or large-scale displacement) cannot be handled.
- Only objects are rendered, not the background: Comparisons with other methods are not entirely fair in this regard (though this is consistent with the task definition).
- Future directions: Developing segmentation-free methods; handling more complex dynamics and lighting conditions; extending to more diverse real-world scenes.
Related Work & Insights¶
| Method | Type | View Requirement | Transparent Object Specialized | Dynamic Scene | Training Time |
|---|---|---|---|---|---|
| Dex-NeRF | NeRF | Dense | ✓ | ✗ | Slow |
| NFL | NeRF | Dense | ✓ | ✗ | Slow |
| TranSplat | 3DGS+Diffusion | Dense | ✓ | ✗ | Medium |
| TransparentGS | 3DGS+BSDF | Dense | ✓ | ✗ | Slow |
| InstantSplat | 3DGS+3D Foundation Model | Sparse | ✗ | ✗ | Fast |
| FSGS | 3DGS+Foundation Model | Sparse | ✗ | ✗ | Slow |
| TRAN-D | 2DGS+Physics Simulation | Sparse | ✓ | ✓ | Extremely Fast |
- The paradigm of combining physics simulation with neural representations deserves attention: reconstruct with neural methods first, use a physics engine to reason about dynamic changes, then fine-tune with minimal data. This pipeline extends to a wide range of dynamic scene understanding tasks.
- Segmentation-guided Gaussian optimization, treating segmentation information as an additional channel that is jointly splatted, is a highly generalizable idea applicable to any GS task requiring object-level control.
- 3D spatial regularization as an alternative to auxiliary networks is more robust in data-scarce settings than relying on pre-trained depth or 3D foundation models.
- The method is directly applicable to robotic manipulation scenarios involving grasping of transparent objects.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The idea of using physics simulation for scene update is novel, and the object-aware 3D loss is cleverly designed; however, the overall framework is a combination of existing components (2DGS + Grounded SAM + MPM).
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on both synthetic and real datasets, complete ablation studies, and clear efficiency comparisons; the inability to quantitatively evaluate on real scenes is a limitation.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, method descriptions are detailed, and figures are of high quality; the overall structure is well-organized.
- Value: ⭐⭐⭐⭐ — Practically valuable for robotic manipulation of transparent objects; the extremely fast speed and robustness to sparse views are practically useful; however, strong dependence on segmentation limits generalizability.