TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update¶
Conference: ICCV 2025 arXiv: 2507.11069 Code: Project Page Area: 3D Vision Keywords: 2D Gaussian Splatting, transparent object depth reconstruction, sparse-view, physics simulation, scene update
TL;DR¶
This paper proposes TRAN-D, a 2D Gaussian Splatting-based method for sparse-view transparent object depth reconstruction. It employs segmentation-guided object-aware losses to optimize Gaussian distributions in occluded regions, and leverages physics simulation (MPM) to enable dynamic scene updates after object removal, requiring only a single image for scene refresh.
Background & Motivation¶
- Transparent object depth reconstruction is a longstanding challenge in computer vision: Due to physical properties such as reflection and refraction, accurate depth estimation of transparent objects is difficult for both traditional ToF sensors and neural rendering methods.
- Limitations of prior work:
- NeRF-based methods (Dex-NeRF, NFL, Residual-NeRF) require large numbers of training images and long training times; Residual-NeRF additionally depends on background images.
- GS-based methods (TranSplat, TransparentGS) achieve high quality but still require dense multi-view inputs.
- Sparse-view NVS methods (InstantSplat, FSGS) rely on 3D foundation models that exhibit generalization bias toward transparent objects, failing to properly distinguish them from the background.
- Dynamic scene update problem remains unsolved: When objects are moved or removed, existing methods require re-scanning the entire scene, which is time-consuming.
- Key insight: Separating transparent objects from the background and performing focused optimization on the corresponding Gaussians is critical.
Core Problem¶
- How to accurately reconstruct the depth of transparent objects from sparse views (only 6 images)?
- How to handle occluded regions (surfaces invisible from any viewpoint) and prevent Gaussian overfitting?
- After object removal, how to efficiently update the scene representation (without re-scanning) and handle cascading motions caused by the removal?
Method¶
Overall Architecture¶
TRAN-D consists of three modules:

1. Transparent object segmentation module: instance segmentation of transparent objects based on a fine-tuned Grounded SAM.
2. Object-aware 2DGS module: joint optimization of 2D Gaussians using segmentation masks and object-index one-hot vectors, combined with object-aware 3D losses to handle occluded regions.
3. Scene update module: MPM (Material Point Method) physics simulation to predict cascading motions after object removal, requiring only a single bird's-eye-view image for scene refresh.
Key Designs¶
- Transparent Object Segmentation (Fine-tuned Grounded SAM):
- A non-lexical text prompt "786dvpteg" is used to represent "transparent object," avoiding mis-segmentation caused by generic terms such as "glass" or "transparent."
- Only the image backbone (Grounding DINO) is fine-tuned; the text backbone (BERT) is frozen. Training runs for 1 epoch on the synthetic TRansPose dataset.
- All transparent objects are unified into a single category and assigned a unique identifier as the category-specific prompt.
- Consistent instance segmentation masks across multiple views are ensured.
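To make the segmentation step above concrete, here is a minimal sketch assuming the public Grounded-SAM API (groundingdino + segment-anything); the fine-tuned checkpoint name is hypothetical, and the authors' actual inference code may differ.

```python
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

PROMPT = "786dvpteg"  # non-lexical token standing in for "transparent object"

# "dino_transpose_ft.pth" is a hypothetical name for the fine-tuned weights
dino = load_model("GroundingDINO_SwinT_OGC.py", "dino_transpose_ft.pth")
image_source, image = load_image("view_0.png")
boxes, logits, phrases = predict(
    model=dino, image=image, caption=PROMPT,
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; SAM expects pixel xyxy.
h, w, _ = image_source.shape
xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), "cxcywh", "xyxy").numpy()

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks = [predictor.predict(box=b, multimask_output=False)[0] for b in xyxy]  # one instance mask per box
```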
- Segmentation Mask Rendering and Object Index One-Hot Rendering:
- Each Gaussian \(\mathcal{G}_i\) is assigned a color vector \(\mathbf{m}_i \in \mathbb{R}^3\) representing its associated object.
- The rendering equation follows the same form as color rendering: \(\mathbf{m}(x) = \sum_i \mathbf{m}_i \alpha_i \hat{\mathcal{G}}_i(u(x)) \prod_{j=1}^{i-1}\bigl(1-\alpha_j \hat{\mathcal{G}}_j(u(x))\bigr)\)
- An object index one-hot vector \(\mathbf{o}_i \in \mathbb{R}^{N+1}\) (N objects + 1 background) is also maintained; after softmax normalization, it is optimized using dice loss.
- Joint optimization of segmentation masks prevents the opacity of Gaussians corresponding to transparent objects from collapsing to zero during training.
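For reference, a per-ray version of this compositing rule (shared by color, mask, and one-hot rendering) might look like the following minimal PyTorch sketch; the actual renderer is a batched CUDA rasterizer.

```python
import torch

def composite_along_ray(values: torch.Tensor, alphas: torch.Tensor, gauss: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of per-Gaussian attributes on one ray.

    values: (K, C) per-Gaussian attribute, sorted near-to-far: RGB color,
            the 3-d mask color m_i, or the softmaxed (N+1)-d one-hot o_i.
    alphas: (K,) per-Gaussian opacity alpha_i.
    gauss:  (K,) the 2D Gaussian weight G_i(u(x)) evaluated at the pixel.
    """
    a = alphas * gauss                                        # effective alpha of each splat
    trans = torch.cumprod(1.0 - a, dim=0)                     # prod_{j<=i} (1 - a_j)
    trans = torch.cat([torch.ones_like(a[:1]), trans[:-1]])   # shift so the product runs over j < i
    return (a * trans).unsqueeze(-1).mul(values).sum(dim=0)   # sum_i a_i * T_i * v_i
```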
- Object-aware 3D Loss (Core Innovation):
- Problem: Sparse views combined with occlusion result in extremely weak gradients in certain regions, making it impossible to optimize the corresponding Gaussians using only view-space positional gradients.
- Solution: A hierarchical loss based on 3D distance.
- The \(n_g\) mutually farthest Gaussians are selected as group centers; each group comprises its center's \(n_n\) nearest-neighbor Gaussians.
- Distance variance loss \(\mathcal{L}_d = \text{Var}(d_1, ..., d_{n_g})\): encourages uniform spacing between group centers, pulling the high-variance distances in occluded regions toward the stable distances observed in visible regions.
- Local density loss \(\mathcal{L}_S = \text{Var}(S_1, ..., S_{n_g})\): encourages consistent intra-group density, attracting Gaussians toward sparse regions.
- Three-level hierarchical grouping strategy: \((n_g, n_n) = (16,16), (32,16), (64,32)\), adapting to the variation in Gaussian count across different optimization stages.
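A minimal sketch of the two losses for one object's Gaussians, assuming farthest point sampling for the group centers and using nearest-center distance and mean kNN distance as proxies for \(d_i\) and \(S_i\) (the paper's exact definitions may differ):

```python
import torch

def object_aware_loss(xyz: torch.Tensor, n_g: int, n_n: int):
    """xyz: (M, 3) centers of the Gaussians belonging to one object."""
    # greedy farthest point sampling for the n_g group centers (assumed selection scheme)
    centers = [xyz[0]]
    d = torch.cdist(xyz, xyz[0:1]).squeeze(1)
    for _ in range(n_g - 1):
        i = int(torch.argmax(d))
        centers.append(xyz[i])
        d = torch.minimum(d, torch.cdist(xyz, xyz[i:i + 1]).squeeze(1))
    centers = torch.stack(centers)                      # (n_g, 3)

    # L_d: variance of nearest-center distances d_1..d_{n_g}
    cc = torch.cdist(centers, centers)
    cc.fill_diagonal_(float("inf"))
    loss_d = cc.min(dim=1).values.var()

    # L_S: variance of intra-group spread S_1..S_{n_g}
    # (mean distance to the n_n nearest Gaussians, as a density proxy)
    knn = torch.cdist(centers, xyz).topk(n_n, largest=False).values  # (n_g, n_n)
    loss_S = knn.mean(dim=1).var()
    return loss_d, loss_S
```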
- Physics Simulation-based Scene Update (MPM):
- After object removal, the corresponding Gaussians are identified and deleted via the object index one-hot vectors.
- A mesh is generated from the depth map rendered by 2D Gaussians, and used as input for physics simulation.
- MPM implemented in Taichi simulates cascading motions following object removal (100 time steps).
- Material parameters: Young's modulus \(5 \times 10^4\) Pa, Poisson's ratio 0.4.
- After simulation, 100 iterations of Gaussian re-optimization are performed (omitting the object-aware loss), requiring only a single bird's-eye-view image.
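Putting the update pipeline above together, a hedged sketch follows; `render_depth`, `depth_to_mesh`, `run_mpm`, `apply_motion`, and `optimize` are hypothetical stand-ins for the corresponding stages, not the authors' API.

```python
def update_scene(gaussians, removed_obj_id, bev_image, bev_camera):
    # 1. delete the removed object's Gaussians via the one-hot channel
    obj_id = gaussians.one_hot.softmax(dim=-1).argmax(dim=-1)       # (M,)
    gaussians = gaussians[obj_id != removed_obj_id]

    # 2. mesh the remaining objects from a rendered depth map
    depth = render_depth(gaussians, bev_camera)                     # hypothetical
    mesh = depth_to_mesh(depth, bev_camera)                         # hypothetical

    # 3. MPM (Taichi) predicts cascading motion over 100 time steps,
    #    with E = 5e4 Pa and nu = 0.4 as in the paper
    moved = run_mpm(mesh, youngs_modulus=5e4, poisson_ratio=0.4, steps=100)

    # 4. move the Gaussians with their simulated particles, then run
    #    100 re-optimization iterations on the single bird's-eye-view
    #    image, with the object-aware loss omitted
    gaussians = apply_motion(gaussians, moved)                      # hypothetical
    return optimize(gaussians, [bev_image], iters=100, object_aware=False)
```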
Loss & Training¶
The total loss function is \(\mathcal{L} = a_\text{color}\mathcal{L}_c + a_\text{mask}\mathcal{L}_m + a_\text{one-hot}\mathcal{L}_\text{one-hot} + \mathcal{L}_\text{obj}\), where:
- \(\mathcal{L}_c\): RGB reconstruction loss (L1 + D-SSIM), \(a_\text{color} = 0.5\)
- \(\mathcal{L}_m\): segmentation mask loss (L1 + D-SSIM), \(a_\text{mask} = 0.5\)
- \(\mathcal{L}_\text{one-hot}\): Dice loss, \(a_\text{one-hot} = 1.0\)
- \(\mathcal{L}_\text{obj}\): object-aware 3D loss, aggregating \(\mathcal{L}_S\) (\(a_S=10000/3\)) and \(\mathcal{L}_d\) (\(a_d=1/3\)) across all levels and objects
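Assembled, the total loss might be computed as below. This is a sketch: `l1`, `dssim`, and `dice` are hypothetical helpers, `object_aware_loss` is the sketch from the previous section, and the 0.8/0.2 split inside the L1 + D-SSIM terms is an assumption borrowed from the standard 3DGS recipe.

```python
def total_loss(render, gt, mask_render, mask_gt, onehot_render, onehot_gt, obj_xyzs):
    L_c = 0.8 * l1(render, gt) + 0.2 * dssim(render, gt)             # split is an assumption
    L_m = 0.8 * l1(mask_render, mask_gt) + 0.2 * dssim(mask_render, mask_gt)
    L_onehot = dice(onehot_render.softmax(dim=-1), onehot_gt)
    L_obj = 0.0
    for xyz in obj_xyzs:                                             # per-object Gaussian centers
        for n_g, n_n in [(16, 16), (32, 16), (64, 32)]:              # three hierarchy levels
            loss_d, loss_S = object_aware_loss(xyz, n_g, n_n)
            L_obj = L_obj + (1 / 3) * loss_d + (10000 / 3) * loss_S
    return 0.5 * L_c + 0.5 * L_m + 1.0 * L_onehot + L_obj
```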
Training details:

- Initialized from random points (no dependency on SfM or 3D foundation models).
- One-hot learning rate decays from 0.1 to 0.0025 over 1,000 iterations.
- GPU: NVIDIA RTX 2080 Ti.
- Scene update requires only 100 iterations of optimization.
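The paper gives only the endpoints of the one-hot learning-rate schedule; assuming an exponential decay (the shape used by the standard 3DGS position schedule), it could be implemented as:

```python
def onehot_lr(step: int, lr0: float = 0.1, lr1: float = 0.0025, max_steps: int = 1000) -> float:
    """Exponential interpolation from lr0 to lr1; the decay shape is an assumption."""
    t = min(step / max_steps, 1.0)
    return lr0 * (lr1 / lr0) ** t
```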
Key Experimental Results¶
| Dataset | Metric | TRAN-D | Prev. SOTA (TranSplat) | Gain |
|---|---|---|---|---|
| TRansPose (t=0) | MAE | 0.0380 | 0.0632 | 39.9%↓ |
| TRansPose (t=0) | δ<2.5cm | 69.11% | 43.01% | +26.1 pp |
| TRansPose (t=1) | δ<2.5cm | 48.46% | 31.62% | 1.53× |
| ClearPose (t=0) | MAE | 0.0461 | 0.0905 | 49.1%↓ |
| ClearPose (t=0) | δ<2.5cm | 54.38% | 31.95% | +22.4 pp |
Efficiency comparison (average over 19 scenes):
| Method | t=0 Total Training Time | t=1 Total Training Time | # Gaussians (t=0) |
|---|---|---|---|
| TRAN-D | 54.1s | 13.8s | 33.5k |
| InstantSplat | 78.8s | 95.5s | 850.1k |
| TranSplat | 596.0s | 612.7s | 297.8k |
| 2DGS | 440.9s | 447.6s | 227.8k |
Ablation Study¶
- Object-aware Loss: Removing this loss increases MAE from 0.0419 to 0.0447 (t=0) and RMSE from 0.1059 to 0.1136, and the Gaussian count rises slightly (35,983 vs. 33,482), indicating that the loss reduces overfitting while keeping the representation compact.
- Physics Simulation: Removing physics simulation increases t=1 MAE from 0.0886 to 0.0891; without simulation, objects remain at their original Z-axis positions, leading to overfitting to training images and loss of object shape.
- Number of Views: MAE at 3/6/12 views is 0.0405/0.0419/0.0448 respectively (marginal differences), demonstrating strong robustness to sparse views; accuracy does not improve with more views.
- TRAN-D with 3 views (MAE 0.0405) outperforms InstantSplat with 12 views (MAE 0.2062) by a large margin.
Highlights & Insights¶
- Introducing physics simulation into transparent object scene reconstruction: Elegantly resolves the cascading motion problem after object removal, avoiding re-scanning.
- Object-aware 3D Loss is elegantly designed: Without relying on any additional network, variance constraints on 3D distances enable Gaussians to spontaneously cover occluded regions — simple yet effective.
- Initialization from random points: Entirely eliminates dependency on SfM or 3D foundation models, yielding better performance due to the generalization bias of such models toward transparent objects.
- Exceptional efficiency: t=0 requires only 54 seconds; t=1 scene update requires only 13.8 seconds (including physics simulation); Gaussian count is merely 33k (vs. 850k for InstantSplat).
- Scene update with a single image: Combined with physics simulation, a single bird's-eye-view image suffices to update the scene, achieving roughly 1.5× the accuracy (δ<2.5cm) of a 6-image baseline.
- Using the non-lexical word "786dvpteg" as the text prompt for transparent objects is a clever engineering trick.
Limitations & Future Work¶
- Heavy dependence on segmentation quality: Segmentation failures (tracking failure, strong illumination, blurry boundaries) directly cause reconstruction and physics simulation failures.
- Limited to partial object removal or minor motions: More complex dynamic scenarios (e.g., arbitrary object addition or large-scale displacement) cannot be handled.
- Only objects are rendered, not the background: Comparisons with other methods are not entirely fair in this regard (though this is consistent with the task definition).
- Future directions: Developing segmentation-free methods; handling more complex dynamics and lighting conditions; extending to more diverse real-world scenes.
Related Work & Insights¶
| Method | Type | View Requirement | Transparent Object Specialized | Dynamic Scene | Training Time |
|---|---|---|---|---|---|
| Dex-NeRF | NeRF | Dense | ✓ | ✗ | Slow |
| NFL | NeRF | Dense | ✓ | ✗ | Slow |
| TranSplat | 3DGS+Diffusion | Dense | ✓ | ✗ | Medium |
| TransparentGS | 3DGS+BSDF | Dense | ✓ | ✗ | Slow |
| InstantSplat | 3DGS+3D Foundation Model | Sparse | ✗ | ✗ | Fast |
| FSGS | 3DGS+Foundation Model | Sparse | ✗ | ✗ | Slow |
| TRAN-D | 2DGS+Physics Simulation | Sparse | ✓ | ✓ | Extremely Fast |
- The paradigm of combining physics simulation with neural representations deserves attention: reconstruct with neural methods first, use a physics engine to reason about dynamic changes, then fine-tune with minimal data. This pipeline extends to a wide range of dynamic scene understanding tasks.
- Segmentation-guided Gaussian optimization, treating segmentation information as an additional channel that is jointly splatted, is a highly generalizable idea applicable to any GS task requiring object-level control.
- 3D spatial regularization as an alternative to auxiliary networks is more robust in data-scarce settings than relying on pre-trained depth or 3D foundation models.
- The method is directly applicable to robotic manipulation scenarios involving grasping of transparent objects.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The idea of using physics simulation for scene update is novel, and the object-aware 3D loss is cleverly designed; however, the overall framework is a combination of existing components (2DGS + Grounded SAM + MPM).
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation on both synthetic and real datasets, complete ablation studies, and clear efficiency comparisons; the inability to quantitatively evaluate on real scenes is a limitation.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, method descriptions are detailed, and figures are of high quality; the overall structure is well-organized.
- Value: ⭐⭐⭐⭐ — Practically valuable for robotic manipulation of transparent objects; the extremely fast speed and robustness to sparse views are practically useful; however, strong dependence on segmentation limits generalizability.