OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction¶
Conference: AAAI 2026 arXiv: 2601.04984 Code: oceansplat.github.io Area: 3D Vision Keywords: 3D Gaussian Splatting, underwater scene reconstruction, trinocular view consistency, depth regularization, scattering medium
TL;DR¶
This paper proposes OceanSplat, which achieves high-fidelity underwater 3D Gaussian Splatting scene reconstruction under scattering media through trinocular view consistency constraints, synthetic epipolar depth priors, and depth-aware alpha adjustment, significantly reducing floating artifacts and surpassing existing methods.
Background & Motivation¶
Underwater scene reconstruction is essential for marine robotics tasks such as seabed mapping, ecological monitoring, and underwater infrastructure inspection. However, the optical properties of underwater environments — wavelength-dependent attenuation, scattering, and low illumination — severely degrade visual cues, posing significant challenges for vision-based scene reconstruction.
Limitations of Prior Work:
NeRF-based methods (SeaThru-NeRF, etc.): embed underwater physical models into volumetric rendering, but implicit representations hinder precise geometric understanding and suffer from slow rendering speeds.
3DGS-based methods (SeaSplat, WaterSplatting, etc.): while rendering is fast, medium intensity is often absorbed into the 3D Gaussians, leading to extensive floating artifacts, entanglement between 3D Gaussians and the scattering medium, and degraded reconstruction quality.
Key Challenge: In scattering media, view-dependent sampling in alpha-blending leads to multi-view inconsistency, causing 3D Gaussians to erroneously represent the water volume itself rather than scene objects, producing floating artifacts.
Key Insights:
- Drawing on the advantage of multi-baseline stereo over single-baseline stereo, the paper extends binocular consistency to trinocular consistency (horizontal + vertical virtual viewpoints), providing orthogonal constraints.
- Self-supervised depth priors are generated via triangulation between virtual viewpoints.
- Depth-aware alpha adjustment suppresses 3D Gaussians in medium regions during early training.
Method¶
Overall Architecture¶
OceanSplat builds upon the 3DGS framework, initializing 3D Gaussians with SfM and modeling underwater medium properties (attenuation, backscattering, medium color) via an MLP. Four key modules are introduced during training: trinocular view consistency, synthetic epipolar depth prior, depth residual loss, and depth-aware alpha adjustment.
The underwater image formation model decomposes the observed image into attenuated object color and backscattering: \(C = C^{obj} \cdot e^{-\sigma^{attn} \cdot z} + C^{\infty} \cdot (1 - e^{-\sigma^{bs} \cdot z})\)
Object and medium rendering are accumulated separately via alpha-blending, enabling object–medium disentanglement.
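The formation model above can be sketched directly. This is a minimal per-pixel illustration, not the paper's rendering code; all coefficient values are made up, and the per-channel coefficients are an assumption based on the wavelength-dependent attenuation the paper describes.

```python
import numpy as np

def underwater_composite(c_obj, sigma_attn, sigma_bs, c_inf, z):
    """Compose an observed underwater color from the formation model:
    attenuated object color plus depth-dependent backscatter."""
    transmittance = np.exp(-sigma_attn * z)              # object signal surviving to the camera
    backscatter = c_inf * (1.0 - np.exp(-sigma_bs * z))  # medium veiling light
    return c_obj * transmittance + backscatter

# A reddish object seen through 5 m of water; red attenuates fastest.
c = underwater_composite(
    c_obj=np.array([0.8, 0.5, 0.3]),
    sigma_attn=np.array([0.45, 0.15, 0.10]),  # per-channel attenuation (R, G, B)
    sigma_bs=np.array([0.30, 0.12, 0.08]),
    c_inf=np.array([0.1, 0.4, 0.6]),          # water color at infinity
    z=5.0,
)
```

As \(z \to \infty\) the composite converges to the water color \(C^{\infty}\), which is exactly why distant Gaussians can end up "absorbing" the medium if left unconstrained.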
Key Designs¶
- Trinocular View Consistency
Mechanism: Two virtual viewpoints \(P_h\) and \(P_v\) (horizontal and vertical) are generated from the original camera pose \(P_c\), and consistency across all three views is enforced to regularize the spatial positions of 3D Gaussians.
Virtual viewpoints are constructed via translation: \(P_h = \begin{bmatrix} \mathbb{I} & \mathbf{t}_h \\ \mathbf{0}^\top & 1 \end{bmatrix} P_c, \quad P_v = \begin{bmatrix} \mathbb{I} & \mathbf{t}_v \\ \mathbf{0}^\top & 1 \end{bmatrix} P_c\) where \(\mathbf{t}_h = (b_h, 0, 0)^\top\), \(\mathbf{t}_v = (0, b_v, 0)^\top\).
After rendering images from virtual viewpoints, disparity maps are computed from depth maps to perform inverse warping, aligning virtual-viewpoint images to the center view: \(d_h(x,y) = \frac{f_h \cdot b_h}{D_c(x,y)}, \quad d_v(x,y) = \frac{f_v \cdot b_v}{D_c(x,y)}\)
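The virtual-pose construction and the disparity computation can be sketched as follows. This is an illustrative fragment under the assumption of 4x4 world-to-camera pose matrices; the warping itself and all numeric values are toy examples, not the paper's implementation.

```python
import numpy as np

def virtual_pose(P_c, t):
    """Translate a 4x4 world-to-camera pose P_c by t (in camera coordinates),
    matching the [[I, t], [0, 1]] @ P_c construction."""
    T = np.eye(4)
    T[:3, 3] = t
    return T @ P_c

def disparity(depth, f, b):
    """Standard stereo disparity d = f * b / D, used for inverse warping."""
    return f * b / np.clip(depth, 1e-6, None)

# Illustrative values, with baselines b_h = 1.5 * b_v as in the paper.
b_v = 0.2
b_h = 1.5 * b_v
P_c = np.eye(4)
P_h = virtual_pose(P_c, np.array([b_h, 0.0, 0.0]))  # horizontal virtual view
P_v = virtual_pose(P_c, np.array([0.0, b_v, 0.0]))  # vertical virtual view
D_c = np.full((4, 4), 2.0)                # toy center-view depth map (meters)
d_h = disparity(D_c, f=500.0, b=b_h)      # horizontal disparity in pixels
```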
The consistency loss comprises three components:
- Object stereo consistency \(L_{obj\text{-}stereo}\): R-L1 loss between the warped object image and the center-view object image.
- Full stereo consistency \(L_{full\text{-}stereo}\): R-L1 loss between the synthesized full image and the ground truth.
- Disparity smoothness \(L_{smooth}\): edge-aware disparity regularization.
Design Motivation: Single-baseline stereo provides constraints in only one direction. Orthogonal horizontal and vertical baselines yield stronger spatial constraints and better resolve geometric ambiguities in scattering media. \(b_v\) is sampled from \([-0.4, 0.4]\) and \(b_h = 1.5 b_v\), using unequal baselines to increase constraint diversity.
- Synthetic Epipolar Depth Prior
Mechanism: Self-supervised depth priors \(D_{epi}\) are derived via triangulation between virtual viewpoints, requiring no external depth supervision.
Specific steps:
- Select 3D Gaussians within the trinocular view-frustum intersection with opacity \(> \tau_\alpha\).
- Project the selected Gaussians onto the image planes of \(P_h\) and \(P_v\).
- Set up a linear system \(\mathbf{A}_i \tilde{\mathbf{X}}_i = \mathbf{0}\) via epipolar geometry.
- Solve for the triangulated points by least squares, transform them to the center camera coordinate system, and take the z-component as the depth prior.
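The least-squares step is standard DLT triangulation. The following sketch shows how the linear system \(\mathbf{A}_i \tilde{\mathbf{X}}_i = \mathbf{0}\) is assembled and solved via SVD; the camera setup is a toy example, not the paper's actual virtual-view geometry.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: stack the projection constraints from two
    3x4 camera matrices into A X = 0 and take the SVD null vector."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]        # dehomogenize; the z-component is the depth prior

# Toy check: two horizontally shifted pinhole cameras observing (0, 0, 4).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.3], [0.0], [0.0]])])
X_true = np.array([0.0, 0.0, 4.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
X_est = triangulate(P1, P2, x1, x2)
```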
An edge-aware Log-L1 loss is applied: \(L_{epi} = \frac{1}{HW}\sum_{x,y}\sum_{k}\log(1 + |D_c' - D_{epi}|) \cdot e^{-|\nabla_k I_c|}\)
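A minimal single-channel sketch of this edge-aware Log-L1 loss, assuming \(k\) indexes the two image-gradient directions and \(D_c'\) is the rendered center-view depth; function and variable names are illustrative.

```python
import numpy as np

def epipolar_depth_loss(D_render, D_epi, I_c):
    """Edge-aware Log-L1 depth loss: log(1 + |D' - D_epi|), downweighted
    where the center image I_c has strong gradients (likely depth edges)."""
    log_l1 = np.log1p(np.abs(D_render - D_epi))
    loss = 0.0
    for grad in np.gradient(I_c):   # sum over gradient directions k (y, x)
        loss += np.mean(log_l1 * np.exp(-np.abs(grad)))
    return loss
```

The exponential term lets the depth prior disagree with the rendering near image edges, where triangulated depth is least reliable.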
Design Motivation: Geometric cues in underwater scenes are limited and external depth models may be inaccurate. Using geometric relationships between self-generated virtual viewpoints provides a self-consistent depth constraint, avoiding external dependencies.
- Depth Residual Loss
Constrains the z-component of each 3D Gaussian to be consistent with the alpha-blending rendered depth: \(L_{res} = \frac{1}{N'}\sum_{i=1}^{N'}|D_c(\mathbf{x}_i) - z_i|\)
This prevents 3D Gaussians from spreading excessively along rays, reducing floating artifacts.
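A sketch of this residual, assuming each selected Gaussian carries its camera-space depth \(z_i\) and an integer projected pixel location; the lookup and names are illustrative, not the paper's rasterizer interface.

```python
import numpy as np

def depth_residual_loss(z_gauss, pix_xy, D_c):
    """L1 residual between each Gaussian's camera-space z and the rendered
    depth at its projected pixel, discouraging drift along the viewing ray."""
    d = D_c[pix_xy[:, 1], pix_xy[:, 0]]   # rendered depth at projected pixels
    return np.mean(np.abs(d - z_gauss))
```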
- Depth-aware Alpha Adjustment
During early training (\(t < t_\alpha\)), an MLP adjusts the opacity of each 3D Gaussian based on depth and viewing direction: \(\alpha_i' = (1-w)\alpha_i + w \cdot \phi_\alpha(\alpha_i, z_i, \vec{\mathbf{v}}_i)\)
After the transition step \(t_\alpha\), the weight \(w\) decays to zero, eliminating inference overhead.
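The blend-and-decay mechanism can be sketched as below. The linear decay schedule and the stand-in for \(\phi_\alpha\) are assumptions for illustration; the paper uses an MLP, and the exact decay shape is not reproduced here.

```python
import numpy as np

def adjusted_alpha(alpha, z, view_dir, t, t_alpha, phi_alpha):
    """Blend the raw opacity with an adjusted one during early training;
    the blend weight w decays to zero by step t_alpha, so the module
    contributes nothing at inference time."""
    w = max(0.0, 1.0 - t / t_alpha)       # assumed linear decay
    return (1.0 - w) * alpha + w * phi_alpha(alpha, z, view_dir)

# Illustrative stand-in for the MLP: damp opacity for far (medium-like) points.
phi = lambda a, z, v: a * np.exp(-0.1 * z)
a_early = adjusted_alpha(0.8, 10.0, None, t=0, t_alpha=1000, phi_alpha=phi)
a_late = adjusted_alpha(0.8, 10.0, None, t=1000, t_alpha=1000, phi_alpha=phi)
```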
Design Motivation: In scattering media, misplaced 3D Gaussians absorb medium color contributions. Suppressing the opacity of such Gaussians during early training encourages their pruning, preventing medium-induced artifacts at the source.
Loss & Training¶
- \(L_{photo}\): weighted R-L1 + R-SSIM (\(\lambda_s = 0.2\))
- \(\lambda_{tri} = 0.1\), \(\lambda_{res} = 0.01\)
- \(\lambda_{epi}\) annealed from 0.4 to 0.2
- Training steps: 7K/3K (densification/fine-tuning) for SeaThru-NeRF data; 10K/5K for In-the-Wild data
- Progressive resolution training: \(1/4 \to 1/2 \to\) full resolution
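The two training schedules above can be sketched as simple step functions. The linear anneal shape and the equal-thirds resolution split are assumptions; the paper states only the endpoints (0.4 to 0.2) and the resolution stages.

```python
def lambda_epi(step, total_steps, start=0.4, end=0.2):
    """Anneal the epipolar-prior weight from 0.4 to 0.2 (linear, assumed)."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac

def resolution_scale(step, total_steps):
    """Progressive resolution 1/4 -> 1/2 -> full; split points are illustrative."""
    if step < total_steps / 3:
        return 0.25
    if step < 2 * total_steps / 3:
        return 0.5
    return 1.0
```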
Key Experimental Results¶
Main Results¶
Real underwater scenes (SeaThru-NeRF + In-the-Wild):
| Dataset | Metric | OceanSplat | WaterSplatting | SeaSplat | Gain vs. WaterSplatting |
|---|---|---|---|---|---|
| Curaçao | PSNR | 34.56 | 32.32 | 29.77 | +2.24 |
| Panama | PSNR | 32.74 | 31.71 | 28.65 | +1.03 |
| J.G-Redsea | PSNR | 25.35 | 24.77 | 23.07 | +0.58 |
| IUI3-Redsea | PSNR | 30.17 | 29.84 | 27.23 | +0.33 |
| Coral | PSNR | 29.15 | 28.19 | 28.41 | +0.96 |
| Composite | PSNR | 26.39 | 25.47 | 26.22 | +0.92 |
Average PSNR surpasses WaterSplatting by 1.05 dB and SeaThru-NeRF-NS by 2.88 dB.
Simulated scattering scenes (underwater + fog):
| Scene | Metric | OceanSplat | WaterSplatting | SeaSplat |
|---|---|---|---|---|
| Underwater-NVS | PSNR | 28.80 | 28.12 | 15.62 |
| Fog-NVS | PSNR | 29.12 | 28.45 | 27.52 |
| Underwater-Restoration | SSIM | 0.768 | 0.748 | 0.719 |
| Fog-Restoration | SSIM | 0.791 | 0.770 | 0.744 |
Ablation Study¶
| Configuration | PSNR | SSIM | LPIPS | Note |
|---|---|---|---|---|
| Full Model | 34.56 | 0.961 | 0.113 | Complete model |
| w/o \(L_{res}\) | 34.30 | 0.960 | 0.115 | Depth residual loss is effective |
| w/o \(L_{epi}\) | 33.82 | 0.959 | 0.120 | Epipolar depth prior contributes significantly |
| w/o \(L_{tri}\) | 33.20 | 0.957 | 0.115 | Trinocular consistency contributes most (−1.36 dB) |
| w/o \(\alpha^d\) | 33.90 | 0.960 | 0.116 | Depth-aware alpha adjustment is effective |
Efficiency comparison: training in 19 minutes (vs. SeaThru-NeRF at 18h25m), inference at 85.67 FPS, GPU memory 7.6 GB.
Key Findings¶
- Trinocular consistency is the most important component (PSNR drops 1.36 dB upon removal).
- The epipolar depth prior ranks second in contribution (−0.74 dB).
- Depth-aware alpha adjustment is notably effective in suppressing medium artifacts.
- All of the above components are self-supervised, requiring no external depth ground truth or annotations.
Highlights & Insights¶
- Well-motivated geometric extension to trinocular: Compared to binocular methods that provide only horizontal constraints, adding a vertical virtual viewpoint introduces orthogonal constraints with a solid theoretical basis in stereo geometry.
- Fully self-supervised depth regularization: The synthetic epipolar depth prior is derived from triangulation between the model's own virtual viewpoints, relying on no external depth model and achieving a self-consistent geometric constraint.
- Object–medium disentanglement: Effective geometric constraints promote separation between 3D Gaussians and the scattering medium, improving both reconstruction quality and scene restoration (dewatering/defogging).
- Preventive strategy via early alpha adjustment: Rather than correcting artifacts after they emerge, this approach suppresses potentially problematic 3D Gaussians from the outset of training.
Limitations & Future Work¶
- Each iteration requires additional rasterization (virtual viewpoint rendering) and least-squares solving, making training slightly longer than WaterSplatting (19 min vs. 10 min).
- The virtual viewpoint baseline lengths \(b_h, b_v\) are empirically determined and may be sensitive to varying scene scales.
- Validation is currently limited to static underwater scenes; dynamic scenarios (water currents, bubbles) are not addressed.
- The scattering model remains simplified and does not account for complex wavelength-dependent scattering effects.
Related Work & Insights¶
- WaterSplatting (2024): A hybrid method combining implicit medium with explicit objects; the primary comparison baseline for this paper.
- SeaSplat (2024): Incorporates underwater physics into 3D Gaussian Splatting but lacks sufficient geometric constraints.
- StereoGS (2024, Han et al.): Regularizes 3DGS with binocular stereo consistency; this paper extends the idea to trinocular.
- Insight: The paradigm of constructing constraints via virtual viewpoints can be generalized to 3D reconstruction in other degraded scenes (fog, smoke, dust).
Rating¶
- Novelty: ⭐⭐⭐⭐ (Trinocular extension is well-motivated; self-supervised depth prior is cleverly designed)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Real + simulated, NVS + restoration, complete ablations, detailed efficiency comparison)
- Writing Quality: ⭐⭐⭐⭐⭐ (Complete mathematical derivations, clear illustrations, thorough explanation of physical motivation)
- Value: ⭐⭐⭐⭐ (Significant advancement in underwater scene reconstruction; self-supervised design offers strong practicality)