CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image¶
Conference: ICCV 2025 arXiv: 2412.12906 Code: Project Page Authors: Wonseok Roh, Hwanhee Jung, Jong Wook Kim et al. (Korea University, Google, Purdue University) Area: 3D Vision / Novel View Synthesis / 3DGS Keywords: Single-view 3D reconstruction, 3D Gaussian Splatting, vision-language model, text guidance, spatial guidance, point cloud features
TL;DR¶
This paper proposes CATSplat, a generalizable Transformer framework for feed-forward single-view 3DGS reconstruction. It enhances image features via dual cross-attention with VLM text embeddings (contextual prior) and 3D point cloud features (spatial prior), achieving consistent improvements over Flash3D in PSNR/SSIM/LPIPS on RE10K and other datasets, along with strong cross-dataset generalization.
Background & Motivation¶
- Generalizable feed-forward methods based on 3DGS (pixelSplat, MVSplat) have succeeded in multi-view settings by exploiting cross-view correspondence, but the single-view setting suffers from severe information scarcity.
- Flash3D pioneered single-view 3DGS feed-forward reconstruction (using a foundation depth estimation model), yet this setting remains underexplored.
- Multi-view methods can leverage physical techniques such as triangulation to obtain 3D cues, which are unavailable in the single-view case.
- Core Insight: Information from sources beyond visual cues must be incorporated—specifically, textual semantics and 3D geometric priors.
Core Problem¶
How to compensate for the extreme information deficiency inherent to single-image input by introducing textual context and 3D spatial priors, so as to achieve high-quality, generalizable 3DGS reconstruction?
Method¶
Overall Architecture¶
- Input single-view image \(\mathcal{I} \in \mathbb{R}^{H \times W \times 3}\)
- A pretrained monocular depth estimation model (UniDepth) predicts depth map \(D\)
- Image and depth map are concatenated → ResNet encoder → multi-scale image features \(F_i^{\mathcal{I}}\)
- VLM (LLaVA) generates a single-sentence scene description → intermediate text embeddings \(F^C\) are extracted
- Depth map is back-projected into a 3D point cloud \(P\) → PointNet encoder → 3D spatial features \(F^S\)
- Multi-resolution Transformer (3 layers) sequentially applies:
  - Cross-attention: \(F_i^{\mathcal{I}} \times F_i^C\) → contextual fusion
  - Cross-attention: result \(\times F_i^S\) → spatial fusion
  - Self-attention: feature refinement
- ResNet decoder → predicts per-pixel Gaussian parameters \(\{\mu_j, \alpha_j, \Sigma_j, c_j\}\)
- Rasterization renders novel views
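For concreteness, here is a minimal PyTorch-style sketch of this feed-forward pass. Everything below is an illustrative assumption (module names, feature dimension, the flat Gaussian-parameter head), not the authors' implementation; the text and point tokens are assumed to be pre-projected to the image feature dimension.

```python
import torch
import torch.nn as nn

class CATSplatSketch(nn.Module):
    """Sketch of the single-view pipeline: enhanced image tokens -> per-pixel
    Gaussian parameters. The ResNet / PointNet encoders are omitted."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ctx_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # image x text
        self.spa_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # image x 3D points
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # refinement
        # per-pixel head: depth offset, 3D offset, opacity, rotation (quat), scale, color
        self.gaussian_head = nn.Linear(dim, 1 + 3 + 1 + 4 + 3 + 3)

    def forward(self, img_tokens, text_tokens, point_tokens):
        # img_tokens: (B, H*W, dim); text_tokens: (B, N_c, dim); point_tokens: (B, N_s, dim)
        x, _ = self.ctx_attn(img_tokens, text_tokens, text_tokens)   # contextual fusion
        x, _ = self.spa_attn(x, point_tokens, point_tokens)          # spatial fusion
        x, _ = self.self_attn(x, x, x)                               # self-attention refinement
        return self.gaussian_head(x)                                 # raw Gaussian parameters
```

A faithful implementation would wrap each attention stage with the \(\gamma\)-gated Add & Norm described under Key Design 3 and repeat the block at each of the three feature resolutions.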
Key Design 1: Contextual Prior via Text Guidance¶
- A pretrained VLM (LLaVA) generates a one-sentence scene description from the input image.
- The intermediate text embeddings \(F^C \in \mathbb{R}^{N_c \times D^C}\) (rather than the final text output) are utilized, preserving rich multimodal semantic information.
- Text features are softly fused into image features via cross-attention: Q from image features, K/V from text features.
- The text embeddings encode scene type (e.g., kitchen), object identity (e.g., refrigerator, oven), spatial relationships, etc.—providing semantic bias for reasoning about occluded regions.
- Prompt ablation: a single-sentence description works best, outperforming scene-type labels, object lists, and multi-sentence descriptions (which can introduce exaggerated or spurious details).
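A self-contained sketch of this contextual cross-attention stage follows; the dimensions (e.g., `txt_dim=4096` for LLaVA-style hidden states) and the single projection layer are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ContextualCrossAttention(nn.Module):
    """Fuses intermediate VLM text embeddings into image features:
    queries from image tokens, keys/values from text embeddings."""
    def __init__(self, img_dim=256, txt_dim=4096, heads=8):
        super().__init__()
        self.to_q = nn.Linear(img_dim, img_dim)
        self.to_kv = nn.Linear(txt_dim, 2 * img_dim)   # project text embeddings to image dim
        self.attn = nn.MultiheadAttention(img_dim, heads, batch_first=True)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, H*W, img_dim) flattened image tokens
        # txt_feat: (B, N_c, txt_dim) intermediate VLM embeddings of the scene description
        q = self.to_q(img_feat)
        k, v = self.to_kv(txt_feat).chunk(2, dim=-1)
        fused, _ = self.attn(q, k, v)
        return fused
```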
Key Design 2: Spatial Prior via 3D Point Cloud¶
- The 2D depth map is back-projected into a 3D point cloud using the camera intrinsics: \(p = d \cdot K^{-1}\tilde{u}\), where \(\tilde{u}\) is the homogeneous pixel coordinate and \(d\) its predicted depth.
- A PointNet encoder extracts 3D features \(F^S \in \mathbb{R}^{N_s \times D^S}\) from the point cloud.
- A second cross-attention fuses 3D features (Q from context-enhanced image features, K/V from 3D features).
- Superiority over conventional 2D depth usage: ablations demonstrate that cross-attention on 3D point features >> cross-attention on 2D depth features >> simple depth concatenation.
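Below is a minimal sketch of the back-projection that produces the point cloud \(P\) (pinhole model, camera-frame coordinates, no distortion); the subsequent PointNet encoding and cross-attention follow the same query/key/value pattern as the contextual stage above.

```python
import torch

def backproject_depth(depth, K):
    """Back-project a depth map into a camera-frame point cloud: p = d * K^{-1} [u, v, 1]^T.
    depth: (H, W) metric depth; K: (3, 3) intrinsics. Returns (H*W, 3) points."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ torch.linalg.inv(K).T                                    # K^{-1} [u, v, 1]^T
    return rays * depth.reshape(-1, 1)                                    # scale rays by depth
```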
Key Design 3: Ratio \(\gamma\) for Fusion Strength Control¶
- A ratio \(\gamma\) is introduced in the Add & Norm step to control the contribution of prior information: \(\tilde{F}_i = \text{Norm}(F_i^{\mathcal{I}} + \gamma \cdot \text{Dropout}(F_i^{\mathcal{I}CS}))\)
- This prevents the original visual features from being overwhelmed by the prior signals.
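A small sketch of this gated Add & Norm; the dropout rate and the value of \(\gamma\) are placeholders, and whether \(\gamma\) is fixed or learned is not specified here.

```python
import torch.nn as nn

class GatedAddNorm(nn.Module):
    """Add & Norm with a ratio gamma scaling the prior-enhanced features F^{ICS}
    before they are added back to the original image features F^{I}."""
    def __init__(self, dim=256, gamma=0.5, p_drop=0.1):
        super().__init__()
        self.gamma = gamma                 # placeholder value, not the paper's setting
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_img, f_ics):
        # F_tilde = Norm(F^I + gamma * Dropout(F^ICS))
        return self.norm(f_img + self.gamma * self.dropout(f_ics))
```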
Gaussian Parameter Prediction¶
- Center \(\mu\): A depth offset \(\delta\) is predicted to refine the estimated depth \(\tilde{d} = d + \delta\); a 3D offset \(\Delta_j\) is added after back-projection for fine alignment.
- Opacity \(\alpha\): constrained to \([0,1]\) via sigmoid.
- Covariance \(\Sigma\): rotation matrix \(R\) and scaling matrix \(S\) are predicted, \(\Sigma = RSS^TR^T\).
- Color \(c\): spherical harmonic coefficients.
- Loss: \(\mathcal{L} = \lambda_{\ell1}\mathcal{L}_{\ell1} + \lambda_{ssim}\mathcal{L}_{ssim} + \lambda_{lpips}\mathcal{L}_{lpips}\)
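A sketch of the covariance construction \(\Sigma = RSS^TR^T\) from a predicted quaternion and per-axis scales, in line with standard 3DGS practice; the exact parameterization and activations (quaternion normalization, `exp` on scales) are assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def build_covariance(quat, log_scale):
    """Sigma = R S S^T R^T from predicted rotations and scales.
    quat: (N, 4) unnormalized quaternions (w, x, y, z); log_scale: (N, 3)."""
    q = F.normalize(quat, dim=-1)                  # unit quaternion -> valid rotation
    w, x, y, z = q.unbind(-1)
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)
    S = torch.diag_embed(log_scale.exp())          # positive per-axis scales
    M = R @ S
    return M @ M.transpose(-1, -2)                 # R S (R S)^T = R S S^T R^T
```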
Key Experimental Results¶
Main Results (RE10K, comparison with single-view methods; n is the frame offset between input and target views)¶
| Method | n=5 PSNR | n=10 PSNR | Random PSNR |
|---|---|---|---|
| Flash3D | 28.46 | 25.94 | 24.93 |
| CATSplat | 29.09 | 26.44 | 25.45 |
Interpolation / Extrapolation (RE10K)¶
- Interpolation: 25.23 dB (vs. Flash3D 23.87), narrowing the gap with two-view methods (pixelSplat 26.09).
- Extrapolation: 25.35 dB, surpassing all two-view methods (e.g., MVSplat 23.04); that a single input image beats two-image methods here validates the effectiveness of the added priors.
Cross-Dataset Generalization (trained on RE10K → zero-shot evaluation)¶
| Target Dataset | Flash3D PSNR | CATSplat PSNR |
|---|---|---|
| NYUv2 (indoor) | 25.09 | 25.57 |
| ACID (outdoor) | 24.28 | 24.73 |
| KITTI (driving) | 21.96 | 22.43 |
Ablation Study¶
| Configuration | Random PSNR | Random LPIPS |
|---|---|---|
| Baseline (no prior) | 25.02 | 0.159 |
| + Contextual prior | 25.40 | 0.153 |
| + Spatial prior | 25.42 | 0.153 |
| + Both | 25.45 | 0.151 |
User Study¶
- RE10K: 88.42% of users prefer CATSplat (vs. Flash3D 11.58%)
- ACID: 91.41% prefer CATSplat
Highlights & Insights¶
- Intermediate VLM embeddings are more informative than final text outputs: intermediate representations in the multimodal alignment space retain richer signals than raw textual descriptions.
- 3D point features >> 2D depth features: back-projecting depth into a point cloud and encoding it via PointNet followed by cross-attention significantly outperforms simple 2D depth concatenation.
- Extrapolation is a strength of single-view methods: multi-view methods excel at interpolation but struggle with extrapolation (due to reliance on cross-view correspondence), whereas single-view methods augmented with priors actually surpass two-view methods on extrapolation.
- Text prompt format matters: a single-sentence description is optimal; descriptions that are too long or too short both degrade performance.
- Iterative cross-attention yields cumulative gains: applying cross-attention across all 3 layers outperforms applying it in only 1–2 layers.
Limitations & Future Work¶
- Performance on occluded and truncated regions remains limited (as acknowledged by the authors).
- Training is conducted solely on RE10K; limited data diversity may restrict practical applicability, and extending to more datasets could improve generalization.
- VLM inference (LLaVA forward pass) introduces additional computational overhead, affecting real-time feasibility.
- Temporal consistency and video sequence settings are not explored.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The dual-prior combination of intermediate VLM embeddings and 3D point cloud features for single-view 3DGS is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multi-dataset, multi-metric evaluation with detailed ablations and a user study; experiments are very thorough.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with well-designed ablations.
- Value: ⭐⭐⭐⭐ — The idea of using VLM embeddings as 3D priors and the paradigm of fusing multimodal priors via cross-attention are worth adopting.