
TT-Occ: Test-Time 3D Occupancy Prediction

Conference: CVPR 2026
arXiv: 2503.08485
Code: Xian-Bei/TT-Occ
Area: Autonomous Driving / 3D Occupancy Prediction
Keywords: 3D occupancy prediction, test-time, 3D Gaussian Splatting, vision foundation models, self-supervised, open-vocabulary

TL;DR

This paper proposes TT-Occ, a training-free test-time 3D occupancy prediction framework that integrates vision foundation models (VFMs) at inference time to incrementally construct, refine, and voxelize temporally-aware 3D Gaussians. Without any training, TT-Occ surpasses all self-supervised methods, which require extensive training, on both the Occ3D-nuScenes and nuCraft benchmarks.

Background & Motivation

Importance of 3D Occupancy Prediction: 3D occupancy prediction requires accurately identifying regions occupied by objects of specific categories as well as free space, which is critical for collision-free trajectory planning and reliable navigation in autonomous driving.

High Annotation Cost of Supervised Methods: Existing fully supervised approaches rely heavily on dense per-frame 3D annotations, which are extremely costly to obtain in dynamic driving scenarios (covering an 80m range per frame).

High Training Cost of Self-Supervised Methods: Although self-supervised methods reduce annotation costs, their training overhead remains substantial — for instance, SelfOcc requires approximately 384 GPU hours (8 GPUs for 2 days) at 0.4m resolution on Occ3D-nuScenes.

Poor Generalization: Once a model is trained, adapting it to finer resolutions (e.g., 0.2m on nuCraft) or to novel object categories demands extensive retraining, which limits flexibility.

The Rise of VFMs Changes the Landscape: 3D vision foundation models such as VGGT and MapAnything provide reliable multi-view geometry, while REX-Omni supports open-vocabulary semantic reasoning — all accessible directly at test time without task-specific training.

Core Problem: Given that geometric and semantic information no longer needs to be learned by a network, is training an occupancy prediction model still necessary? TT-Occ answers in the negative.

Method

Overall Architecture: Lift-Track-Voxelize

TT-Occ follows a three-stage Lift-Track-Voxelize pipeline, offering two variants: a LiDAR-based variant (TT-OccLiDAR) and a camera-only variant (TT-OccCamera).

Step 1: Lift — Lifting Geometry and Semantics into 3D Gaussians

Modality-Specific Initialization:
- TT-OccLiDAR: Sparse LiDAR points are directly initialized as 3D Gaussians, inheriting accurate spatial positions from real-world measurements.
- TT-OccCamera: A 3D vision foundation model (VGGT/MapAnything) estimates dense depth maps from multi-view RGB inputs; multi-view triangulation is applied to resolve scale ambiguity.
VFM Semantics: An open-vocabulary segmentation model (OpenSeeD/GroundingSAM2/REX-Omni) extracts semantic maps from \(M\) surround-view images, which are fused into 3D via visibility-weighted projection: \(\mathbf{m}_i = \frac{1}{M}\sum_{m=1}^{M}\mathbb{I}_m(\boldsymbol{\mu}_i)\mathcal{M}_m(\text{Proj}(\boldsymbol{\mu}_i;\mathbf{K}_m,\mathbf{E}_m))\)
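The fusion formula above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `fuse_semantics`, the array layouts, and the choice to treat the visibility indicator \(\mathbb{I}_m\) as a simple in-front-of-camera and inside-image test (no occlusion reasoning) are all assumptions made for the example.

```python
import numpy as np

def fuse_semantics(means, sem_maps, intrinsics, extrinsics):
    """Average per-view class probabilities over M views for each Gaussian center.

    means:      (N, 3) Gaussian centers mu_i in world coordinates.
    sem_maps:   (M, H, W, C) per-pixel class probabilities from a segmentation VFM.
    intrinsics: (M, 3, 3) camera matrices K_m.
    extrinsics: (M, 4, 4) world-to-camera transforms E_m (assumed convention).
    """
    M, H, W, C = sem_maps.shape
    N = means.shape[0]
    fused = np.zeros((N, C))
    homo = np.concatenate([means, np.ones((N, 1))], axis=1)   # (N, 4)
    for m in range(M):
        cam = (extrinsics[m] @ homo.T).T[:, :3]                # points in camera frame
        z = cam[:, 2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)                    # avoid division by <= 0
        pix = (intrinsics[m] @ cam.T).T
        u = np.floor(pix[:, 0] / safe_z).astype(int)
        v = np.floor(pix[:, 1] / safe_z).astype(int)
        # Assumed visibility test: in front of the camera and inside the image.
        visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        fused[visible] += sem_maps[m, v[visible], u[visible]]
    return fused / M                                           # 1/M normalization as in the formula
```

Because the sum is divided by \(M\) rather than by the number of views that actually see a point, Gaussians seen by few cameras end up with smaller fused magnitudes; the sketch keeps that behavior to stay faithful to the formula as written.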
Voxel-Aware Simplification: Scale parameters are constrained with a sigmoid activation (rather than the usual exponential) to prevent excessive growth; redundant Gaussians falling in the same voxel are pruned, and their semantic probabilities are merged.
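A rough sketch of the two ideas in the simplification step is given below. The helper names, the `max_scale` bound, and the choice to merge by averaging positions and probabilities are assumptions for illustration; the summary does not specify the exact merging rule.

```python
import numpy as np

def bounded_scale(raw, max_scale):
    """Map unconstrained parameters into (0, max_scale) with a sigmoid.

    With an exponential activation the scale is unbounded above; the sigmoid
    caps it at max_scale (a hypothetical bound, e.g. tied to the voxel size).
    """
    return max_scale / (1.0 + np.exp(-raw))

def merge_per_voxel(means, sem_probs, voxel_size):
    """Keep one Gaussian per voxel by averaging centers and class probabilities."""
    idx = np.floor(means / voxel_size).astype(np.int64)        # (N, 3) voxel indices
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                              # shape differs across NumPy versions
    counts = np.bincount(inverse, minlength=len(uniq)).astype(float)
    merged_means = np.zeros((len(uniq), 3))
    merged_sem = np.zeros((len(uniq), sem_probs.shape[1]))
    np.add.at(merged_means, inverse, means)                    # unbuffered scatter-add
    np.add.at(merged_sem, inverse, sem_probs)
    return merged_means / counts[:, None], merged_sem / counts[:, None]
```

Averaging keeps each merged probability vector normalized whenever the inputs are normalized, which is convenient for the later voxel-level argmax.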

Step 2: Track — Tracking Dynamic Gaussians

Fast-moving objects (vehicles, pedestrians) are typically only partially observed, and online optimization of 3D Gaussians introduces severe trailing artifacts.

TT-OccLiDAR: A learning-free scene flow estimation approach — LiDAR points are projected onto segmentation masks to associate instances → DBSCAN denoising → cross-frame cluster matching based on spatial/shape similarity → ICP-based 3D motion flow estimation.
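The last stage of the LiDAR pipeline, ICP-based motion estimation between two matched clusters, can be illustrated with a small point-to-point ICP. This is a sketch under simplifying assumptions: instance association, DBSCAN denoising, and cluster matching are taken as already done, nearest neighbours are found by brute force, and the function names are hypothetical.

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares R, t with dst ~= src @ R.T + t for paired points (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iters=20):
    """Point-to-point ICP: alternate nearest-neighbour matching and rigid fitting."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = dst[d2.argmin(axis=1)]                      # closest dst point per src point
        R, t = rigid_fit(cur, nn)
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t  # compose with previous steps
    return R_total, t_total

def rigid_flow(src, R, t):
    """Per-point 3D motion implied by the rigid transform."""
    return src @ R.T + t - src
```

Given a cluster at frame t and its matched cluster at frame t+1, `rigid_flow(cluster_t, *icp(cluster_t, cluster_t1))` yields one flow vector per point, which can then be used to move the corresponding Gaussians instead of leaving a trail behind the object.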
TT-OccCamera: RAFT is used to estimate optical flow; ego-motion flow is subtracted to obtain residual dynamic flow → thresholding yields a dynamic mask identifying moving regions → corresponding 3D Gaussians are excluded from static accumulation (a compromise strategy to avoid noise amplification from 3D back-projection).
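The residual-flow idea for the camera variant can be sketched as below. The optical flow is assumed to be given as an (H, W, 2) array (for example, the output of RAFT); the pose convention, the function names, and the 2-pixel threshold are assumptions chosen for the example rather than values from the paper.

```python
import numpy as np

def ego_flow(depth, K, T_next_from_cur):
    """2D flow that a fully static scene would show under the given camera motion.

    depth: (H, W) metric depth of the current frame.
    K: (3, 3) intrinsics.
    T_next_from_cur: (4, 4) transform taking current-camera points to the next camera.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)             # (H, W, 3) homogeneous pixels
    pts = (pix @ np.linalg.inv(K).T) * depth[..., None]          # back-project to 3D
    pts_next = pts @ T_next_from_cur[:3, :3].T + T_next_from_cur[:3, 3]
    proj = pts_next @ K.T
    uv_next = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)
    return uv_next - pix[..., :2]

def dynamic_mask(optical_flow, depth, K, T_next_from_cur, thresh_px=2.0):
    """Pixels whose observed flow deviates from ego-motion flow by more than thresh_px."""
    residual = optical_flow - ego_flow(depth, K, T_next_from_cur)
    return np.linalg.norm(residual, axis=-1) > thresh_px
```

Gaussians whose projections fall inside the returned mask would be left out of static accumulation, which matches the exclusion strategy described above.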

Step 3: Voxelize — Gaussian Voxelization

Differentiable Optimization: Gaussian parameters are refined at test time via a color consistency loss.
Trilateral Radial Basis Function (TRBF) Smoothing: An optional denoising module that periodically smooths Gaussian parameters by jointly leveraging spatial, color, and semantic affinity: \(\mathcal{K}(i,j) = \mathcal{K}_{\boldsymbol{\mu}}(i,j)\cdot\mathcal{K}_{\mathbf{c}}(i,j)\cdot\mathcal{K}_{\mathbf{m}}(i,j)\)
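The product kernel above can be made concrete with a small dense example. The sketch assumes each factor is a Gaussian RBF with its own bandwidth and smooths only the semantic probabilities; the bandwidth values and the function name are illustrative, and a practical version would restrict the sum to spatial neighbours instead of building an N x N matrix.

```python
import numpy as np

def trbf_smooth(means, colors, sem_probs, sigma_mu=0.5, sigma_c=0.1, sigma_m=0.2):
    """Smooth semantic probabilities with K = K_mu * K_c * K_m (row-normalized).

    means: (N, 3) centers, colors: (N, 3) RGB in [0, 1], sem_probs: (N, C).
    """
    def rbf(x, sigma):
        d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))

    K = rbf(means, sigma_mu) * rbf(colors, sigma_c) * rbf(sem_probs, sigma_m)
    K = K / K.sum(axis=1, keepdims=True)                        # each row sums to 1
    return K @ sem_probs, K
```

Since the three factors multiply, a pair of Gaussians influences each other only when they are close in position, color, and semantics at once, so an isolated or differently colored Gaussian is left almost unchanged.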
Flexible-Resolution Voxelization: Semantic probabilities are aggregated into a discrete occupancy grid weighted by spatial proximity, supporting arbitrary user-specified resolutions.
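A minimal sketch of proximity-weighted voxelization at a user-chosen resolution follows. It uses an isotropic Gaussian weight around each center as a stand-in for the paper's weighting (which the ablation describes as covariance-aware), and the defaults for `sigma`, `min_weight`, and `free_label` are assumptions for the example.

```python
import numpy as np

def voxelize(means, sem_probs, bounds_min, bounds_max, voxel_size,
             sigma=None, min_weight=0.2, free_label=-1):
    """Aggregate Gaussian semantics into a dense label grid of the requested resolution."""
    sigma = 0.5 * voxel_size if sigma is None else sigma
    bounds_min = np.asarray(bounds_min, dtype=float)
    extent = np.asarray(bounds_max, dtype=float) - bounds_min
    dims = np.ceil(extent / voxel_size - 1e-9).astype(int)       # grid size per axis
    grid = np.stack(np.meshgrid(*[np.arange(d) for d in dims], indexing="ij"), axis=-1)
    centers = bounds_min + (grid.reshape(-1, 3) + 0.5) * voxel_size   # (V, 3) voxel centers
    d2 = ((centers[:, None, :] - means[None, :, :]) ** 2).sum(-1)      # (V, N)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                         # spatial-proximity weights
    labels = (w @ sem_probs).argmax(axis=1)                      # weighted class vote
    labels[w.sum(axis=1) < min_weight] = free_label              # too little support: free space
    return labels.reshape(dims)
```

The same set of Gaussians can be voxelized at 0.4m or 0.2m simply by changing `voxel_size`, which is the property that lets a test-time method switch benchmarks without retraining. The dense (V, N) weight matrix is only practical for small scenes; a real implementation would visit only nearby voxels per Gaussian.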

Loss & Training

Color consistency loss (projecting 3D Gaussians back to the image plane via differentiable rendering)
Sky region masking to exclude invalid areas
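The two items above combine into a masked photometric loss. The sketch below assumes an L1 form and evaluates it in plain NumPy on an already-rendered image; in practice the rendering and the loss would live inside an autodiff framework so that gradients reach the Gaussian parameters. The function name and the L1 choice are assumptions.

```python
import numpy as np

def masked_color_loss(rendered, target, sky_mask):
    """Mean absolute color error over non-sky pixels.

    rendered, target: (H, W, 3) images in the same value range.
    sky_mask: (H, W) boolean array, True where the pixel is sky.
    """
    valid = ~sky_mask
    if not valid.any():
        return 0.0                                   # nothing to supervise
    return float(np.abs(rendered - target)[valid].mean())
```

Masking matters because sky pixels have no finite-depth surface for a Gaussian to explain, so including them would push the optimizer to place spurious geometry.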

Key Experimental Results

Main Results

Occ3D-nuScenes (0.4m resolution):

Method	Input (C: camera, L: LiDAR)	Pretraining	mIoU (%)
SelfOcc (CVPR'24)	C	~384 GPU hrs	10.54
GaussianTR (CVPR'25)	C	Yes	11.70
VEON-LiDAR (ECCV'24)	C&L	Yes	15.14
TT-OccCamera	C	None	16.70
RenderOcc (ICRA'24)	C	Yes (sparse 3D GT)	23.93
TT-OccLiDAR	C&L	None	27.41
BEVFormer (ECCV'22)	C	Yes (dense 3D GT)	26.88

nuCraft High-Resolution (0.2m resolution):

Method	Pretraining Time (GPU hours)	mIoU (%)
SelfOcc†	384 hrs	2.22
TT-OccCamera	0	5.95
TT-OccLiDAR	0	10.92

Ablation Study

Configuration	TT-OccLiDAR mIoU	TT-OccCamera mIoU
A: Baseline (single-frame direct splatting)	7.3	4.2
B: + Covariance-aware voxelization	18.3 (+11.0)	8.5 (+4.3)
C: + Inheriting historical Gaussians (no tracking)	23.5 (+5.2)	14.1 (+5.6)
D: + Dynamic Gaussian tracking	25.6 (+2.1)	14.1 (+0.0)

Key Findings

Zero-Training Outperforms Trained Methods: TT-OccLiDAR (27.41 mIoU) surpasses even RenderOcc (23.93), which is trained with sparse 3D GT; TT-OccCamera (16.70) outperforms VEON-LiDAR (15.14), trained with LiDAR supervision.
Strong Resolution Adaptability: On the nuCraft high-resolution setting, SelfOcc drops sharply from 10.54 to 2.22, while TT-Occ adapts without any retraining.
RayIoU Validates Geometric Quality: TT-OccCamera improves over SelfOcc by 30.8% on RayIoU@4; TT-OccLiDAR improves by 115%.
Modular VFM Design: REX-Omni yields the strongest semantics; MapAnything depth outperforms VGGT due to metric-scale depth estimation; the framework supports plug-and-play replacement of VFM components.
Dynamic Tracking Eliminates Trailing Artifacts: Without tracking, IoU for dynamic classes (bus, pedestrian) degrades significantly; enabling tracking recovers it markedly.
Memory Efficiency: Peak GPU memory is 5.6 GB for the LiDAR variant and 9.9 GB for the camera variant, both under 10 GB.

Highlights & Insights

Paradigm Innovation: This is the first work to demonstrate that integrating VFMs at test time can fully replace training a dense occupancy decoder, shifting 3D occupancy prediction from a training paradigm to an inference paradigm.
Exceptional Flexibility: Supports arbitrary voxel resolutions, open-vocabulary semantic queries, and plug-and-play VFM component replacement.
Zero Training Cost: Eliminates hundreds of GPU hours of pretraining; inference runs directly on the validation set.
Temporally-Aware Gaussians: The Lift-Track-Voxelize pipeline enables online incremental scene reconstruction with dynamic-static decomposition to eliminate trailing artifacts.
TRBF Smoothing: Extends bilateral filtering to a trilateral kernel that performs adaptive denoising jointly over spatial, color, and semantic dimensions.

Limitations & Future Work

Dependence on VFM Quality: Performance is upper-bounded by the capabilities of the VFMs employed; with the weaker GroundingSAM2 semantics, mIoU drops to 21.3.
Limited Dynamic Handling for Camera Variant: Unlike the LiDAR variant, the camera-only variant cannot accumulate dynamic objects and can only exclude dynamic regions.
Inference Speed: Semantic segmentation (OpenSeeD) accounts for 28.5%–77.9% of total runtime; the camera variant additionally requires depth estimation, triangulation, and other steps.
Camera Variant Degrades at Long Range: Due to occlusion and depth resolution limitations, the camera-only variant achieves inferior geometric accuracy at long distances compared to the LiDAR variant.
Open-Vocabulary Sensitivity to Prompt Quality: VFM semantic segmentation relies on text prompts; while predefined label sets are used on standard benchmarks, prompt design affects performance in real open-world scenarios.

Related Work

Fully Supervised Occupancy Prediction: BEVFormer, CTF-Occ, and RenderOcc rely on dense or sparse 3D annotations.
Self-Supervised Occupancy Prediction: SelfOcc (SDF + multi-view stereo), OccNeRF (photometric consistency), GaussianOcc/GaussianTR (3DGS representation), LangOcc/VEON (open-vocabulary).
3D Reconstruction for Driving Scenes: OmniRe, Street Gaussians, DrivingGaussian, HUGS — depend on external priors (HD maps, GT bounding boxes) for offline per-scene reconstruction; by contrast, TT-Occ performs online inference using only raw sensor streams.

Rating

Novelty: ⭐⭐⭐⭐⭐ — The first training-free test-time 3D occupancy prediction framework; a paradigm-level contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, two modality variants, multi-VFM combination ablations, and RayIoU evaluation; comprehensive overall.
Writing Quality: ⭐⭐⭐⭐ — Well-structured; the Lift-Track-Voxelize framework is intuitive and easy to follow.
Value: ⭐⭐⭐⭐⭐ — Challenges the necessity of training occupancy models in the VFM era, with far-reaching implications for autonomous driving perception paradigms.