TT-Occ: Test-Time 3D Occupancy Prediction¶
Conference: CVPR 2026 arXiv: 2503.08485 Code: Xian-Bei/TT-Occ Area: Autonomous Driving / 3D Occupancy Prediction Keywords: 3D occupancy prediction, test-time, 3D Gaussian Splatting, vision foundation models, self-supervised, open-vocabulary
TL;DR¶
This paper proposes TT-Occ, a training-free test-time 3D occupancy prediction framework that integrates vision foundation models (VFMs) at inference time to incrementally construct, refine, and voxelize temporally-aware 3D Gaussians. TT-Occ surpasses all self-supervised methods, each of which requires extensive training, on both the Occ3D-nuScenes and nuCraft benchmarks.
Background & Motivation¶
Importance of 3D Occupancy Prediction: 3D occupancy prediction requires accurately identifying regions occupied by objects of specific categories as well as free space, which is critical for collision-free trajectory planning and reliable navigation in autonomous driving.
High Annotation Cost of Supervised Methods: Existing fully supervised approaches rely heavily on dense per-frame 3D annotations covering roughly an 80 m range, which are extremely costly to obtain in dynamic driving scenarios.
High Training Cost of Self-Supervised Methods: Although self-supervised methods reduce annotation costs, their training overhead remains substantial — for instance, SelfOcc requires approximately 384 GPU hours (8 GPUs for 2 days) at 0.4m resolution on Occ3D-nuScenes.
Poor Generalization: Once trained, adapting to finer resolutions (e.g., 0.2m on nuCraft) or novel object categories demands extensive retraining, limiting flexibility.
The Rise of VFMs Changes the Landscape: 3D vision foundation models such as VGGT and MapAnything provide reliable multi-view geometry, while REX-Omni supports open-vocabulary semantic reasoning — all accessible directly at test time without task-specific training.
Core Problem: Given that geometric and semantic information no longer needs to be learned by a network, is training an occupancy prediction model still necessary? TT-Occ answers in the negative.
Method¶
Overall Architecture: Lift-Track-Voxelize¶
TT-Occ follows a three-stage Lift-Track-Voxelize pipeline with two variants: LiDAR-based (TT-OccLiDAR) and camera-only (TT-OccCamera).
Step 1: Lift — Lifting Geometry and Semantics into 3D Gaussians¶
- Modality-Specific Initialization:
- TT-OccLiDAR: Sparse LiDAR points are directly initialized as 3D Gaussians, inheriting accurate spatial positions from real-world measurements.
- TT-OccCamera: A 3D vision foundation model (VGGT/MapAnything) estimates dense depth maps from multi-view RGB inputs; multi-view triangulation is applied to resolve scale ambiguity.
- VFM Semantics: An open-vocabulary segmentation model (OpenSeeD/GroundingSAM2/REX-Omni) extracts semantic maps from \(M\) surround-view images, which are fused into 3D via visibility-weighted projection: \(\mathbf{m}_i = \frac{1}{M}\sum_{m=1}^{M}\mathbb{I}_m(\boldsymbol{\mu}_i)\mathcal{M}_m(\text{Proj}(\boldsymbol{\mu}_i;\mathbf{K}_m,\mathbf{E}_m))\)
- Voxel-Aware Simplification: Scale parameters are constrained with sigmoid (rather than exponential) to prevent excessive growth; redundant Gaussians within the same voxel are pruned and their semantic probabilities are merged.
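The visibility-weighted semantic fusion in the Lift step can be sketched as follows. This is a numpy illustration of the equation above, not the paper's implementation; the projection and visibility checks are simple assumptions (point in front of the camera and inside the image bounds).

```python
import numpy as np

def fuse_semantics(mu, seg_maps, Ks, Es, num_classes):
    """Fuse per-view semantic maps into per-Gaussian class distributions.

    mu:       (N, 3) Gaussian centers in world coordinates
    seg_maps: list of M (H, W, num_classes) per-pixel class-probability maps
    Ks, Es:   per-camera intrinsics (3, 3) and extrinsics (4, 4, world->cam)
    """
    N, M = mu.shape[0], len(seg_maps)
    fused = np.zeros((N, num_classes))
    for m in range(M):
        # World -> camera coordinates
        homo = np.concatenate([mu, np.ones((N, 1))], axis=1)   # (N, 4)
        cam = (Es[m] @ homo.T).T[:, :3]
        # Camera -> pixel coordinates via intrinsics (Proj in the equation)
        z = cam[:, 2:3]
        uv = (Ks[m] @ cam.T).T[:, :2] / np.clip(z, 1e-6, None)
        H, W, _ = seg_maps[m].shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        # Visibility indicator I_m: in front of camera and inside the image
        vis = (z[:, 0] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        fused[vis] += seg_maps[m][v[vis], u[vis]]
    return fused / M  # average over the M surround views
```

Each Gaussian thus accumulates class probabilities only from views in which it is actually visible, averaged over the camera rig.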
Step 2: Track — Tracking Dynamic Gaussians¶
Fast-moving objects (vehicles, pedestrians) are typically only partially observed, and naively optimizing their 3D Gaussians online without modeling motion introduces severe trailing artifacts.
- TT-OccLiDAR: A learning-free scene flow estimation approach — LiDAR points are projected onto segmentation masks to associate instances → DBSCAN denoising → cross-frame cluster matching based on spatial/shape similarity → ICP-based 3D motion flow estimation.
- TT-OccCamera: RAFT is used to estimate optical flow; ego-motion flow is subtracted to obtain residual dynamic flow → thresholding yields a dynamic mask identifying moving regions → corresponding 3D Gaussians are excluded from static accumulation (a compromise strategy to avoid noise amplification from 3D back-projection).
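The camera variant's dynamic-region detection above reduces to a residual-flow threshold. A minimal numpy sketch, assuming `flow` and `ego_flow` are (H, W, 2) pixel-displacement fields (e.g. from RAFT and from warping with the known ego pose plus depth); the threshold value is an illustrative assumption:

```python
import numpy as np

def dynamic_mask(flow, ego_flow, thresh=1.0):
    """Return a boolean (H, W) mask of pixels whose motion is not
    explained by ego-motion alone."""
    residual = flow - ego_flow               # subtract ego-induced flow
    mag = np.linalg.norm(residual, axis=-1)  # per-pixel residual magnitude
    return mag > thresh                      # True where the scene itself moves
```

Gaussians that project into masked pixels are then excluded from static accumulation, per the compromise strategy described above.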
Step 3: Voxelize — Gaussian Voxelization¶
- Differentiable Optimization: Gaussian parameters are refined at test time via a color consistency loss.
- Trilateral Radial Basis Function (TRBF) Smoothing: An optional denoising module that periodically smooths Gaussian parameters by jointly leveraging spatial, color, and semantic affinity: \(\mathcal{K}(i,j) = \mathcal{K}_{\boldsymbol{\mu}}(i,j)\cdot\mathcal{K}_{\mathbf{c}}(i,j)\cdot\mathcal{K}_{\mathbf{m}}(i,j)\)
- Flexible-Resolution Voxelization: Semantic probabilities are aggregated into a discrete occupancy grid weighted by spatial proximity, supporting arbitrary user-specified resolutions.
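The TRBF kernel and the proximity-weighted voxelization can be sketched together. This is a numpy illustration under stated assumptions: each kernel factor is a Gaussian RBF on its attribute (bandwidths are placeholders, not the paper's values), and the voxel aggregation uses simple inverse-distance weights rather than the paper's exact formulation.

```python
import numpy as np

def trilateral_kernel(mu, color, sem, sigma_mu=0.5, sigma_c=0.1, sigma_m=0.5):
    """Pairwise affinity K(i,j) = K_mu(i,j) * K_c(i,j) * K_m(i,j), each factor
    a Gaussian RBF on position, color, and semantics respectively."""
    def rbf(x, sigma):
        d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return rbf(mu, sigma_mu) * rbf(color, sigma_c) * rbf(sem, sigma_m)

def voxelize(mu, sem, voxel_size, grid_origin, grid_shape):
    """Aggregate per-Gaussian semantic probabilities into a voxel grid of any
    user-chosen resolution, weighting each Gaussian by proximity to the
    voxel center (illustrative inverse-distance weights)."""
    C = sem.shape[1]
    grid = np.zeros((*grid_shape, C))
    idx = np.floor((mu - grid_origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    for i in np.flatnonzero(inside):
        center = grid_origin + (idx[i] + 0.5) * voxel_size
        w = 1.0 / (1e-6 + np.linalg.norm(mu[i] - center))
        grid[tuple(idx[i])] += w * sem[i]
    # Normalize occupied voxels to proper class distributions
    total = grid.sum(-1, keepdims=True)
    return np.where(total > 0, grid / np.clip(total, 1e-9, None), grid)
```

Because the grid resolution is a runtime argument rather than a baked-in network output shape, switching from 0.4 m to 0.2 m voxels requires no retraining.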
Loss & Training¶
- Color consistency loss (projecting 3D Gaussians back to the image plane via differentiable rendering)
- Sky region masking to exclude invalid areas
Key Experimental Results¶
Main Results¶
Occ3D-nuScenes (0.4m resolution):
| Method | Input | Pretraining | mIoU |
|---|---|---|---|
| SelfOcc (CVPR'24) | C | ~384 GPU hrs | 10.54 |
| GaussianTR (CVPR'25) | C | Yes | 11.70 |
| VEON-LiDAR (ECCV'24) | C&L | Yes | 15.14 |
| TT-OccCamera | C | None | 16.70 |
| RenderOcc (ICRA'24) | C | Yes (sparse 3D GT) | 23.93 |
| BEVFormer (ECCV'22) | C | Yes (dense 3D GT) | 26.88 |
| TT-OccLiDAR | C&L | None | 27.41 |
nuCraft High-Resolution (0.2m resolution):
| Method | Pretraining Time | mIoU |
|---|---|---|
| SelfOcc† | 384 hrs | 2.22 |
| TT-OccCamera | 0 | 5.95 |
| TT-OccLiDAR | 0 | 10.92 |
Ablation Study¶
| Configuration | TT-OccLiDAR mIoU | TT-OccCamera mIoU |
|---|---|---|
| A: Baseline (single-frame direct splatting) | 7.3 | 4.2 |
| B: + Covariance-aware voxelization | 18.3 (+11.0) | 8.5 (+4.3) |
| C: + Inheriting historical Gaussians (no tracking) | 23.5 (+5.2) | 14.1 (+5.6) |
| D: + Dynamic Gaussian tracking | 25.6 (+2.1) | 14.1 (+0.0) |
Key Findings¶
- Zero-Training Outperforms Trained Methods: TT-OccLiDAR (27.41 mIoU) surpasses even RenderOcc (23.93), which is trained with sparse 3D GT; TT-OccCamera (16.70) outperforms VEON-LiDAR (15.14), trained with LiDAR supervision.
- Strong Resolution Adaptability: On the nuCraft high-resolution setting, SelfOcc drops sharply from 10.54 to 2.22, while TT-Occ adapts without any retraining.
- RayIoU Validates Geometric Quality: TT-OccCamera improves over SelfOcc by 30.8% on RayIoU@4; TT-OccLiDAR improves by 115%.
- Modular VFM Design: REX-Omni yields the strongest semantics; MapAnything depth outperforms VGGT due to metric-scale depth estimation; the framework supports plug-and-play replacement of VFM components.
- Dynamic Tracking Eliminates Trailing Artifacts: Without tracking, IoU for dynamic classes (bus, pedestrian) degrades significantly and recovers markedly with tracking.
- Memory Efficiency: Peak GPU memory is 5.6 GB for the LiDAR variant and 9.9 GB for the camera variant, both under 10 GB.
Highlights & Insights¶
- Paradigm Innovation: This is the first work to demonstrate that integrating VFMs at test time can fully replace training a dense occupancy decoder, shifting 3D occupancy prediction from a training paradigm to an inference paradigm.
- Exceptional Flexibility: Supports arbitrary voxel resolutions, open-vocabulary semantic queries, and plug-and-play VFM component replacement.
- Zero Training Cost: Eliminates hundreds of GPU hours of pretraining; inference runs directly on the validation set.
- Temporally-Aware Gaussians: The Lift-Track-Voxelize pipeline enables online incremental scene reconstruction with dynamic-static decomposition to eliminate trailing artifacts.
- TRBF Smoothing: Extends bilateral filtering to a trilateral kernel that performs adaptive denoising jointly over spatial, color, and semantic dimensions.
Limitations & Future Work¶
- Dependence on VFM Quality: Performance is upper-bounded by the capabilities of the VFMs employed; with the weaker GroundingSAM2 semantics, mIoU drops to 21.3.
- Limited Dynamic Handling for Camera Variant: Unlike the LiDAR variant, the camera-only variant cannot accumulate dynamic objects and can only exclude dynamic regions.
- Inference Speed: Semantic segmentation (OpenSeeD) accounts for 28.5%–77.9% of total runtime; the camera variant additionally requires depth estimation, triangulation, and other steps.
- Camera Variant Degrades at Long Range: Due to occlusion and depth resolution limitations, the camera-only variant achieves inferior geometric accuracy at long distances compared to the LiDAR variant.
- Open-Vocabulary Sensitivity to Prompt Quality: VFM semantic segmentation relies on text prompts; while predefined label sets are used on standard benchmarks, prompt design affects performance in real open-world scenarios.
Related Work & Insights¶
- Fully Supervised Occupancy Prediction: BEVFormer, CTF-Occ, RenderOcc — rely on dense/sparse 3D annotations.
- Self-Supervised Occupancy Prediction: SelfOcc (SDF + multi-view stereo), OccNeRF (photometric consistency), GaussianOcc/GaussianTR (3DGS representation), LangOcc/VEON (open-vocabulary).
- 3D Reconstruction for Driving Scenes: OmniRe, Street Gaussians, DrivingGaussian, HUGS — depend on external priors (HD maps, GT bounding boxes) for offline per-scene reconstruction; by contrast, TT-Occ performs online inference using only raw sensor streams.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The first training-free test-time 3D occupancy prediction framework; a paradigm-level contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, two modality variants, multi-VFM combination ablations, and RayIoU evaluation; comprehensive overall.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured; the Lift-Track-Voxelize framework is intuitive and easy to follow.
- Value: ⭐⭐⭐⭐⭐ — Challenges the necessity of training occupancy models in the VFM era, with far-reaching implications for autonomous driving perception paradigms.