OccAny: Generalized Unconstrained Urban 3D Occupancy¶
Conference: CVPR 2026 arXiv: 2603.23502 Code: https://github.com/valeoai/OccAny Area: Autonomous Driving Keywords: 3D Occupancy Prediction, Generalization, Unconstrained Scenes, Visual Geometry Foundation Models, Semantic Segmentation
TL;DR¶
OccAny proposes the first generalized unconstrained urban 3D occupancy prediction framework, capable of predicting metric-scale occupancy voxels from monocular, sequential, or surround-view images in calibration-free, out-of-domain scenes. Through two key designs—Segmentation Forcing and Novel View Rendering—it surpasses all visual geometry baselines on KITTI and nuScenes.
Background & Motivation¶
Background: 3D Occupancy Prediction is a core perception task in autonomous driving, aiming to jointly estimate the occupancy state and semantic labels of dense voxels. Existing methods such as SurroundOcc and OccFormer have achieved strong results on nuScenes-Occ and SemanticKITTI.
Limitations of Prior Work: (1) Existing methods rely heavily on in-domain annotated data and precise sensor calibration parameters (intrinsics and extrinsics), preventing generalization to new scenes; (2) Visual geometry foundation models (e.g., DUSt3R, Depth Anything) exhibit strong generalization but lack geometry completion capability for occluded regions in urban scenes and metric-scale prediction accuracy; (3) No unified framework supports all three input modes—sequential, monocular, and surround-view—simultaneously.
Key Challenge: High-accuracy occupancy prediction requires proprietary data and calibration, which are often unavailable in practice. A gap exists between the generality of visual foundation models and the urban-scene specialization required for occupancy prediction.
Goal: To construct the first "unconstrained" 3D occupancy prediction framework that generates metric-scale occupancy predictions and segmentation features from arbitrarily configured camera inputs in fully uncalibrated, out-of-domain scenes.
Key Insight: The authors observe that the strong generalizable reconstruction capability of visual geometry foundation models (MUSt3R / Depth Anything 3) can be combined with the semantic capability of large-scale segmentation models (SAM2 / SAM3), with dedicated training strategies bridging the gap for urban scenes.
Core Idea: The paper proposes Segmentation Forcing to compel the model to learn occupancy representations consistent with segmentation outputs, and Novel View Rendering to achieve geometry completion via virtual viewpoint synthesis, thereby constructing a unified framework that retains generalization ability while adapting to urban scenes.
Method¶
Overall Architecture¶
OccAny adopts a two-stage pipeline: (1) a reconstruction stage, in which a visual geometry foundation model reconstructs a 3D point cloud from input images while a segmentation model extracts per-point semantic features; and (2) a rendering stage, in which Novel View Rendering infers the geometry of unobserved regions from virtual viewpoints to complete the occupancy. The final 3D point cloud is voxelized into an occupancy grid. The framework has two variants: OccAny (based on MUSt3R + SAM2) and OccAny+ (based on Depth Anything 3 + SAM3).
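To make the data flow concrete, here is a minimal Python sketch of the two-stage pipeline. Every callable and name (`reconstruct`, `segment`, `render_virtual`, `voxelize`) is a hypothetical placeholder standing in for the MUSt3R/DA3 backbone, the SAM2/SAM3 segmenter, the novel view renderer, and majority-pooling voxelization; this is not the authors' released API.

```python
import numpy as np

def occany_inference(images, reconstruct, segment, render_virtual, voxelize):
    """High-level sketch of OccAny inference; every callable is a hypothetical
    placeholder, not part of the official implementation."""
    # Stage 1 (reconstruction): metric-scale point cloud plus per-point features.
    points, feats = reconstruct(images)          # (N, 3), (N, C)
    labels = segment(images, points, feats)      # (N,) semantic labels

    # Stage 2 (rendering): geometry recovered from virtual viewpoints fills in
    # occluded regions; everything is then merged and voxelized.
    extra_points, extra_labels = render_virtual(points, labels)
    all_points = np.concatenate([points, extra_points], axis=0)
    all_labels = np.concatenate([labels, extra_labels], axis=0)
    return voxelize(all_points, all_labels)      # labeled occupancy grid
```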
Key Designs¶
- Segmentation Forcing (a minimal loss sketch follows this list):
  - Function: Improves the semantic quality of occupancy predictions while enabling mask-level instance prediction.
  - Mechanism: During training, segmentation masks generated by SAM2/SAM3 serve as supervision signals, enforcing alignment in feature space between the reconstruction model's output point cloud and the segmentation masks. Specifically, for each mask produced by the segmentation model, the features of the 3D points falling inside the masked region are distilled toward a single consistent semantic vector. This distillation paradigm lets geometry reconstruction and semantic segmentation be learned jointly in a unified space.
  - Design Motivation: The raw reconstruction output of foundation models lacks fine-grained semantic discrimination. With Segmentation Forcing, occupancy voxels carry not only geometric information but also segmentation features consistent with SAM, improving geometry and semantics jointly.
- Novel View Rendering (see the pose-sampling sketch after this list):
  - Function: Completes 3D geometry in occluded and unobserved regions.
  - Mechanism: At test time, the trained geometry model samples a set of virtual camera viewpoints (generated via rotations and translations) around the existing reconstructed point cloud and renders depth maps and point clouds from these novel views. The rendered results from multiple virtual viewpoints are merged with the original reconstruction to form a more complete 3D point cloud. This is a test-time augmentation strategy that requires no additional training.
  - Design Motivation: Reconstruction from a single or small number of real viewpoints inevitably suffers from occlusion blind spots. By rendering from virtual viewpoints, the model can "imagine" the back sides of occluded objects and distant regions, significantly improving occupancy recall.
- Majority Pooling Voxelization (a minimal sketch follows this list):
  - Function: Aggregates multi-source 3D point clouds into a voxel occupancy grid.
  - Mechanism: All 3D points from reconstruction viewpoints and rendered viewpoints are mapped into a unified voxel grid. For each voxel, majority voting over the semantic labels of all contained points determines the final label. Two modes are supported: "separate" (reconstruction and rendering vote independently) and "unified" (combined voting).
  - Design Motivation: Multi-view point clouds may produce conflicting semantic predictions in voxel space; majority voting is a simple and robust aggregation strategy.
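A minimal PyTorch sketch of the Segmentation Forcing idea, under the assumption that each reconstructed 3D point has already been associated with one SAM mask: point features inside a mask are pulled toward a shared (mean) semantic vector. The exact loss in the paper may differ; `segmentation_forcing_loss` and its arguments are illustrative.

```python
import torch

def segmentation_forcing_loss(point_feats: torch.Tensor, mask_ids: torch.Tensor) -> torch.Tensor:
    """Pull the features of 3D points that fall inside the same SAM mask toward
    a shared semantic vector (here, the mask-wise mean). Illustrative only.

    point_feats: (N, C) features of reconstructed 3D points.
    mask_ids:    (N,)   index of the SAM mask each point projects into.
    """
    loss = point_feats.new_zeros(())
    unique_masks = mask_ids.unique()
    for m in unique_masks:
        feats = point_feats[mask_ids == m]                 # points inside one mask
        target = feats.mean(dim=0, keepdim=True).detach()  # consistent semantic vector
        loss = loss + (feats - target).pow(2).mean()       # pull features toward it
    return loss / unique_masks.numel()
```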
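The core of Novel View Rendering is generating extra camera poses to render from. The sketch below shows one plausible way to sample virtual viewpoints by yaw rotation and lateral translation of a real camera pose; the paper's actual sampling scheme (and the number of viewpoints, reported as 150 to 180 in some settings) is not reproduced here, so treat the function and its defaults as assumptions.

```python
import numpy as np

def sample_virtual_poses(base_pose: np.ndarray, num_yaw: int = 8,
                         lateral_offsets=(0.0, 2.0, -2.0)):
    """Generate virtual camera-to-world poses by rotating the base viewpoint
    around the vertical axis and shifting it sideways. Illustrative only.

    base_pose: (4, 4) camera-to-world matrix of a real viewpoint.
    Returns a list of (4, 4) virtual camera-to-world matrices.
    """
    poses = []
    for dx in lateral_offsets:                  # lateral shift in metres
        for k in range(num_yaw):                # evenly spaced yaw angles
            yaw = 2.0 * np.pi * k / num_yaw
            c, s = np.cos(yaw), np.sin(yaw)
            rot = np.array([[c, 0.0, s],
                            [0.0, 1.0, 0.0],
                            [-s, 0.0, c]])
            pose = base_pose.copy()
            pose[:3, :3] = rot @ base_pose[:3, :3]
            pose[:3, 3] = base_pose[:3, 3] + base_pose[:3, :3] @ np.array([dx, 0.0, 0.0])
            poses.append(pose)
    return poses
```

Each virtual pose would then be fed to the trained geometry model (or a point-cloud renderer) to produce a depth map that is unprojected and merged with the original reconstruction.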
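Majority pooling voxelization is straightforward to sketch in the "unified" mode, where reconstructed and rendered points vote together. The function name, default voxel size, and return format below are assumptions for illustration, not the released implementation.

```python
import numpy as np
from collections import Counter

def voxelize_majority(points: np.ndarray, labels: np.ndarray, voxel_size: float = 0.2):
    """Map every 3D point to a voxel index and let it vote for that voxel's
    semantic label; the most frequent label wins ("unified" voting mode).

    points: (N, 3) metric-scale points from reconstruction and rendered views.
    labels: (N,)   per-point semantic labels (integers).
    Returns a dict mapping (i, j, k) voxel indices to the winning label.
    """
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    votes = {}
    for idx, lab in zip(map(tuple, voxel_idx), labels.tolist()):
        votes.setdefault(idx, Counter())[lab] += 1
    return {idx: counter.most_common(1)[0][0] for idx, counter in votes.items()}
```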
Loss & Training¶
Training proceeds in two stages: (1) the reconstruction stage trains geometry prediction with a metric depth regression loss while applying the Segmentation Forcing feature distillation loss for semantic alignment; (2) the rendering stage trains depth prediction capability from virtual viewpoints. Both stages use joint training across multiple driving datasets (Waymo, VKITTI, DDAD, PandaSet, ONCE) to enhance generalization. OccAny is trained on 16 × A100 40G GPUs.
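As a rough illustration of the reconstruction-stage objective, the sketch below combines an L1 metric-depth regression term with the Segmentation Forcing distillation term sketched in the Key Designs section. The specific loss functions and weighting are assumptions; the paper may use different formulations.

```python
def reconstruction_stage_loss(pred_depth, gt_depth, point_feats, mask_ids,
                              seg_weight: float = 1.0):
    """Hypothetical combined objective on torch tensors: metric depth regression
    plus the Segmentation Forcing term (segmentation_forcing_loss sketched above).
    The choice of L1 and the weighting are assumptions, not from the paper."""
    valid = gt_depth > 0                                    # ignore missing depth
    depth_loss = (pred_depth[valid] - gt_depth[valid]).abs().mean()
    seg_loss = segmentation_forcing_loss(point_feats, mask_ids)
    return depth_loss + seg_weight * seg_loss
```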
Key Experimental Results¶
Main Results¶
| Dataset / Setting | Metric | OccAny | OccAny+ | Prev. Best Baseline | Notes |
|---|---|---|---|---|---|
| KITTI 5-frame Geometry | IoU | 25.91 | 27.33 | <25 | Surpasses all visual geometry baselines |
| nuScenes Surround Geometry | IoU | 34.15 | 33.49 | <33 | No in-domain annotation required |
| KITTI 5-frame Semantic (distill) | mIoU / mIoU^SC | 7.30 / 13.54 | 6.49 / 13.31 | — | Segmentation Forcing |
| nuScenes Surround Semantic (distill) | mIoU / mIoU^SC | 6.65 / 10.31 | 7.20 / 11.51 | — | OccAny+ superior |
| KITTI 5-frame Semantic (pretrained) | mIoU / mIoU^SC | 7.62 / 13.75 | 8.03 / 13.17 | — | Using pretrained segmentation |
| nuScenes Surround Semantic (pretrained) | mIoU / mIoU^SC | 7.42 / 10.78 | 9.45 / 12.22 | — | OccAny+ leads by a large margin |
Ablation Study¶
| Configuration | KITTI IoU | Notes |
|---|---|---|
| OccAny Full (5-frame) | 25.91 | Full model |
| w/o Novel View Rendering | ~22 | Large drop upon removal |
| w/o Segmentation Forcing | ~24 | Significant degradation in semantic quality |
| OccAny+ Full (DA3 + SAM3) | 27.33 | Stronger foundation models yield further gains |
| 1-frame vs. 5-frame | 24.03 vs. 25.91 | Multi-frame input provides additional geometric cues |
Key Findings¶
- Novel View Rendering is the largest contributor to geometric IoU improvement, accounting for approximately 3–4 absolute IoU points.
- OccAny+ (Depth Anything 3 1.1B + SAM3) achieves better overall performance, though OccAny (MUSt3R + SAM2) remains competitive in certain settings.
- The framework achieves performance close to in-domain self-supervised methods under fully out-of-domain conditions (without having seen KITTI/nuScenes), demonstrating strong generalization.
- On depth estimation metrics, OccAny+ recon 1.1B achieves an AbsRel of only 9.58% on KITTI, a large improvement over the original DA3's 33.28% (lower is better).
- On ego-trajectory evaluation, OccAny+ recon 1.1B achieves an ADE of 0.90, outperforming DA3 1.1B (1.12).
Highlights & Insights¶
- First truly generalized 3D occupancy prediction framework: No target-domain annotations, calibration, or fine-tuning are required; the model works directly at inference time. This is of significant practical deployment value.
- Test-time augmentation via novel view rendering is an elegant design: it adds no training complexity and completes geometry solely by "imagining" unseen viewpoints at inference time—a principle transferable to any 3D reconstruction task.
- The modular design of the framework is instructive: geometry reconstruction and semantic segmentation are handled by dedicated foundation models and bridged via Segmentation Forcing, such that upgrading a foundation model upgrades the entire system.
Limitations & Future Work¶
- Inference speed is relatively slow: some settings require rendering 150–180 virtual viewpoints, resulting in considerable single-GPU inference time.
- Absolute semantic mIoU remains low (7–9%), with a substantial gap compared to supervised methods, primarily limited by the alignment quality between SAM features and specific semantic categories.
- In heavily occluded dense urban scenes (e.g., narrow alleyways), the geometry completion effect of novel view rendering may be limited.
- The current framework supports only static scenes; temporal consistency for dynamic objects (pedestrians, vehicles) is not modeled.
Related Work & Insights¶
- vs. SurroundOcc / OccFormer: These methods require in-domain 3D annotation training and depend on precise calibration, offering poor generalization. OccAny achieves fully zero-shot generalization; while its absolute accuracy is slightly lower, its generality far exceeds these approaches.
- vs. DUSt3R / MASt3R: Visual geometry foundation models provide strong generalizable reconstruction but lack occupancy completion and semantic prediction. OccAny adds segmentation fusion and novel view rendering on top of them.
- vs. Depth Anything 3: DA3 provides strong monocular depth estimation; OccAny+ replaces MUSt3R with it as the geometry backbone, demonstrating the flexibility of the framework.
- The Segmentation Forcing mechanism in this paper is analogous to knowledge distillation but applied to geometry–semantics alignment, and is transferable to other 3D perception tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First generalized unconstrained 3D occupancy framework; both core designs are genuinely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers KITTI and nuScenes, three input modes, and includes reconstruction and trajectory evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Logically clear, though the method section is somewhat complex in places.
- Value: ⭐⭐⭐⭐⭐ — Extremely high practical deployment value; addresses the core generalization bottleneck in 3D occupancy prediction.