# O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
- Conference: CVPR 2026
- arXiv: 2603.12144
- Code: Coming soon
- Area: Autonomous Driving / 3D Scene Understanding
- Keywords: Omnidirectional occupancy prediction, open-vocabulary, Mamba, contrastive learning, panoramic perception
## TL;DR
O3N is the first purely vision-based, end-to-end omnidirectional open-vocabulary occupancy prediction framework. Through three core modules—Polar Spiral Mamba (PsM), Occupancy Cost Aggregation (OCA), and Natural Modality Alignment (NMA)—it achieves open-vocabulary 3D occupancy prediction from 360° panoramic input that even surpasses some closed-set, fully supervised methods.
## Background & Motivation
Background: 3D semantic occupancy prediction has become a core perception task in autonomous driving and embodied intelligence. Existing methods such as MonoScene, VoxFormer, and SGN have made significant progress under closed-set settings. Meanwhile, panoramic/omnidirectional images are increasingly adopted for scene understanding in embodied agents due to their single-frame 360° coverage.
Limitations of Prior Work: (1) Existing 3D occupancy prediction methods are limited to narrow field-of-view inputs and predefined training category distributions, making them difficult to apply in open-world scenarios requiring comprehensive safety perception. (2) Severe geometric distortions and non-uniform sampling introduced by Equirectangular Projection (ERP) cause distant regions to occupy only a minimal portion of the image. (3) Ternary feature alignment among pixels, voxels, and text tends to overfit under imbalanced data distributions, causing novel-class semantic alignment to fail.
Key Challenge: A fundamental conflict exists between the geometric distortion characteristics of omnidirectional images and the precision requirements of open-vocabulary semantic alignment—ERP projection yields sparse pixels for distant regions, exacerbating the overfitting risk in cross-modal feature alignment.
Goal: Under omnidirectional visual input, the paper aims to simultaneously address three challenges: 360° spatial continuity modeling, open-vocabulary semantic generalization, and cross-modal feature alignment.
Key Insight: (1) Adapt to panoramic geometry via polar spiral scanning; (2) construct voxel-text cost volumes to replace direct feature alignment; (3) bridge the modality gap through gradient-free random walks.
Core Idea: Integrate the geometric properties of omnidirectional perception into the full pipeline of voxel representation, cost aggregation, and modality alignment, realizing the first omnidirectional open-vocabulary occupancy prediction framework.
## Method
### Overall Architecture
O3N takes panoramic ERP images as input and proceeds through four core stages: (1) a visual feature extractor for omnidirectional image feature extraction; (2) a 2D-to-3D view transformation that generates both Cartesian and cylindrical voxel representations; (3) a 3D decoder augmented with the PsM module to learn fine-grained spatial geometry and semantics; (4) an occupancy prediction head to produce final outputs. For the open-vocabulary branch, the OCA and NMA modules ensure ternary semantic consistency among pixels, voxels, and text.
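As a rough sketch, the four stages can be wired together as follows; all callables and shapes are placeholders (the official code is not yet released), not the authors' implementation:

```python
def o3n_forward(erp_image, extract, view_transform, decoder_3d, occ_head):
    """Skeleton of the four-stage O3N pipeline described above.

    The four callables are hypothetical stand-ins for: (1) the visual
    feature extractor, (2) the 2D-to-3D view transformation producing
    Cartesian and cylindrical voxel grids, (3) the PsM-augmented 3D
    decoder fusing the two grids, and (4) the occupancy prediction head.
    """
    feats = extract(erp_image)             # (1) omnidirectional features
    v_cart, v_cyl = view_transform(feats)  # (2) lift to two voxel grids
    v_fused = decoder_3d(v_cart, v_cyl)    # (3) decode + fuse with PsM
    return occ_head(v_fused)               # (4) final occupancy output
```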
### Key Designs
- Polar Spiral Mamba (PsM) Module:
  - Function: Captures long-range dependencies in the intrinsic spatial structure of omnidirectional images and resolves data discontinuities near the poles in cylindrical voxels.
  - Mechanism: Adopts a dual-branch architecture that compresses cylindrical voxels \(\mathbf{V}_p \in \mathbb{R}^{C \times R \times P \times Z}\) into BEV features \(\mathbf{B}_p \in \mathbb{R}^{C \times R \times P}\), then scans outward from the pole along a spiral path (P-SMamba), naturally aligning with the information-density variation from near to far in panoramic imaging.
  - Coordinate Fusion: At each layer, cylindrical voxels are resampled into Cartesian space, and the complementary advantages of the two coordinate systems are aggregated via \(\mathbf{V}_f^i = \mathbf{V}_c^i + \Phi_{\rho(c)}(\mathbf{V}_p^i)\).
  - Design Motivation: Standard 3D convolutions cannot handle the polar discontinuities of cylindrical coordinates, and Transformers incur prohibitive computational costs; Mamba's linear complexity and long-sequence modeling capacity make it well suited to this setting.
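To make the dual-branch geometry concrete, here is a minimal NumPy sketch of (a) serializing the polar BEV grid along the outward spiral and (b) the coordinate fusion \(\mathbf{V}_f = \mathbf{V}_c + \Phi(\mathbf{V}_p)\) via nearest-neighbor polar-to-Cartesian resampling. The scan order, sampling scheme, and shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def spiral_serialize(bev_polar):
    """Flatten (C, R, P) polar BEV features along an outward spiral.

    Visiting angles in order and stepping one ring outward per full
    revolution traces a spiral from the pole (r = 0) to the rim, so a
    row-major flatten of the (R, P) grid yields the spiral sequence
    that would feed the sequence-model branch.
    """
    C, R, P = bev_polar.shape
    return bev_polar.reshape(C, R * P)

def polar_to_cartesian(v_polar, out_size):
    """Nearest-neighbor resampling of (C, R, P) polar features onto a
    (C, out_size, out_size) Cartesian grid centered at the pole."""
    C, R, P = v_polar.shape
    ys, xs = np.meshgrid(np.arange(out_size), np.arange(out_size),
                         indexing="ij")
    cx = cy = (out_size - 1) / 2.0
    dx, dy = xs - cx, ys - cy
    r = np.sqrt(dx**2 + dy**2) / (out_size / 2.0) * (R - 1)
    phi = (np.arctan2(dy, dx) + 2 * np.pi) % (2 * np.pi)
    r_idx = np.clip(np.round(r).astype(int), 0, R - 1)      # radial ring
    p_idx = np.round(phi / (2 * np.pi) * P).astype(int) % P  # angular bin
    return v_polar[:, r_idx, p_idx]

def fuse(v_cart, v_polar):
    """Coordinate fusion V_f = V_c + resample(V_p), as in the paper."""
    return v_cart + polar_to_cartesian(v_polar, v_cart.shape[-1])
```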
- Occupancy Cost Aggregation (OCA) Module:
  - Function: Constructs a voxel-text cost volume and aggregates it along spatial and category dimensions, avoiding the overfitting caused by direct discrete feature alignment.
  - Mechanism: Computes the cosine similarity between voxel embeddings \(\mathbf{V}\) and text embeddings \(\mathbf{T}\) as the occupancy cost \(C(i,l) = \frac{V_i \cdot T_l}{\|V_i\| \|T_l\|}\) to generate coarse 3D semantic masks; spatial aggregation is then performed via ASPP, and category aggregation via a linear Transformer.
  - Supervision Strategy: A scene affinity loss \(\mathcal{L}_{oca}\) captures inter-voxel semantic correlations, comprising Precision, Recall, and Specificity terms; during training, the loss is computed only over base-class voxels.
  - Design Motivation: Direct cross-entropy supervision leads to isolated voxel-semantic mappings that weaken generalization; cost aggregation preserves the continuity of semantic relationships.
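The cost-volume construction can be sketched in a few lines of NumPy; shapes are assumed, and the ASPP and linear-Transformer aggregation stages are omitted:

```python
import numpy as np

def occupancy_cost(voxel_feats, text_embeds, eps=1e-8):
    """Cosine-similarity cost volume between voxels and class texts.

    voxel_feats: (C, X, Y, Z) voxel embeddings
    text_embeds: (L, C) text embeddings, one per class prompt
    returns:     (L, X, Y, Z) cost C(i, l) = <V_i, T_l> / (|V_i||T_l|)
    """
    C = voxel_feats.shape[0]
    V = voxel_feats.reshape(C, -1)                        # (C, N) voxels
    V = V / (np.linalg.norm(V, axis=0, keepdims=True) + eps)
    T = text_embeds / (np.linalg.norm(text_embeds, axis=1,
                                      keepdims=True) + eps)
    cost = T @ V                                          # (L, N)
    return cost.reshape((-1,) + voxel_feats.shape[1:])

def coarse_masks(cost):
    """Coarse 3D semantic mask: per-voxel argmax over classes."""
    return cost.argmax(axis=0)
```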
- Natural Modality Alignment (NMA) Module:
  - Function: Bridges the modality gap between text embeddings and semantic prototypes without gradients, preventing over-reliance on seen semantics.
  - Mechanism: Base-class prototypes are updated via EMA as \(\mathbf{P}_t^b = \alpha \cdot \mathbf{P}_{t-1}^b + (1-\alpha) \cdot \frac{1}{|\Omega_b|}\sum_{i \in \Omega_b} \mathbf{f}_{seg}(i)\); text embeddings and prototypes are then iteratively aggregated via random walks.
  - Convergent Form: A closed-form solution follows from the Neumann series: \(\mathbf{T}_t^\infty = (1-\beta)(\mathbf{I} - \beta^2 \mathcal{A})^{-1}(\beta \mathcal{S} \mathbf{P}_t^0 + \mathbf{T}_t^0)\).
  - Design Motivation: Learning-based alignment strategies overfit to seen semantic distributions; a gradient-free approach preserves the ability to understand unbounded novel semantics.
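The EMA update and the random walk can be checked numerically. The alternating walk below (text → prototypes → text, with restart weight \(1-\beta\) toward the initial embeddings and \(\mathcal{A} = \mathcal{S}\mathcal{S}^\top\)) is a reconstruction whose fixed point matches the closed form above; treating \(\mathcal{S}\) as the text-prototype affinity matrix is an assumption:

```python
import numpy as np

def ema_update(proto_prev, feats_base, alpha=0.99):
    """Gradient-free EMA update of a base-class prototype:
       P_t = alpha * P_{t-1} + (1 - alpha) * mean(f_seg over base voxels)."""
    return alpha * proto_prev + (1 - alpha) * feats_base.mean(axis=0)

def nma_closed_form(T0, P0, S, beta):
    """Converged text embeddings via the closed form
       T_inf = (1-beta) (I - beta^2 A)^{-1} (beta S P0 + T0),  A = S S^T."""
    A = S @ S.T
    I = np.eye(A.shape[0])
    return (1 - beta) * np.linalg.solve(I - beta**2 * A,
                                        beta * S @ P0 + T0)

def nma_iterative(T0, P0, S, beta, steps=200):
    """Alternating random walk with restart whose fixed point satisfies
       (I - beta^2 A) T = (1-beta)(beta S P0 + T0), i.e. the closed form."""
    T, P = T0.copy(), P0.copy()
    for _ in range(steps):
        P = beta * (S.T @ T) + (1 - beta) * P0   # text -> prototypes
        T = beta * (S @ P) + (1 - beta) * T0     # prototypes -> text
    return T
```

Solving the fixed-point equation directly avoids running the walk at all, which is why the closed form both guarantees convergence and keeps the alignment gradient-free.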
### Loss & Training
The total loss consists of three terms: \(\mathcal{L} = \mathcal{L}_{occ} + \mathcal{L}_{vox-pix} + \mathcal{L}_{oca}\), where:
- \(\mathcal{L}_{occ}\): Semantic occupancy supervision from MonoScene, including cross-entropy loss, semantic/geometric scene affinity loss, and frustum proportion loss.
- \(\mathcal{L}_{vox-pix}\): Voxel-pixel feature alignment loss from OVO.
- \(\mathcal{L}_{oca}\): Scene affinity loss from occupancy cost aggregation (computed only over base-class voxels).
During inference, base classes are predicted directly via the occupancy head; novel classes are predicted by combining the similarity between voxel embeddings and text embeddings with the OCA-predicted probabilities.
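For illustration, one plausible way to merge the two branches at inference; the product-and-threshold rule and the class-id offset below are assumptions, not the paper's exact combination:

```python
import numpy as np

def predict(base_logits, novel_sim, novel_oca_prob, novel_thresh=0.5):
    """Hypothetical merge of base and novel predictions per voxel.

    base_logits:    (Nb, N) occupancy-head logits over base classes
    novel_sim:      (Nn, N) voxel-text cosine similarities, novel classes
    novel_oca_prob: (Nn, N) OCA-predicted novel-class probabilities
    """
    novel_score = novel_sim * novel_oca_prob     # combine the two cues
    base_pred = base_logits.argmax(axis=0)       # base-class id per voxel
    novel_best = novel_score.argmax(axis=0)
    take_novel = novel_score.max(axis=0) > novel_thresh
    # novel class ids are offset to follow the base-class ids
    return np.where(take_novel, novel_best + base_logits.shape[0],
                    base_pred)
```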
## Key Experimental Results
### Main Results (QuadOcc Validation Set)
| Method | Input | mIoU | Novel mIoU | Base mIoU |
|---|---|---|---|---|
| MonoScene (Fully Supervised) | C | 19.19 | 25.56 | 12.82 |
| OneOcc (Fully Supervised) | C | 20.56 | 27.53 | 13.59 |
| OVO (Open-Vocabulary) | C | 14.33 | 18.15 | 10.52 |
| O3N (Ours) | C | 16.54 | 21.16 | 11.92 |
| O3N Gain vs. OVO | - | +2.21 | +3.01 | +1.40 |
### Ablation Study (QuadOcc)
| Configuration | mIoU | Novel mIoU | Base mIoU |
|---|---|---|---|
| Baseline (OVO) | 14.33 | 18.15 | 10.52 |
| + PsM | 15.21 | 19.43 | 11.00 |
| + PsM + OCA | 15.89 | 20.31 | 11.48 |
| + PsM + OCA + NMA (Full) | 16.54 | 21.16 | 11.92 |
### Key Findings
- Under the open-vocabulary setting, O3N (mIoU 16.54) even surpasses some fully supervised methods (e.g., SSCNet at 14.60), though it still trails others such as LMSCNet (18.44).
- Novel-class mIoU improves from 18.15 to 21.16 (+3.01), validating a significant enhancement in open-vocabulary capability.
- The framework is generalizable: it also achieves improvement from 13.81 to 15.52 on the SGN-S backbone.
- State-of-the-art results are also achieved on the Human360Occ dataset, validating cross-scene generalization.
## Highlights & Insights
- Originality: The paper is the first to define and address the omnidirectional open-vocabulary occupancy prediction task, unifying panoramic perception with open semantic prediction.
- Geometry-Aware Design: The spiral scanning path of PsM naturally aligns with the information density distribution of panoramic imaging, constituting an elegant inductive bias.
- Theoretical Elegance: NMA derives a closed-form solution via random walks, simultaneously avoiding the overfitting risk of gradient-based optimization and guaranteeing convergence.
- Modular Generality: O3N's modules can be integrated plug-and-play into various occupancy prediction architectures such as MonoScene and SGN.
## Limitations & Future Work
- The absolute performance on novel classes remains low (e.g., vehicle at only 0.52 mIoU), and recognition of extremely rare classes remains challenging.
- Base-class performance degrades under the open-vocabulary setting compared to fully supervised methods, indicating a persistent trade-off between open-vocabulary capability and closed-set accuracy.
- Validation is limited to panoramic datasets; extension to multi-camera inputs has not been explored.
- Computational overhead analysis is insufficient, and real-time feasibility for practical deployment requires further evaluation.
## Related Work & Insights
- OVO (Tan et al., 2023): A pioneer in open-vocabulary occupancy prediction; this paper builds upon it by introducing omnidirectional perception and cost aggregation improvements.
- CAT-Seg (Cho et al., 2024): The cost aggregation idea from 2D open-vocabulary segmentation is extended in this paper to 3D voxel space.
- OneOcc (Shi et al., 2025): The state-of-the-art in omnidirectional fully supervised occupancy prediction, providing baselines and datasets for this paper.
- Insight: The paradigm of replacing direct feature alignment with cost aggregation merits broader adoption in other cross-modal tasks.
## Rating (⭐)
| Dimension | Rating |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |