Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots
- Conference: CVPR 2026
- arXiv: 2603.13108
- Code: PanoMMOcc (coming soon)
- Area: Autonomous Driving
- Keywords: Panoramic occupancy prediction, multimodal fusion, quadruped robots, semantic occupancy, BEV perception
TL;DR
This paper introduces PanoMMOcc, the first panoramic multimodal semantic occupancy prediction dataset for quadruped robots, along with the VoxelHound framework. By incorporating a Vertical Jitter Compensation (VJC) module and a Multimodal Information Prompt Fusion (MIPF) module, VoxelHound achieves 23.34% mIoU under a four-modality setup (panoramic RGB + thermal + polarization + LiDAR), surpassing existing methods by +4.16%.
Background & Motivation
- Panoramic perception requirements: Panoramic images provide 360° blind-spot-free visual coverage, which is critical for mobile agents in dynamic unstructured environments. However, existing occupancy prediction methods are primarily designed for multi-camera narrow-FoV setups in autonomous driving.
- Challenges of quadruped platforms: Compared to wheeled platforms, quadruped robots inherently suffer from low viewpoints, frequent self-occlusion, and strong ego-motion caused by gait dynamics—challenges that existing methods do not address.
- Limitations of RGB-only sensing: Relying solely on the RGB modality yields insufficient robustness under illumination changes, low-texture regions, and long-range perception scenarios, necessitating multimodal sensor fusion.
- Dataset gap: Existing panoramic datasets focus on 2D visual tasks and lack 3D occupancy annotations; existing occupancy benchmarks target autonomous driving and cover neither panoramic imaging nor quadruped platforms.
- Insufficient fusion strategies: Common multimodal fusion approaches (simple concatenation or addition) treat heterogeneous sensor contributions indiscriminately, diluting geometric consistency and introducing cross-modal interference.
- Gait-induced jitter: Quadruped locomotion causes vertical body oscillations, leading to spatial misalignment of captured image features and degrading the stability of BEV transformation.
Method
Overall Architecture: VoxelHound
The model takes four modalities as input: panoramic RGB image \(\mathcal{I}^{pan}\), thermal image \(\mathcal{I}^{th}\), polarization image \(\mathcal{I}^{pol}\), and LiDAR point cloud \(\mathcal{P}\). Each modality is processed by an independent encoder for feature extraction and projected into a unified BEV space for fusion; the final output is a 3D semantic occupancy prediction \(\mathbf{O} \in \mathbb{R}^{X \times Y \times Z}\).
Multimodal Fusion Network
- Camera branch: Each of the three image modalities is processed by a ResNet-18 backbone to extract multi-scale features, aggregated via FPN, and then transformed to BEV features \(\mathbf{F}_c^m \in \mathbb{R}^{C_m \times H \times W}\) via a 2D-to-BEV view transformation.
- LiDAR branch: The point cloud is voxelized and processed by sparse 3D convolutions (stride=8) to extract hierarchical geometric features, which are collapsed into BEV features \(\mathbf{F}_l \in \mathbb{R}^{C_l \times H \times W}\).
- Fusion branch: The fused features are enhanced by a SECOND-FPN BEV encoder for multi-scale spatial representation. The occupancy head reshapes BEV channels into vertical bins, producing a 64×64×16 voxel prediction.
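As a rough illustration of the channel-to-height reshaping described in the fusion branch above, here is a minimal PyTorch sketch. The class name `OccupancyHead`, the 1×1 convolution head, the exact channel ordering, and whether a free/empty label is counted among the classes are assumptions for illustration, not the paper's implementation.

```python
import torch.nn as nn

class OccupancyHead(nn.Module):
    """Hedged sketch: fold BEV channels into Z vertical bins of per-voxel class logits."""
    def __init__(self, in_ch, num_classes=12, z_bins=16):
        super().__init__()
        self.num_classes, self.z_bins = num_classes, z_bins
        # Predict Z * num_classes channels per BEV cell, then unfold into height bins.
        self.head = nn.Conv2d(in_ch, z_bins * num_classes, kernel_size=1)

    def forward(self, bev):                       # bev: (B, C, 64, 64)
        b, _, h, w = bev.shape
        logits = self.head(bev)                   # (B, Z * K, H, W)
        logits = logits.view(b, self.num_classes, self.z_bins, h, w)
        # Rearrange to a dense voxel grid of class logits: (B, K, X=64, Y=64, Z=16).
        return logits.permute(0, 1, 3, 4, 2)
```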
Vertical Jitter Compensation Module (VJC)
Inserted between the image encoder and the BEV view transformation, the module proceeds in five steps:

1. Mean pooling along the width dimension to obtain the vertical structure \(\mathbf{F}_v \in \mathbb{R}^{C \times H}\).
2. A two-layer Conv1D + ReLU stack to encode the vertical features.
3. Adaptive average pooling + a linear layer to predict the global vertical offset \(\Delta h\).
4. Normalization of \(\Delta h\) into grid-coordinate offsets to construct a displacement sampling grid.
5. Bilinear grid sampling to obtain the compensated feature \(\mathbf{F}_{comp}\).
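A minimal PyTorch sketch of these five steps follows. The kernel sizes, the assumption that \(\Delta h\) is predicted in pixel units and normalized by the feature height, and the module/variable names are illustrative choices; only the hidden width of 64 matches the reported \(C_{hd}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerticalJitterCompensation(nn.Module):
    """Hedged sketch of VJC: estimate one global vertical offset and resample the feature map."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        # Step 2: two-layer Conv1D + ReLU encoder over the vertical (height) axis.
        self.encoder = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # Step 3: adaptive average pooling + linear layer -> scalar offset Δh.
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, feat):                               # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        f_v = feat.mean(dim=3)                             # Step 1: pool over width -> (B, C, H)
        dh = self.fc(self.pool(self.encoder(f_v)).squeeze(-1))   # (B, 1), offset in pixels (assumed)
        # Step 4: identity sampling grid, with the y coordinate shifted by the normalized offset.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat.device),
            torch.linspace(-1, 1, w, device=feat.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).clone()
        grid[..., 1] = grid[..., 1] + (2.0 * dh / h).view(b, 1, 1)
        # Step 5: bilinear resampling yields the jitter-compensated feature map.
        return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```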
Multimodal Information Prompt Fusion Module (MIPF)
The module adopts an asymmetric fusion principle of "geometry-dominant, semantics-supplementary":

1. Each modality is projected into a shared embedding space via a 1×1 convolution.
2. Global average pooling + an MLP over the image modalities generates compact semantic prompts \(\mathbf{p}_m\).
3. The LiDAR BEV features serve as the Query, while the modality prompts serve as Key/Value in prompt attention.
4. Residual modulation: \(\mathbf{F}_f = \tilde{\mathbf{F}}_l + \sigma(\gamma(\mathbf{F}_{attn})) \odot \tilde{\mathbf{F}}_l\), ensuring that the geometric structure remains the primary representational basis.
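Below is a minimal PyTorch sketch of this prompt-attention scheme. The embedding width of 8 and the 8 attention heads mirror the reported \(C_{pd}\) and \(C_{nh}\); the module name, the channel bookkeeping, and the simplification of gating the raw LiDAR BEV feature rather than its projection \(\tilde{\mathbf{F}}_l\) are my assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PromptFusion(nn.Module):
    """Hedged sketch of MIPF: LiDAR BEV features attend to compact prompts from image modalities."""
    def __init__(self, lidar_ch, img_chs, embed_dim=8, num_heads=8):
        super().__init__()
        # Step 1: project every modality into a shared embedding space (1x1 convs).
        self.lidar_proj = nn.Conv2d(lidar_ch, embed_dim, kernel_size=1)
        self.img_projs = nn.ModuleList(nn.Conv2d(c, embed_dim, kernel_size=1) for c in img_chs)
        # Step 2: MLP that turns globally pooled image features into semantic prompts.
        self.prompt_mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim))
        # Step 3: prompt attention (LiDAR cells = Query, modality prompts = Key/Value).
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Step 4: gamma projection for the residual modulation gate.
        self.gamma = nn.Conv2d(embed_dim, lidar_ch, kernel_size=1)

    def forward(self, lidar_bev, img_bevs):        # (B, C_l, H, W), list of (B, C_m, H, W)
        b, _, h, w = lidar_bev.shape
        q = self.lidar_proj(lidar_bev)             # (B, D, H, W)
        prompts = torch.stack(
            [self.prompt_mlp(proj(x).mean(dim=(2, 3))) for proj, x in zip(self.img_projs, img_bevs)],
            dim=1)                                 # (B, M, D): one prompt per image modality
        q_seq = q.flatten(2).transpose(1, 2)       # (B, H*W, D)
        attn_out, _ = self.attn(q_seq, prompts, prompts)
        attn_map = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        # Residual modulation keeps LiDAR geometry as the primary signal.
        return lidar_bev + torch.sigmoid(self.gamma(attn_map)) * lidar_bev
```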
Loss & Training
The training objective combines a cross-entropy loss, a Lovász-Softmax loss, and geometric/semantic affinity losses.
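As a rough illustration of how such terms might be combined, the sketch below pairs cross-entropy with a per-class soft-Jaccard surrogate standing in for the Lovász-Softmax term; the affinity losses are omitted, and the function names, loss weights, and ignore index are assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def soft_jaccard_loss(logits, target, ignore_index=255, eps=1e-6):
    """Simplified soft-IoU surrogate (stand-in for Lovász-Softmax).
    logits: (B, K, X, Y, Z) class scores; target: (B, X, Y, Z) voxel labels."""
    probs = logits.softmax(dim=1)
    valid = target != ignore_index
    num_classes = logits.shape[1]
    loss = logits.new_zeros(())
    for c in range(num_classes):
        p = probs[:, c][valid]
        t = (target[valid] == c).float()
        inter = (p * t).sum()
        union = p.sum() + t.sum() - inter
        loss = loss + (1.0 - (inter + eps) / (union + eps))
    return loss / num_classes

def occupancy_loss(logits, target, w_ce=1.0, w_iou=1.0):
    """Weighted sum of cross-entropy and the soft-Jaccard surrogate (affinity terms omitted)."""
    ce = F.cross_entropy(logits, target, ignore_index=255)
    return w_ce * ce + w_iou * soft_jaccard_loss(logits, target)
```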
Key Experimental Results
Dataset: PanoMMOcc
- Unitree Go2 quadruped robot equipped with a panoramic camera (360°×70° FoV, 2048×2048), MID360 LiDAR, thermal camera (640×512), and polarization camera (1224×1024).
- 54 sequences @10Hz, 40 seconds per sequence, totaling 21,600 frames; 42 sequences annotated with 12 semantic classes.
- Voxel space: 64×64×16 voxels at 0.4 m resolution, covering [-12.8, 12.8] m in x/y and [-2.4, 4.0] m in z (see the consistency check after this list).
- 30 sequences for training / 12 for testing, covering six scene types: campus, urban, residential, green space, rural, and forest.
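As a quick consistency check, the sequence and voxel-grid parameters listed above reproduce the stated totals:

\[
54 \times 40\,\mathrm{s} \times 10\,\mathrm{Hz} = 21{,}600\ \text{frames}, \qquad
\frac{12.8 - (-12.8)}{0.4} = 64, \qquad
\frac{4.0 - (-2.4)}{0.4} = 16.
\]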
Main Results (mIoU %)
| Method | Modality (C = camera, L = LiDAR, T = thermal, P = polarization) | mIoU (%) |
|---|---|---|
| MonoScene | C | 8.94 |
| EFFOcc-C | C | 4.47 |
| EFFOcc-L | L | 18.77 |
| EFFOcc-T | C+L | 19.18 |
| C-CONet | C | 3.79 |
| M-CONet | C+L | 4.68 |
| VoxelHound | C | 5.79 |
| VoxelHound | C+T+P | 6.14 |
| VoxelHound | C+L | 22.87 |
| VoxelHound | C+L+T+P | 23.34 |
- Full-modality VoxelHound outperforms the strongest competing method EFFOcc-T by +4.16% mIoU.
- Compared to camera-only MonoScene: +14.40% mIoU.
- Thermal and polarization modalities yield notable nighttime improvements: nighttime mIoU increases from 3.52% (C) to 4.07% (C+T+P).
Ablation Study
| VJC | MIPF | mIoU |
|---|---|---|
| ✗ | ✗ | 22.74 |
| ✓ | ✗ | 22.92 |
| ✗ | ✓ | 23.14 |
| ✓ | ✓ | 23.34 |
- VJC alone contributes +0.18 mIoU and MIPF alone +0.40 mIoU over the baseline; combining both adds a further +0.20 over MIPF alone, for a total gain of +0.60.
- Optimal hidden channel dimension for VJC: \(C_{hd}=64\); optimal MIPF settings: \(C_{pd}=8\), \(C_{nh}=8\).
Highlights & Insights
- Pioneering contribution: The first panoramic multimodal occupancy prediction dataset and framework for quadruped robots, filling the gap at the intersection of panoramic occupancy and legged platforms.
- Elegant VJC design: The module estimates vertical offsets via lightweight 1D convolutions for grid-sampling-based compensation, with minimal parameter overhead (+0.04M).
- Asymmetric MIPF fusion: Image modalities are compressed into compact prompts to avoid dense spatial cross-attention; the "geometry-dominant, semantics-supplementary" design is well aligned with sensor characteristics.
- Four-modality sensing: Integrating panoramic RGB, thermal, polarization, and LiDAR—four complementary modalities—with polarization imaging introduced to occupancy prediction for the first time.
Limitations & Future Work
- Low absolute performance: The best mIoU of 23.34% remains limited, with some categories completely undetected (bicycle=0.00%, pedestrian=0.00%), indicating severe deficiency in small-object perception.
- Marginal VJC gain: VJC alone contributes only +0.18%, suggesting limited compensation for gait-induced jitter; more sophisticated temporal modeling may be required.
- Small dataset scale: The 21.6k frames are considerably fewer than in driving datasets such as nuScenes, and generalization capability remains to be validated.
- Evaluation on proprietary dataset only: VoxelHound is not validated on existing public occupancy benchmarks, leaving cross-dataset generalization unknown.
- Nighttime degradation with full modality: Nighttime C+L+T+P (18.68%) underperforms C+L (19.17%), suggesting that thermal and polarization modalities introduce noise under certain conditions.
Related Work & Insights
- vs. MonoScene / EFFOcc / CONet: These methods are designed for autonomous driving with pinhole multi-camera setups, and are ill-suited for panoramic imaging on quadruped platforms. VoxelHound substantially outperforms them in the quadruped scenario under full-modality fusion.
- vs. QuadOcc: QuadOcc is also a panoramic occupancy dataset for quadruped platforms, but is limited to RGB-only, 6 semantic classes, and 64×64×8 voxels. PanoMMOcc extends to four modalities, 12 classes, and 64×64×16 voxels.
- Existing multimodal fusion: Most approaches apply simple concatenation or symmetric fusion of Camera+LiDAR. The asymmetric prompt attention in MIPF more effectively exploits cross-modal complementarity.
- Panoramic perception: Prior work in this domain primarily addresses 2D semantic segmentation or BEV mapping. This paper is the first to extend panoramic vision, combined with multimodal sensing, to 3D occupancy prediction.
Rating
- Novelty: ⭐⭐⭐⭐ (First quadruped panoramic multimodal occupancy dataset and framework; VJC and MIPF designs are well-motivated)
- Experimental Thoroughness: ⭐⭐⭐ (Complete ablations, but low absolute performance and absence of cross-dataset experiments)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure with rich figures and tables)
- Value: ⭐⭐⭐⭐ (Opens a new direction for panoramic occupancy on quadruped robots; dataset and benchmark hold long-term research value)