Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots¶

Conference: CVPR2025
arXiv: 2603.13108
Code: PanoMMOcc (To be open-sourced)
Area: Autonomous Driving
Keywords: occupancy prediction, panoramic perception, multimodal fusion, quadruped robot, 3D scene understanding

TL;DR¶

VoxelHound is the first panoramic multimodal semantic occupancy prediction framework designed for quadruped robots. The paper also introduces the PanoMMOcc dataset (panoramic RGB + thermal + polarization + LiDAR), on which VoxelHound achieves 23.34% mIoU using its Vertical Jitter Compensation (VJC) and Multimodal Information Prompt Fusion (MIPF) modules.

Background & Motivation¶

Limitations of Prior Work¶

1. Panoramic cameras provide 360° blind-spot-free visual coverage, which is essential for mobile robot perception, but existing occupancy prediction methods are mainly designed for wheeled autonomous driving platforms.
2. Compared to wheeled platforms, quadruped robots face unique challenges such as low sensor viewpoints, frequent self-occlusions, and strong self-motion induced by gait.
3. Relying solely on the RGB modality lacks robustness under varying lighting conditions, in low-texture areas, and in long-range perception scenarios.
4. Existing panoramic datasets mainly focus on 2D vision tasks (detection/segmentation) and lack 3D occupancy annotations.
5. Existing occupancy benchmark datasets (SemanticKITTI, Occ3D, etc.) are designed for autonomous driving and do not consider panoramic imaging or quadruped platforms.
6. No real-world quadruped robot dataset simultaneously contains panoramic, thermal, polarization, and LiDAR modalities.

Method¶

Overall Architecture¶

VoxelHound is a panoramic multimodal semantic occupancy prediction framework. Given panoramic RGB images, thermal images, polarization images, and LiDAR point clouds, features are extracted via their respective encoders, unified and projected into the BEV space for cross-modal fusion, and finally, a 3D semantic occupancy prediction \(\mathbf{O} \in \mathbb{R}^{X \times Y \times Z}\) is generated by an occupancy head that restores the height dimension.

Key Designs¶

1. Multimodal Fusion Network
- Camera Branch: Three image modalities (panoramic RGB, thermal, polarization) extract multi-scale features (1/8, 1/16, 1/32) through independent 2D backbones, which are aggregated by an FPN and projected to the BEV space via 2D-to-BEV view transformation.
- LiDAR Branch: Point clouds are voxelized (averaging up to 10 points per voxel), hierarchical geometric features are extracted through a sparse 3D convolutional encoder (stride 8), and splatted onto the BEV space.
- Fusion Branch: Multimodal BEV features are aggregated through a fusion module, refined by a SECOND-FPN BEV encoder, and the occupancy head reshapes the channel dimension into vertical bins.
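
The last step of the fusion branch, reshaping channels into vertical bins, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the channel width `C`, the random weights, and the choice of a single 1x1 convolution as the head are assumptions made for illustration. The grid size (64×64×16) and class count (12) are taken from the dataset statistics.

```python
import numpy as np

# Grid size and class count follow the dataset table; everything else is assumed.
X, Y, Z, NUM_CLASSES = 64, 64, 16, 12

def occupancy_head(bev_feat, weight, bias):
    """Map BEV features (C, X, Y) to per-voxel class labels (X, Y, Z).

    A 1x1 convolution (a linear map over channels at every BEV cell)
    produces Z * NUM_CLASSES channels, which are reshaped into vertical bins.
    """
    _, x, y = bev_feat.shape
    # 1x1 conv: the same linear map applied independently at each BEV cell.
    logits = np.einsum("oc,cxy->oxy", weight, bev_feat) + bias[:, None, None]
    # Split the channel axis into (vertical bin, class).
    logits = logits.reshape(Z, NUM_CLASSES, x, y)
    # Reorder to (X, Y, Z, class) and pick the most likely class per voxel.
    return logits.transpose(2, 3, 0, 1).argmax(axis=-1)

rng = np.random.default_rng(0)
C = 32  # hypothetical BEV channel width
bev = rng.standard_normal((C, X, Y))
w = rng.standard_normal((Z * NUM_CLASSES, C)) * 0.1
b = np.zeros(Z * NUM_CLASSES)

occ = occupancy_head(bev, w, b)
print(occ.shape)  # (64, 64, 16)
```

Because the height axis is recovered only at the end, all fusion and refinement can run on 2D BEV maps, which is cheaper than operating on a dense 3D volume.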

2. Vertical Jitter Compensation (VJC) Module
- Addresses vertical image jitter caused by quadruped gait.
- Applies average pooling along the width dimension to obtain vertical structural information \(\mathbf{F}_v \in \mathbb{R}^{C \times H}\).
- Employs two layers of Conv1D + ReLU encoding, followed by adaptive average pooling and a linear layer to estimate the global vertical offset \(\Delta h\).
- Constructs an offset sampling grid \(\mathcal{G}(h,w) = \mathcal{G}_0(h,w) + (0, \Delta h)\) and performs bilinear grid sampling compensation.
- Inserted between the image encoder and BEV view transformation, being lightweight and plug-and-play.
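
The VJC steps above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the paper's code: the kernel size, hidden width, random weights, and border clamping for out-of-range rows are all hypothetical choices. Since the offset is purely vertical, bilinear grid sampling reduces to linear interpolation along the height axis, which is what `shift_rows` does.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1D cross-correlation with zero 'same' padding.
    x: (C_in, H), w: (C_out, C_in, K) with odd K, b: (C_out,)."""
    c_out, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    h = x.shape[1]
    out = np.zeros((c_out, h))
    for i in range(k):
        out += w[:, :, i] @ xp[:, i:i + h]
    return out + b[:, None]

def estimate_offset(feat, params):
    """Estimate one global vertical offset (in pixels) from feat (C, H, W)."""
    f_v = feat.mean(axis=2)  # width-wise average pooling -> (C, H)
    h1 = np.maximum(conv1d_same(f_v, params["w1"], params["b1"]), 0)
    h2 = np.maximum(conv1d_same(h1, params["w2"], params["b2"]), 0)
    pooled = h2.mean(axis=1)  # adaptive average pooling to length 1
    return float(params["w_fc"] @ pooled + params["b_fc"])

def shift_rows(feat, delta_h):
    """Resample feat at rows h + delta_h using linear interpolation.
    Rows outside the image are clamped to the border (an assumption)."""
    _, h, _ = feat.shape
    src = np.clip(np.arange(h) + delta_h, 0, h - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, h - 1)
    frac = (src - lo)[None, :, None]
    return (1 - frac) * feat[:, lo, :] + frac * feat[:, hi, :]

rng = np.random.default_rng(0)
C, H, W, HID, K = 8, 16, 32, 4, 3  # hypothetical sizes
params = {
    "w1": rng.standard_normal((HID, C, K)) * 0.1, "b1": np.zeros(HID),
    "w2": rng.standard_normal((HID, HID, K)) * 0.1, "b2": np.zeros(HID),
    "w_fc": rng.standard_normal(HID) * 0.1, "b_fc": 0.0,
}
feat = rng.standard_normal((C, H, W))

delta_h = estimate_offset(feat, params)
compensated = shift_rows(feat, delta_h)
print(compensated.shape)  # (8, 16, 32)
```

In a trained model the offset regressor would be learned end to end with the rest of the network; here the weights are random, so only the data flow is meaningful.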

3. Multimodal Information Prompt Fusion (MIPF) Module
- Follows an asymmetric fusion principle: geometry-dominated and semantic-supplemented.
- Projects the features of each modality to a shared embedding space using 1x1 convolutions.
- Compresses each image modality into a modality-level semantic prompt \(\mathbf{p}_m\) via GAP + MLP.
- Uses LiDAR BEV features as queries and the prompt stack as keys/values to perform geometry-guided attention.
- Applies residual modulation, \(\mathbf{F}_f = \tilde{\mathbf{F}}_l + \sigma(\gamma(\mathbf{F}_{attn})) \odot \tilde{\mathbf{F}}_l\), ensuring that the geometric structure is not overwritten.
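
A minimal NumPy sketch of the MIPF data flow described above is given here. It is an illustration only: the feature sizes, the random weights, the separate query/key/value projections, the scaled dot-product form of the attention, and the choice of a 1x1 convolution for \(\gamma\) are assumptions, since the summary does not specify them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(feat, w):
    """1x1 convolution as a per-cell linear map. feat: (C_in, X, Y), w: (D, C_in)."""
    return np.einsum("dc,cxy->dxy", w, feat)

def make_prompt(feat, w1, w2):
    """Global average pooling followed by a two-layer MLP -> prompt (D,)."""
    pooled = feat.mean(axis=(1, 2))
    return w2 @ np.maximum(w1 @ pooled, 0)

def mipf(lidar_bev, image_bevs, p):
    f_l = conv1x1(lidar_bev, p["proj_l"])  # projected LiDAR features (D, X, Y)
    prompts = np.stack([
        make_prompt(conv1x1(f, p["proj_img"][i]), p["mlp1"][i], p["mlp2"][i])
        for i, f in enumerate(image_bevs)
    ])  # one prompt per image modality -> (M, D)
    d, x, y = f_l.shape
    q = f_l.reshape(d, -1).T @ p["wq"].T  # LiDAR cells as queries (X*Y, D)
    k = prompts @ p["wk"].T               # prompts as keys (M, D)
    v = prompts @ p["wv"].T               # prompts as values (M, D)
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    f_attn = (attn @ v).T.reshape(d, x, y)
    gate = sigmoid(conv1x1(f_attn, p["gamma"]))
    # Residual modulation: the image prompts only rescale the LiDAR features.
    return f_l + gate * f_l, attn

rng = np.random.default_rng(0)
D, X, Y, C_L, C_I, M = 16, 8, 8, 24, 12, 3  # hypothetical sizes; M image modalities
p = {
    "proj_l": rng.standard_normal((D, C_L)) * 0.1,
    "proj_img": rng.standard_normal((M, D, C_I)) * 0.1,
    "mlp1": rng.standard_normal((M, D, D)) * 0.1,
    "mlp2": rng.standard_normal((M, D, D)) * 0.1,
    "wq": rng.standard_normal((D, D)) * 0.1,
    "wk": rng.standard_normal((D, D)) * 0.1,
    "wv": rng.standard_normal((D, D)) * 0.1,
    "gamma": rng.standard_normal((D, D)) * 0.1,
}
lidar_bev = rng.standard_normal((C_L, X, Y))
image_bevs = [rng.standard_normal((C_I, X, Y)) for _ in range(M)]

fused, attn = mipf(lidar_bev, image_bevs, p)
print(fused.shape, attn.shape)  # (16, 8, 8) (64, 3)
```

Two properties follow from this structure. The attention matrix has only M columns (one per image modality), so its cost grows with the number of BEV cells rather than with their square. And because the sigmoid gate lies in (0, 1), each fused value keeps the sign of the LiDAR feature and has between one and two times its magnitude, which is one way to read "the geometric structure is not overwritten".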

Loss & Training¶

Standard semantic occupancy prediction losses (Cross-Entropy + Lovász-softmax) are utilized, following the settings of the EFFOcc framework.
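
For readers unfamiliar with this loss combination, the sketch below computes cross-entropy plus Lovász-softmax on flattened voxel predictions in NumPy. The equal weighting of the two terms, the averaging over only the classes present in the labels, and the toy inputs are assumptions; the actual EFFOcc settings are not given in this summary.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class. probs: (N, K), labels: (N,)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss,
    given ground-truth indicators sorted by decreasing error."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels):
    """Lovász-softmax averaged over the classes present in labels."""
    losses = []
    for c in np.unique(labels):
        fg = (labels == c).astype(float)   # 1 where the voxel belongs to class c
        errors = np.abs(fg - probs[:, c])  # per-voxel prediction error for class c
        order = np.argsort(-errors)        # sort by decreasing error
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
num_voxels, num_classes = 200, 12  # toy sizes; 12 classes as in the dataset
logits = rng.standard_normal((num_voxels, num_classes))
labels = rng.integers(0, num_classes, size=num_voxels)

probs = softmax(logits)
total = cross_entropy(probs, labels) + lovasz_softmax(probs, labels)
print(total > 0)  # True
```

Cross-entropy scores every voxel independently, while Lovász-softmax is a surrogate for the per-class Jaccard index, so it targets the mIoU metric more directly and is less dominated by frequent classes.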

Key Experimental Results¶

PanoMMOcc Dataset Statistics¶

Attribute	Value
No. of Sequences	54 (42 annotated)
Total Frames	21,600
Modalities	Panoramic RGB + LiDAR + Thermal + Polarized
Voxel Resolution	64×64×16 voxels, 0.4 m per voxel edge
No. of Classes	12
Platform	Unitree Go2 Quadruped Robot

Main Results (mIoU on PanoMMOcc)¶

Method	Modality	mIoU
MonoScene	C	8.94
EFFOcc-C	C	4.47
EFFOcc-L	L	18.77
EFFOcc-T	C+L	19.18
VoxelHound	C	5.79
VoxelHound	C+T+P	6.14
VoxelHound	C+L	22.87
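VoxelHound	C+L+T+P	23.34 (+4.16 points over EFFOcc-T)

Modality codes: C = panoramic RGB camera, L = LiDAR, T = thermal, P = polarization. All mIoU values are percentages.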
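
As a reminder of how the metric in this table is defined, the following NumPy sketch computes mIoU from predicted and ground-truth voxel labels via a confusion matrix. The `ignore_index` argument and the choice to skip classes that appear in neither prediction nor ground truth are assumptions; the exact evaluation protocol of PanoMMOcc is not described in this summary.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=None):
    """Return (mIoU, per-class IoU) for integer label arrays of equal shape.
    Classes absent from both pred and target get NaN and are excluded."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_index is not None:
        keep = target != ignore_index
        pred, target = pred[keep], target[keep]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # TP + FP + FN
    iou = np.full(num_classes, np.nan)
    valid = union > 0
    iou[valid] = tp[valid] / union[valid]
    return float(np.nanmean(iou)), iou

target = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
miou, per_class = mean_iou(pred, target, num_classes=2)
print(round(miou, 4))  # 0.5833
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, giving a mean of 7/12. Because every class contributes equally, classes with near-zero IoU pull the average down sharply regardless of how few voxels they cover.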

Key Findings¶

Quad-modal fusion (C+L+T+P) improves over dual-modal fusion (C+L) by 0.47 mIoU (22.87 → 23.34), with larger gains in specific classes (terrain +2.13, manmade +2.18).
LiDAR is the key modality for occupancy prediction: LiDAR-only EFFOcc-L reaches 18.77 mIoU, whereas image-only methods stay between 4.47 and 8.94 mIoU in panoramic scenarios.
VJC effectively mitigates BEV feature misalignment caused by gait.
The asymmetric design of MIPF outperforms simple concatenation, addition, or symmetric attention.

Highlights & Insights¶

Pioneering Work: The first panoramic multimodal occupancy dataset and framework designed for quadruped robots, filling a gap in embodied AI occupancy prediction.
Complementary Quad-modal Design: Thermal imaging enhances robustness in low light, polarization reveals material cues, LiDAR provides precise geometry, and panoramic RGB offers rich semantics.
Lightweight and Practical VJC Module: Employs 1D convolutions to estimate global vertical offsets, compensating for gait jitter without requiring an explicit IMU.
Asymmetric Fusion in MIPF: Anchors on LiDAR geometry and utilizes a prompt mechanism to avoid the high computational cost of dense cross-modal attention.

Limitations & Future Work¶

The absolute mIoU is low (23.34%), with predictions for "pedestrian" and "pillar" classes close to 0, indicating a severe long-tail issue.
The scale of the dataset is small (21.6K frames), and scene diversity is limited.
The impact of panoramic ERP distortion on the quality of 2D-to-BEV transformation has not been deeply analyzed.
The low viewpoint of the quadruped robot limits long-range perception, and the bottleneck of perception range is not discussed.

Related Work & Insights¶

Complements vehicle-based occupancy datasets such as SemanticKITTI and SurroundOcc, providing a testing platform for the embodied AI community.
The concept of Vertical Jitter Compensation (VJC) can be extended to other unstable platforms like UAVs and underwater robots.
The prompt fusion paradigm of MIPF can be applied to other tasks requiring asymmetric modal fusion (e.g., RGB-Thermal segmentation).
Demonstrates the unique challenges of quadruped robot perception, laying the foundation for subsequent research in this direction.

Rating¶

Novelty: ⭐⭐⭐⭐ (First quadruped panoramic multimodal occupancy dataset + framework)
Experimental Thoroughness: ⭐⭐⭐ (Limited dataset scale, ablations could be more detailed)
Writing Quality: ⭐⭐⭐⭐ (Clear structure, detailed dataset description)
Value: ⭐⭐⭐⭐ (Fills a gap in embodied AI occupancy prediction)