
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion

Conference: CVPR 2026 arXiv: 2601.16788 Code: None Area: Semantic Segmentation Keywords: Panoramic semantic segmentation, depth representation, multimodal fusion, cylindrical coordinates, RGB-D

TL;DR

This paper proposes REL, a three-channel depth representation based on cylindrical coordinates (Rectified Depth + EGVIA + LOA), and a Spherical Multi-Modal Fusion module (SMMF) for panoramic semantic segmentation. The approach achieves 63.06% average mIoU on Stanford2D3D (a 2.35% gain over the HHA baseline) and reduces performance variance under 3D perturbations by approximately 70%.

Background & Motivation

Background: Panoramic semantic segmentation (PASS) aims at full scene understanding from a 360°×180° ultra-wide field of view and is widely applied in autonomous driving, AR/VR, and related domains. Dominant approaches adopt Equirectangular Projection (ERP) to convert spherical data into 2D images for processing.

Limitations of Prior Work: The HHA representation widely used in RGB-D methods has two critical shortcomings: (a) the second degree of freedom of the surface normal direction (lateral azimuth) is absent, leading to incomplete geometric information; (b) HHA computation relies on camera pose and intrinsics (e.g., focal length), making it inconvenient for processing pure image data.

Key Challenge: ERP-projected panoramic images exhibit large distortion and local feature variation across regions, yet existing multimodal fusion strategies apply the same fusion scheme globally, lacking region-adaptive capability. Moreover, cylindrical unrolling disrupts the continuity of scene structure.

Goal: Design a more complete panoramic depth representation and achieve region-adaptive multimodal fusion to improve the accuracy and robustness of panoramic semantic segmentation.

Key Insight: Exploit the cylindrical geometry inherent to ERP projection to design a three-channel cylindrical-coordinate-based representation (REL), and achieve spherically-aware region-level adaptive fusion by sampling overlapping regions on the lateral surface of the cylinder.
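The overlapping-region sampling on the cylinder's lateral surface can be sketched as follows. This is a minimal NumPy illustration under my own assumptions, not the authors' code: `sample_region` and its signature are hypothetical, and the map is assumed to be an ERP feature grid whose columns are cyclic while rows are not.

```python
import numpy as np

def sample_region(erp, row0, col0, rh, rw):
    """Crop one region from an ERP feature map with horizontal wrap-around.

    Hypothetical helper: columns are taken modulo the image width, so a
    region may span the left/right seam of the panorama (the cylinder is
    closed laterally); rows are clipped, since the vertical axis is not cyclic.
    """
    h, w = erp.shape[:2]
    rows = np.clip(np.arange(row0, row0 + rh), 0, h - 1)
    cols = np.arange(col0, col0 + rw) % w  # wrap across the panorama seam
    return erp[np.ix_(rows, cols)]
```

The modulo on columns is what lets a region straddle the image boundary, restoring the scene continuity that cylindrical unrolling breaks.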

Core Idea: Use cylindrical coordinates \((\rho, \theta, z)\) to fully encode 3D position and surface normal direction (REL), and employ the Spherical Multi-Modal Fusion module (SMMF), a dynamic mixture-of-experts design, to apply different fusion strategies to different regions.

Method

Overall Architecture

REL-SF4PASS consists of two core components: (1) the REL depth representation module, which converts raw depth maps into a three-channel representation; and (2) the SMMF fusion module, which inserts fusion units at multiple stages of the network to perform region-adaptive RGB-REL feature fusion. The overall architecture adopts a dual-branch design that processes RGB and REL inputs separately.

Key Designs

  1. REL Representation (3 channels):

     • Rectified Depth (ReD): planar distance \(\rho = d \cdot \cos(\phi)\), eliminating the influence of the height direction.
     • EGVIA: exploits a strong linear correlation, observed in panoramic images, between the normalized image height \(H\) and the normal-gravity inclination angle \(A\); fuses the two via \(\lambda \cdot A + (1-\lambda) \cdot H\) on horizontal surfaces, and uses \(A\) alone on vertical surfaces.
     • LOA: the angle between the surface normal and the tangent direction \(\hat{T}\), supplying the second degree of freedom of the normal that HHA lacks.
     • REL requires only the raw depth map; no camera intrinsics or extrinsics are needed.

  2. SMMF (Spherical Multi-Modal Fusion):

     • Samples \(m \times n\) overlapping regions on the lateral surface of the cylinder; horizontally, regions may wrap across the left and right boundaries of the panoramic image.
     • Vertically, regions expand symmetrically upward and downward from the equator (\(\phi = 0°\)), ensuring equatorial symmetry.
     • Each region has an independent gate network that determines its fusion strategy (an MoE architecture with \(B\) expert fusion operations).
     • A two-stage soft-to-hard training scheme is adopted: soft-probability training over all experts first, followed by hard one-hot expert selection.

  3. Fusion Early-Stopping Mechanism: if a fusion unit at a given stage selects "no fusion" (RGB only), all subsequent fusion units are forced to adopt the same no-fusion decision, avoiding redundant computation.
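The three REL channels can be sketched as a minimal NumPy routine. This is my own illustration, not the paper's implementation: the row-to-latitude mapping, the horizontality test, and the assumption that the angles \(A\) (EGVIA input) and \(L\) (LOA) arrive precomputed from estimated surface normals are all simplifications.

```python
import numpy as np

def rel_channels(depth, A, L, lam=0.5):
    """Sketch of the REL three-channel depth representation (hypothetical API).

    depth : (H, W) ERP depth map, metric distance along the viewing ray.
    A     : (H, W) normal-gravity inclination angle in radians, assumed
            precomputed from estimated surface normals.
    L     : (H, W) lateral-offset angle between the normal and the tangent
            direction T-hat (the second normal DoF), also precomputed.
    lam   : blend weight lambda for EGVIA on horizontal surfaces.
    """
    h, w = depth.shape
    # Latitude phi per ERP row: +pi/2 at the top row, -pi/2 at the bottom.
    phi = np.linspace(np.pi / 2, -np.pi / 2, h)[:, None]
    # Channel 1: Rectified Depth, planar distance rho = d * cos(phi).
    red = depth * np.cos(phi)
    # Normalized image height H in [0, 1], top row = 1.
    H_norm = np.linspace(1.0, 0.0, h)[:, None] * np.ones((1, w))
    # Channel 2: EGVIA blends A with H on (near-)horizontal surfaces
    # and falls back to A alone on vertical surfaces.
    horizontal = np.abs(np.cos(A)) > 0.5  # hypothetical horizontality test
    egvia = np.where(horizontal, lam * A + (1 - lam) * H_norm, A)
    # Channel 3: LOA, the lateral-offset angle, used as-is.
    return np.stack([red, egvia, L], axis=-1)
```

Note that nothing here touches camera intrinsics or extrinsics; only the depth map and normal-derived angles are consumed, matching the camera-parameter-free claim.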

Loss & Training

Standard cross-entropy loss for semantic segmentation. Two-stage training: in the soft training stage, expert selection probabilities are non-zero for all experts; in the hard training stage, probabilities are enforced to be one-hot.
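The soft-to-hard switch can be illustrated with a small NumPy sketch. The gate, the expert pool, and all names below are hypothetical simplifications of the paper's per-region MoE, shown only to make the two stages concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(rgb_feat, rel_feat, gate_logits, experts, hard=False):
    """Sketch of one region's mixture-of-experts fusion (hypothetical API).

    rgb_feat, rel_feat : (C,) feature vectors for one region.
    gate_logits        : (B,) logits from that region's gate network.
    experts            : list of B fusion functions f(rgb, rel) -> (C,).
    hard               : False -> soft stage (probability-weighted mixture);
                         True  -> hard stage (one-hot expert selection).
    """
    probs = softmax(gate_logits)
    if hard:
        # Hard stage: enforce a one-hot choice of a single expert.
        onehot = np.zeros_like(probs)
        onehot[np.argmax(probs)] = 1.0
        probs = onehot
    outputs = np.stack([f(rgb_feat, rel_feat) for f in experts])  # (B, C)
    return (probs[:, None] * outputs).sum(axis=0)

# Example expert pool: "no fusion" (RGB only), addition, and averaging.
experts = [
    lambda r, d: r,
    lambda r, d: r + d,
    lambda r, d: 0.5 * (r + d),
]
```

In the soft stage every expert receives gradient through its nonzero probability; the hard stage then commits each region to its best expert, which is also what enables the early-stopping rule when the "no fusion" expert wins.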

Key Experimental Results

Main Results (Stanford2D3D Panoramic Dataset)

Method               Modality   3-fold Avg. mIoU   Fold1 mIoU
Trans4PASS+ (2024)   RGB        53.7%              53.6%
SGTA4PASS (2023)     RGB        55.3%              56.4%
Twin (2025)          RGB        55.85%             -
SFSS (2024)          RGB-HHA    60.60%             -
CMX* (reproduced)    RGB-HHA    60.71%             63.98%
REL-SF4PASS (Ours)   RGB-REL    63.06%             67.37%

Robustness under 3D Perturbations (SGA Evaluation, 16 Rotation Configurations)

Representation      mIoU Range       Performance Variance
HHA + SMMF          59.13%–65.85%    High variance
REL + SMMF (Ours)   63.39%–67.37%    ~70% variance reduction

Key Findings

  • REL outperforms HHA under all 16 3D perturbation configurations, validating the rotational robustness of the cylindrical coordinate representation.
  • Region-level fusion via SMMF is more effective than uniform fusion; regions near the equator are semantically richer and benefit from finer-grained fusion.
  • REL requires no camera parameters, making it easier to generalize across different acquisition devices.
  • RGB-REL achieves 63.06% average mIoU, surpassing all known methods including RGB-HHA.

Highlights & Insights

  • Natural fit of cylindrical coordinates: Since panoramic images are produced by cylindrical projection, representing depth information in cylindrical coordinates is geometrically self-consistent.
  • Discovery of the plane-normal relationship: The statistical finding that \(H\) and \(A\) are highly correlated in panoramic images is the key insight motivating the design of EGVIA.
  • Camera-parameter-free: REL can be computed from the depth map alone, greatly simplifying the data processing pipeline.
  • 70% variance reduction: The substantial improvement in robustness to 3D perturbations carries practical engineering value.

Limitations & Future Work

  • Validation is limited to the single Stanford2D3D dataset, with no evaluation on outdoor scenes.
  • The number of regions \(m \times n\) in SMMF requires manual specification; adaptive determination may yield better results.
  • The method has not been combined with recent large-scale pretrained backbones such as DINOv2.
  • Whether the information gain of the LOA channel remains stable in near-vertical surface scenarios warrants further analysis.
Related Work

  • HHA representation is a classic design for RGB-D segmentation (Gupta et al.); REL serves as a direct, improved replacement.
  • CMX's cross-modal feature rectification and fusion module inspired the design of SMMF.
  • DynMM's sample-adaptive fusion is extended by SMMF to the region level, achieving finer granularity.

Rating

  • Novelty: ⭐⭐⭐⭐ (REL representation is rigorously designed; the cylindrical coordinate formulation is geometrically natural)
  • Experimental Thoroughness: ⭐⭐⭐ (single dataset, but robustness analysis is thorough)
  • Writing Quality: ⭐⭐⭐⭐ (mathematical derivations are clear; physical interpretations are well explained)
  • Value: ⭐⭐⭐⭐ (practical contribution to panoramic segmentation; directly substitutable for HHA)