JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas¶
Conference: CVPR 2026 arXiv: 2603.06168 Code: None Area: 3D Vision Keywords: Open-vocabulary 3D segmentation, joint point cloud–panorama segmentation, icosahedral tangential decomposition, SAM+CLIP semantic alignment, 3D-to-panorama back-projection
TL;DR¶
This paper proposes JOPP-3D — the first open-vocabulary semantic segmentation framework that jointly processes 3D point clouds and panoramic images. It decomposes panoramas into 20 perspective views via icosahedral tangential projection to accommodate SAM/CLIP, extracts mask-isolated instance-level CLIP embeddings for 3D semantic segmentation, and back-projects results to the panoramic domain via depth correspondence. Without any training, the method achieves 80.9% mIoU on S3DIS, surpassing all supervised approaches.
Background & Motivation¶
Background: 3D semantic segmentation relies on large-scale annotations and fixed category sets. Vision-language models such as CLIP excel at open-vocabulary 2D segmentation, but perform poorly when applied directly to panoramas (spherical distortion) and 3D point clouds (lack of pre-training).
Limitations of Prior Work:
- Spherical distortion in panoramas prevents foundation models pre-trained on perspective images (e.g., CLIP, SAM) from being directly applicable.
- Cubemap-based approaches (6 faces × 90°) introduce boundary discontinuity artifacts; DAN-based adapters require supervised training.
- Cross-modal alignment of 2D vision-language features to 3D is difficult — direct per-point CLIP encoding introduces substantial semantic noise.
- Joint open-vocabulary semantic segmentation of panoramas and point clouds has not been explored.
Key Challenge: Extending the capabilities of CLIP/SAM to both panoramas and 3D point clouds in a training-free manner is non-trivial, as each modality presents distinct geometric challenges.
Goal: To establish a unified framework for open-vocabulary semantic segmentation of both point clouds and panoramas simultaneously.
Key Insight: Panoramas are projected onto the 20 tangent faces of a regular icosahedron to generate perspective views compatible with CLIP/SAM. 3D point clouds are reconstructed from these views, semantic alignment is performed at the 3D instance level, and results are back-projected to the panoramic domain.
Core Idea: Tangential decomposition → 3D instance extraction → mask-isolated CLIP semantic alignment → depth-correspondence panoramic back-projection.
Method¶
Overall Architecture¶
A three-stage, training-free pipeline:

1. **Tangential Decomposition**: each panoramic RGB-D image is projected onto the 20 faces of a regular icosahedron, yielding 20 tangential perspective views (640×480, FOV = 100°) with corresponding depth maps; the 3D points from all views are aggregated and voxelized into a global point cloud.
2. **3D Instance Extraction + Semantic Alignment**: Mask3D (weakly supervised) or SAM3D (unsupervised) generates 3D instance proposals; each instance is projected onto its \(K\) best tangential views, SAM generates 2D mask crops, CLIP encodes the masked crops, and the multi-view features are averaged into the instance semantic embedding.
3. **Language Query + 3D-to-Panorama Back-Projection**: natural language queries produce the 3D semantic segmentation, which is back-projected to the panoramic domain via depth correspondence.
Key Designs¶
**Icosahedral Tangential Decomposition**
- Panoramas are projected onto the 20 faces of a regular icosahedron, with each face having a FOV of 100° — exceeding the 73.1° of Eder et al. and the 90° of Cubemap approaches.
- Overlapping fields of view between adjacent faces eliminate the boundary discontinuity artifacts of Cubemap.
- Ray directions for each pixel are computed via face rotation matrices, mapped to equirectangular coordinates, and sampled with bilinear interpolation for RGB and nearest-neighbor interpolation for depth.
- Focal length is determined by the horizontal field of view, maximizing contextual coverage within a geometrically stable range.
- Local 3D point clouds are reconstructed from all 20 tangent faces and aggregated across all panoramas, then voxelized into a global reconstruction.
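To make the sampling step concrete, below is a minimal sketch of extracting one tangential view from an equirectangular panorama. The face rotation matrix `R_face`, the 640×480 resolution, and the 100° FOV follow the paper's stated setup; the function name and the exact angle conventions are illustrative assumptions, since no code is released.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sample_tangent_view(pano_rgb, pano_depth, R_face, W=640, H=480, fov_deg=100.0):
    """Sample one perspective tangent view from an equirectangular panorama.

    pano_rgb:   (Hp, Wp, 3) float array, equirectangular RGB.
    pano_depth: (Hp, Wp)    float array, equirectangular depth.
    R_face:     (3, 3) rotation aligning the camera z-axis with one
                icosahedron face center (assumed given, as in the paper).
    """
    Hp, Wp = pano_depth.shape
    # Focal length determined by the horizontal FOV (100 deg in the paper).
    f = (W / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)

    # Per-pixel ray directions in the camera frame (pinhole model).
    u, v = np.meshgrid(np.arange(W) - W / 2.0 + 0.5,
                       np.arange(H) - H / 2.0 + 0.5)
    rays = np.stack([u, v, np.full_like(u, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate rays into the panorama frame via the face rotation matrix.
    rays = rays @ R_face.T

    # Spherical angles -> equirectangular pixel coordinates.
    lon = np.arctan2(rays[..., 0], rays[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    px = (lon / (2 * np.pi) + 0.5) * Wp
    py = (lat / np.pi + 0.5) * Hp

    coords = np.stack([py.ravel(), px.ravel()])
    # Bilinear interpolation for RGB, nearest-neighbor for depth (per the paper).
    rgb = np.stack([map_coordinates(pano_rgb[..., c], coords, order=1, mode='wrap')
                    for c in range(3)], axis=-1).reshape(H, W, 3)
    depth = map_coordinates(pano_depth, coords, order=0, mode='wrap').reshape(H, W)
    return rgb, depth
```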
**Mask-Isolated Instance-Level CLIP Encoding**
- For each 3D instance, projections onto all tangential views are computed and the \(K\) views with the most projected points are selected.
- SAM is prompted with projected points to generate 2D instance masks and crops.
- Masking precedes CLIP encoding: the mask is applied to each crop before it is fed to CLIP, and the normalized feature vectors from the \(K\) views are averaged to obtain the instance semantic embedding.
- Masking is confirmed as a critical design choice through ablation: without it, semantics from large-area classes (floor/ceiling) severely contaminate other instances, causing Open mIoU to collapse from 74.6% to 33.6%.
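A minimal sketch of the mask-isolated encoding step, assuming the public `segment_anything` and OpenAI `clip` packages; since the paper releases no code, the model variants, checkpoint path, and helper names below are assumptions, and view selection is taken as already done.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-L/14", device=device)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)  # placeholder path
predictor = SamPredictor(sam)

def instance_embedding(views, point_prompts):
    """Mask-isolated CLIP embedding for one 3D instance.

    views:         list of K RGB uint8 arrays (the K tangent views with the
                   most projected instance points).
    point_prompts: list of (N_i, 2) arrays of projected instance pixels per view.
    """
    feats = []
    for img, pts in zip(views, point_prompts):
        # Prompt SAM with the projected instance points.
        predictor.set_image(img)
        masks, scores, _ = predictor.predict(
            point_coords=pts.astype(np.float32),
            point_labels=np.ones(len(pts)),
        )
        mask = masks[np.argmax(scores)]

        # Crop to the mask's bounding box and zero out background pixels
        # *before* CLIP encoding -- the paper's critical design choice.
        ys, xs = np.where(mask)
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0

        with torch.no_grad():
            x = clip_preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
            f = clip_model.encode_image(x)
            feats.append(f / f.norm(dim=-1, keepdim=True))

    # Average the normalized per-view features into one instance embedding.
    return torch.cat(feats).mean(dim=0)
```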
**Depth-Correspondence 3D-to-Panorama Semantic Back-Projection**
- Each panoramic depth pixel is back-projected into 3D, and its semantic label is retrieved from the semantic point cloud via nearest-neighbor search.
- Cross-scene depth correspondence propagation: when adjacent panoramas share depth overlap in corridor or doorway regions, semantic labels from labeled neighbor panoramas are propagated to unlabeled regions in the current panorama.
- This addresses semantic incompleteness in regions of large depth discontinuity (doorways/corridors) that arise from direct nearest-neighbor lookup.
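A minimal sketch of the nearest-neighbor back-projection step, using a SciPy KD-tree; the distance threshold `max_dist` is an assumed parameter, and the cross-panorama label propagation described above is omitted for brevity.

```python
import numpy as np
from scipy.spatial import cKDTree

def backproject_labels(pano_depth, cam_pose, cloud_xyz, cloud_labels,
                       max_dist=0.05):
    """Back-project each panoramic depth pixel into 3D and fetch its label
    from the labeled semantic point cloud via nearest-neighbor search.

    pano_depth:   (H, W) depth in meters (equirectangular).
    cam_pose:     (4, 4) camera-to-world transform of this panorama.
    cloud_xyz:    (N, 3) points of the global semantic point cloud.
    cloud_labels: (N,)   per-point semantic labels.
    """
    H, W = pano_depth.shape
    # Spherical ray direction per equirectangular pixel.
    lon = (np.arange(W) + 0.5) / W * 2 * np.pi - np.pi
    lat = (np.arange(H) + 0.5) / H * np.pi - np.pi / 2
    lon, lat = np.meshgrid(lon, lat)
    rays = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

    # Back-project to world coordinates.
    pts = rays * pano_depth[..., None]
    pts = pts.reshape(-1, 3) @ cam_pose[:3, :3].T + cam_pose[:3, 3]

    # Nearest-neighbor label lookup; pixels with no nearby 3D point stay
    # unlabeled (-1) -- these are the doorway/corridor gaps the paper fills
    # by propagating labels from overlapping neighbor panoramas.
    dist, idx = cKDTree(cloud_xyz).query(pts, k=1)
    labels = np.where(dist < max_dist, cloud_labels[idx], -1)
    return labels.reshape(H, W)
```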
Loss & Training¶
JOPP-3D is a fully training-free inference pipeline: frozen Mask3D/SAM3D provides 3D instance proposals, frozen SAM performs 2D segmentation, frozen CLIP performs semantic encoding, and natural language queries enable open-vocabulary classification. The weakly supervised variant uses Mask3D pre-trained on S3DIS Areas 1, 2, 3, 4, and 6; the unsupervised variant uses SAM3D. Inference time: 4.8 min/panorama (single RTX A6000); 1.7 seconds per language query.
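The query step itself reduces to cosine similarity between the instance embeddings and CLIP text embeddings of the queried class names. A sketch under the same assumptions as above; the prompt template is illustrative, not the paper's.

```python
import torch
import clip

def query_instances(instance_feats, class_names, clip_model, device="cuda"):
    """Assign each 3D instance the best-matching open-vocabulary class.

    instance_feats: (M, D) normalized instance embeddings (one per 3D instance).
    class_names:    arbitrary natural-language class list, e.g. from a user query.
    """
    prompts = [f"a photo of a {name} in a scene" for name in class_names]
    with torch.no_grad():
        text = clip.tokenize(prompts).to(device)
        tf = clip_model.encode_text(text)
        tf = tf / tf.norm(dim=-1, keepdim=True)
    # Cosine similarity between instance and text embeddings; argmax per instance.
    sims = instance_feats @ tf.T
    return sims.argmax(dim=-1)
```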
Key Experimental Results¶
Main Results¶
**3D Point Cloud Semantic Segmentation**
| Dataset | Method | Supervision | mIoU (%) | mAcc (%) |
|---|---|---|---|---|
| S3DIS | PointTransformerV3 | Fully supervised | 73.4 | 78.9 |
| S3DIS | Concerto | Fully supervised | 77.4 | 85.0 |
| S3DIS | OpenMask3D | Weakly supervised | 36.7 | 43.6 |
| S3DIS | JOPP-3D(u) | Unsupervised | 59.4 | 70.1 |
| S3DIS | JOPP-3D | Weakly supervised | 80.9 | 87.0 |
| ToF-360 | SFSS-MMSI | Unsupervised | 23.2 | 46.3 |
| ToF-360 | JOPP-3D(u) | Unsupervised | 30.9 | 47.5 |
**Panoramic Image Semantic Segmentation**
| Dataset | Method | mIoU (%) | Open mIoU (%) |
|---|---|---|---|
| Stanford-2D-3D-s | PanoSAMic (fully supervised) | 61.7 | -- |
| Stanford-2D-3D-s | OPS (weakly supervised) | 41.1 | 42.6 |
| Stanford-2D-3D-s | SAM3 (unsupervised) | 54.2 | 62.8 |
| Stanford-2D-3D-s | JOPP-3D | 70.1 | 74.6 |
| ToF-360 | HoHoNet | 27.5 | -- |
| ToF-360 | JOPP-3D(u) | 30.7 | 47.4 |
Ablation Study¶
| Configuration | Open mIoU (%) | Δ vs. full (points) |
|---|---|---|
| Full JOPP-3D | 74.6 | -- |
| w/o SAM Mask (direct CLIP without masking) | 33.6 | −41.0 |
| w/o Tangential Decomp. (direct panorama) | 41.4 | −33.2 |
| w/o Depth Correspondence | 67.0 | −7.6 |
Key Findings¶
- Masked CLIP encoding is the single largest contributor: 33.6% → 74.6% (+41.0 points); unmasked CLIP features are severely contaminated by large-area classes.
- Tangential decomposition is indispensable: 41.4% → 74.6% (+33.2 points); CLIP and SAM nearly fail on spherically distorted images.
- Depth correspondence adds +7.6 points, with the largest gains in doorway and corridor regions.
- Mask3D vs. SAM3D: 74.6% (weakly supervised) vs. 59.9% (unsupervised); the quality of 3D instance proposals is the primary performance bottleneck.
- The open-vocabulary approach can retrieve fine-grained objects labeled as "clutter" in ground truth (clocks, posters, etc.), demonstrating practical value.
Highlights & Insights¶
- The first open-vocabulary segmentation framework to jointly handle 3D point clouds and panoramic images, surpassing all supervised methods without any training.
- The icosahedral tangential decomposition is an elegant design: 100° FOV provides better contextual coverage and fewer boundary artifacts than Cubemap.
- The +41.0-point ablation gain from masked CLIP encoding is striking: a simple modification with a substantial effect.
- The concept of using 3D as a consistency "anchor" for 2D is generalizable to video understanding, multi-view consistent segmentation, and related tasks.
Limitations & Future Work¶
- Requires RGB-D panoramic input; scenes with only RGB panoramas cannot be processed.
- The weakly supervised Mask3D variant requires pre-training data; cross-domain generalization (e.g., outdoor scenes) remains unverified.
- Inference speed is slow (4.8 min/image), making real-time application impractical.
- Coarse ground-truth labels such as "clutter" penalize open-vocabulary methods in quantitative evaluation precisely because these methods recognize fine-grained objects that the annotations lump together.
- Validation is limited to indoor scenes; applicability to large-scale outdoor environments has not been explored.
Related Work & Insights¶
- vs. OpenMask3D: Both target open-vocabulary 3D segmentation, but OpenMask3D operates on perspective RGB-D sequences for instance segmentation, whereas this work addresses scene-level semantic segmentation from panoramas and point clouds — 80.9% vs. 36.7% mIoU.
- vs. OPS: OPS requires training a DAN adapter to handle panoramic distortion; this work's training-free tangential decomposition outperforms it (70.1% vs. 41.1%), and OPS does not perform 3D segmentation.
- vs. SAM3: An RGB-only method achieving 54.2% mIoU on panoramas; this work reaches 70.1% by incorporating depth information and 3D alignment.
- Insights: Tangential decomposition combined with foundation models constitutes a general paradigm for panoramic image processing; mask-cropped CLIP instance-level semantic alignment is applicable to any task requiring open-vocabulary instance-level features.
Rating¶
- Novelty: ⭐⭐⭐⭐ First to propose joint open-vocabulary segmentation of point clouds and panoramas; tangential decomposition and depth correspondence are novel designs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two datasets, dual-task (2D+3D) evaluation, four ablation conditions, and extensive qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear framework presentation, high-quality figures and tables, systematic method description.
- Value: ⭐⭐⭐⭐⭐ Training-free approach surpasses supervised methods; tangential decomposition and masked CLIP paradigm are broadly reusable.