# SAS: Segment Any 3D Scene with Integrated 2D Priors
- Conference: ICCV 2025
- arXiv: 2503.08512
- Code: Project Page
- Area: 3D Vision
- Keywords: Open-vocabulary 3D segmentation, multi-model fusion, knowledge distillation, diffusion models, model capability construction
## TL;DR
This paper proposes SAS, the first framework to integrate the complementary capabilities of multiple 2D open-vocabulary models for learning better 3D representations. Model Alignment via Text brings the models' feature spaces into a shared CLIP space, while Annotation-Free Model Capability Construction quantifies each model's per-category recognition capability using diffusion-synthesized images. Together these components guide multi-model feature fusion and 3D distillation, yielding substantial improvements over prior work on ScanNet v2, Matterport3D, and nuScenes.
## Background & Motivation
3D scene understanding is a fundamental task for autonomous driving, virtual reality, and robotic manipulation. Traditional closed-set methods are limited to fixed category sets and cannot recognize unseen classes, motivating growing interest in open-vocabulary 3D understanding.
The dominant approach transfers capabilities of 2D open-vocabulary models (e.g., LSeg, SEEM) to 3D models via feature distillation. However, a fundamental problem exists:
Error propagation from a single teacher model: 2D models make mistakes on certain categories (e.g., SEEM misidentifies "picture" as "wall"), and distillation causes the 3D model to inherit the same errors.
The intuitive solution, multi-model fusion, faces two key challenges:
- Misaligned feature spaces: Different 2D models (e.g., LSeg uses CLIP, SEEM uses its own encoder) have incompatible image-text feature spaces and cannot be directly fused.
- Difficulty in quantifying model capability: Knowing which model performs better on which categories requires test images and annotations, which is impractical in zero-shot settings.
Core Insight: Different 2D models have complementary recognition capabilities across categories. If per-category capability can be quantified for each model, more reliable model features can be selected for each point, correcting single-model misidentifications.
## Method

### Overall Architecture
SAS consists of four stages: (1) aligning the feature spaces of multiple 2D models via text bridging; (2) evaluating per-category model capabilities using diffusion-synthesized images; (3) fusing point features from multiple models guided by the constructed capability estimates; and (4) transferring the fused knowledge to a 3D network via superpoint distillation and temporal ensembling self-distillation.
### Key Designs
- Model Alignment via Text: Different 2D models reside in different feature spaces; the core mechanism uses text as a unified bridge. For each mask output by SEEM, a pre-trained captioner (TAP) first generates a descriptive caption (covering color, shape, etc.), after which nouns in the caption are replaced with SEEM's predicted labels to improve semantic accuracy. A CLIP text encoder then uniformly encodes these captions. This brings LSeg (natively aligned to CLIP) and SEEM (bridged to CLIP via text) into a shared CLIP feature space. Pixel-to-point correspondences map pixel features to 3D points, yielding aligned point features \(F^{2D}_L\) and \(F^{2D}_S\).
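A minimal sketch of the text bridge, assuming `open_clip` with a ViT-B-16 backbone; the caption, the extracted noun, and SEEM's label are passed in as plain strings here (in the paper they come from TAP, a noun parser, and SEEM respectively):

```python
import torch
import open_clip

# CLIP text encoder acting as the shared feature space (ViT-B-16 is an assumption).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

def bridge_caption_to_clip(caption: str, noun: str, seem_label: str) -> torch.Tensor:
    """Swap the caption's noun for SEEM's predicted label, then encode with CLIP.

    `caption` would come from the TAP captioner and `noun`/`seem_label` from a
    noun parser and SEEM; here they are plain-string stand-ins.
    """
    corrected = caption.replace(noun, seem_label)
    with torch.no_grad():
        feat = model.encode_text(tokenizer([corrected]))  # (1, d) CLIP text feature
    return torch.nn.functional.normalize(feat, dim=-1)

# e.g., TAP says "a brown wooden table in a room" but SEEM labels the mask "desk":
# f_mask = bridge_caption_to_clip("a brown wooden table in a room", "table", "desk")
```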
- Annotation-Free Model Capability Construction: This is the most innovative component. The challenge is evaluating model capabilities without test data or annotations. The solution proceeds as follows (a minimal sketch follows the list):
- Stable Diffusion is used to generate \(m\) synthetic images for each category in a predefined vocabulary.
- Cross-attention maps \(M_x\) from the diffusion model localize the target object, and SAM refines them into precise masks serving as pseudo-labels \(\mathbf{M}^{Pseudo}_{i,j}\).
- LSeg and SEEM are applied to the synthetic images to obtain their respective mask predictions.
- mIoU over the synthetic images measures each model's recognition capability per category: \(S^{LSeg}_j = \frac{1}{m}\sum_{i=1}^{m} \mathrm{mIoU}\big(\mathbf{M}^{Pseudo}_{i,j}, \mathbf{M}^{LSeg}_{i,j}\big)\)
- A capability vector for each model is constructed as \(S_L = [S^{LSeg}_1, ..., S^{LSeg}_K]\).
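A minimal NumPy sketch of the capability scoring; image synthesis and SAM refinement are assumed to have already produced the masks, and `pseudo_masks`/`model_masks` are hypothetical containers:

```python
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def capability_vector(pseudo_masks, model_masks):
    """Per-category capability scores S = [S_1, ..., S_K].

    pseudo_masks[j]: the m SAM-refined pseudo-label masks for category j
    model_masks[j]:  the model's predicted masks on the same m synthetic images
    (both assumed precomputed as boolean arrays)
    """
    return np.array([
        np.mean([binary_iou(p, g) for p, g in zip(model_masks[j], pseudo_masks[j])])
        for j in range(len(pseudo_masks))
    ])

# S_L = capability_vector(pseudo, lseg_masks); S_S = capability_vector(pseudo, seem_masks)
```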
- Capability-Guided Feature Fusion (a sketch follows the list):
- CLIP encodes the vocabulary to obtain text features \(F_{text}\).
- Each point's predicted category under LSeg and SEEM is obtained by matching its point features against \(F_{text}\), yielding two candidate categories per point.
- For each model, the sum of its capability scores over the two candidate categories serves as its correctness score; \(\mathcal{P}_{LSeg}\) and \(\mathcal{P}_{SEEM}\) denote these scores.
- Temperature-controlled softmax weighting fuses the features: \(F^{2D}_{fusion} = \frac{\exp(\mathcal{P}_{LSeg}/\tau)}{\exp(\mathcal{P}_{LSeg}/\tau) + \exp(\mathcal{P}_{SEEM}/\tau)} F^{2D}_L + \frac{\exp(\mathcal{P}_{SEEM}/\tau)}{\exp(\mathcal{P}_{LSeg}/\tau) + \exp(\mathcal{P}_{SEEM}/\tau)} F^{2D}_S\)
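The fusion rule is concrete enough to sketch directly. Shapes, variable names, and the temperature value are assumptions; the logic follows the bullets above: predict a category per point per model, sum capability scores over the two candidates, softmax, and blend:

```python
import torch
import torch.nn.functional as F

def fuse_point_features(f_lseg: torch.Tensor, f_seem: torch.Tensor,
                        f_text: torch.Tensor, S_L: torch.Tensor, S_S: torch.Tensor,
                        tau: float = 0.01) -> torch.Tensor:
    """Capability-guided fusion of aligned per-point features.

    f_lseg, f_seem: (N, d) point features from the two aligned 2D models
    f_text:         (K, d) CLIP text features of the vocabulary
    S_L, S_S:       (K,)  per-category capability vectors
    """
    t = F.normalize(f_text, dim=-1)
    c_l = (F.normalize(f_lseg, dim=-1) @ t.T).argmax(dim=-1)  # (N,) LSeg's category
    c_s = (F.normalize(f_seem, dim=-1) @ t.T).argmax(dim=-1)  # (N,) SEEM's category

    # Each model's correctness score: its capability on both candidate categories.
    p_l = S_L[c_l] + S_L[c_s]                                 # (N,)
    p_s = S_S[c_l] + S_S[c_s]                                 # (N,)

    # Temperature-controlled softmax weights, then a convex combination of features.
    w = torch.softmax(torch.stack([p_l, p_s], dim=-1) / tau, dim=-1)  # (N, 2)
    return w[:, :1] * f_lseg + w[:, 1:] * f_seem

# Toy usage: N=1000 points, d=512, K=20 categories.
# fused = fuse_point_features(torch.randn(1000, 512), torch.randn(1000, 512),
#                             torch.randn(20, 512), torch.rand(20), torch.rand(20))
```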
- Superpoint Distillation + Temporal Ensembling Self-Distillation (sketched below):
- Superpoint distillation: Semantically consistent superpoints are extracted, their features averaged, and distillation is performed at the superpoint level, smoothing inconsistent 2D predictions.
- Temporal ensembling self-distillation: An EMA accumulates historical outputs, \(\hat{F}^{3D} \leftarrow \alpha \hat{F}^{3D} + (1-\alpha) F^{3D}\); pseudo-labels derived from the accumulated predictions supervise the current model. This is more stable than GGSD's mean-teacher scheme, where a still-changing student supervises the teacher and risks training collapse.
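A minimal PyTorch sketch of both mechanisms; `alpha`, tensor shapes, and the pseudo-label derivation are illustrative assumptions rather than the paper's exact implementation:

```python
import torch

def superpoint_pool(f_points: torch.Tensor, sp_ids: torch.Tensor, num_sp: int) -> torch.Tensor:
    """Average point features within each superpoint (a scatter-mean).

    f_points: (N, d) per-point features; sp_ids: (N,) long tensor of superpoint indices.
    """
    d = f_points.shape[1]
    sums = torch.zeros(num_sp, d).index_add_(0, sp_ids, f_points)
    counts = torch.zeros(num_sp).index_add_(0, sp_ids, torch.ones(sp_ids.shape[0]))
    return sums / counts.clamp(min=1.0).unsqueeze(1)

def ema_update(f_hist: torch.Tensor, f_cur: torch.Tensor, alpha: float = 0.99) -> torch.Tensor:
    """Temporal ensembling: F_hist <- alpha * F_hist + (1 - alpha) * F_cur."""
    return alpha * f_hist + (1.0 - alpha) * f_cur

# Pseudo-labels for self-distillation are then read off the accumulated features,
# e.g. by matching the EMA features against the vocabulary's text features.
```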
### Loss & Training
Distillation stage:
- Superpoint distillation loss: \(\mathcal{L} = \mathcal{L}_p + \mathcal{L}_{sp}\), computing a cosine similarity loss at both the point and superpoint levels.
- Self-distillation loss: \(\mathcal{L} = \mathcal{L}^{ST}_p + \mathcal{L}^{ST}_{sp}\), using cross-entropy loss.
Training runs for 100 epochs: superpoint distillation only for the first 70 epochs, with self-distillation added for the final 30. Different pre-built vocabularies are used for indoor and outdoor scenes. For nuScenes, where point clouds are predominantly road-centric, superpoints are not used (each point is treated as its own superpoint).
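A sketch of the two losses and the 70/30 schedule, under assumed shapes; the paper's exact weighting and pseudo-label construction may differ:

```python
import torch
import torch.nn.functional as F

def cosine_distill_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Distillation loss 1 - cos(student, teacher), averaged.

    Applied at the point level (L_p) and, on pooled features, at the
    superpoint level (L_sp); the stage-1 loss is their sum.
    """
    return (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()

def self_distill_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Self-distillation as cross-entropy against EMA-derived pseudo-labels."""
    return F.cross_entropy(logits, pseudo_labels)

# Schedule: epochs 0-69 use only cosine_distill_loss; epochs 70-99 add
# self_distill_loss on top (the 70/30 split described above).
f_student, f_teacher = torch.randn(1024, 512), torch.randn(1024, 512)  # toy (N, d)
stage1_loss = cosine_distill_loss(f_student, f_teacher)
```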
## Key Experimental Results

### Main Results
| Dataset | Ours (mIoU) | Prev. SOTA (mIoU) | Gain |
|---|---|---|---|
| ScanNet v2 | 61.9 | 57.3 (OV3D) | +4.6 |
| Matterport3D | 48.6 | 45.8 (OV3D) | +2.8 |
| nuScenes | 47.5 | 45.0 (Seal; OV3D: 44.6) | +2.5 |
SAS achieves state-of-the-art performance among zero-shot methods on all three datasets. The mIoU of 61.9 on ScanNet v2 approaches that of some older fully supervised methods (e.g., PointConv at 61.0).
### Ablation Study
| Configuration | ScanNet v2 (mIoU) | Matterport3D (mIoU) | nuScenes (mIoU) | Note |
|---|---|---|---|---|
| 2D LSeg features | 51.2 | 38.6 | - | LSeg only |
| 2D SEEM features | 47.3 | 40.2 | 37.8 | SEEM only |
| 2D direct addition | 48.6 | 39.1 | 34.1 | Naive fusion ineffective |
| 2D linear fusion | 49.9 | 39.4 | 34.8 | Limited gain |
| 2D proposed fusion | 55.5 | 43.6 | 40.0 | Capability-guided significantly better |
| 3D pixel-point distillation | 56.7 | 45.1 | 45.4 | Distillation surpasses 2D features |
| 3D superpoint distillation | 59.2 | 46.3 | 45.4 | Structural information beneficial |
| 3D +self-distillation (full) | 61.9 | 48.6 | 47.5 | Further gains |
Extending to three 2D models (adding ODISE): ScanNet v2 62.5 (+0.6), Matterport3D 49.8 (+1.2), demonstrating scalability.
Long-tail evaluation: SAS significantly outperforms OpenScene under the Matterport3D K=40/80/160 category settings (e.g., K=160: 8.1 vs. 5.8 mIoU).
### Key Findings
- SEEM outperforms LSeg on Matterport3D but is weaker on ScanNet v2, confirming genuine complementarity between the two models.
- Naive addition or linear fusion can perform worse than the best single model, demonstrating the necessity of capability-guided fusion.
- The distilled 3D model consistently outperforms 2D fused features, reflecting gains from 3D spatial consistency.
- Superpoint distillation and self-distillation each contribute approximately 2–3 mIoU points independently.
- The method's advantage is more pronounced under long-tail categories (K=160), reflecting strong open-vocabulary generalization.
## Highlights & Insights
- Core innovation: Using diffusion-synthesized images to quantify teacher model capabilities resolves the fundamental challenge of evaluating models in zero-shot settings—an elegant and broadly applicable idea.
- Text-bridged alignment is a simple, effective solution to the feature-space incompatibility problem across multiple models.
- Temporal ensembling self-distillation is more stable than GGSD's Mean-teacher, avoiding training collapse caused by supervising the teacher with a varying student.
- At inference time, only the 3D model output is used directly, without 2D–3D ensembling, yielding higher efficiency.
## Limitations & Future Work
- Synthetic image quality is bounded by Stable Diffusion's generation capability, which may be insufficient for fine-grained categories.
- Cross-attention map localization is not always precise, and SAM refinement may also introduce errors.
- Only 2–3 2D models are fused; fusion strategies and efficiency for larger model ensembles remain unexplored.
- The quality of superpoint extraction (e.g., color/normal clustering) affects distillation performance.
- The possibility of using 3D geometric information to guide 2D feature selection has not been explored.
## Related Work & Insights
- OpenScene established the 2D→3D feature distillation paradigm; SAS advances it by improving the quality of the 2D teacher features themselves.
- GGSD's Mean-teacher self-distillation inspired SAS's temporal ensembling design.
- Cross-attention maps from Stable Diffusion are cleverly repurposed for object localization, extending the utility of diffusion models as general-purpose tools.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First to propose annotation-free model capability construction; highly innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three major datasets, long-tail evaluation, multi-task extension, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-articulated motivation.
- Value: ⭐⭐⭐⭐⭐ The multi-teacher fusion paradigm has broad implications for open-vocabulary 3D understanding.