# Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
## Paper Information
- Conference: ICCV 2025
- arXiv: 2507.08555
- Code: https://github.com/Enyu-Liu/DISC
- Area: 3D Perception / Semantic Scene Completion
- Keywords: semantic scene completion, BEV, instance-scene disentanglement, dual-stream, autonomous driving
## TL;DR
This paper proposes DISC, a category-aware dual-stream architecture for 3D semantic scene completion that disentangles instance and scene categories into separate query streams with dedicated decoding modules. Using only single-frame input, DISC surpasses multi-frame state-of-the-art methods on SemanticKITTI and achieves a 17.9% relative improvement in instance-category mIoU over the single-frame state of the art.
## Background & Motivation
3D Semantic Scene Completion (SSC) aims to jointly predict scene geometry and semantics from sparse inputs. Existing voxel-based methods (VoxFormer, CGFormer, Symphonies, etc.) suffer from three key limitations:
- Poor instance category prediction: Occlusion and projection errors cause missed categories and semantic ambiguity (e.g., pedestrians being misidentified).
- Incoherent scene category structure: Out-of-view regions introduce topological errors (e.g., roads appearing in terrain areas).
- Inherent limitations of voxel-based interaction: Using voxels as the basic interaction unit destroys category-level structural information, and a single unified module struggles to address the divergent challenges of instances and scenes.
Core observation: Instance categories (car, person, bicycle) and scene categories (road, building, vegetation) face fundamentally different challenges and require differentiated processing strategies. DISC is the first to propose a category-level dual-stream paradigm to systematically address this problem.
## Method
### Overall Architecture
DISC operates in BEV space and consists of two core modules:
- Discriminative Query Generator (DQG): Generates queries with geometric and semantic priors separately for instances and scenes.
- Dual-Attention Category Decoder (DACD): Provides dedicated decoding layers tailored to the distinct challenges of instances and scenes.
### Key Design 1: Discriminative Query Generator (DQG)
Coarse-to-fine BEV generation: Coarse voxel features \(V_{\text{coarse}}\) are produced via LSS lift, refined into \(V_{\text{fine}}\) through depth-guided surface voxel refinement, and Z-axis max-pooling yields BEV features \(C\).
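A minimal PyTorch sketch of the final pooling step (tensor shapes and the axis layout are assumptions, not taken from the paper):

```python
import torch

# Hypothetical refined voxel features V_fine: (batch, channels, X, Y, Z).
v_fine = torch.randn(1, 128, 128, 128, 16)

# Max-pool along the Z (height) axis to obtain BEV features C: (batch, channels, X, Y).
bev = v_fine.max(dim=-1).values
print(bev.shape)  # torch.Size([1, 128, 128, 128])
```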
Instance queries: Potential instances are detected in image space via segmentation and projected to BEV. A \(k \times k\) neighborhood suppression strategy selects \(N_{\text{ins}}\) reference locations, and BEV features sampled at these locations initialize the instance queries, as sketched below.
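One plausible reading of the neighborhood suppression is CenterNet-style peak picking; the sketch below assumes a BEV instance-likelihood heatmap and hypothetical shapes:

```python
import torch
import torch.nn.functional as F

def select_instance_refs(heatmap, k=3, n_ins=100):
    """k x k neighborhood suppression read as CenterNet-style peak picking
    (the exact rule in DISC is an assumption here).

    heatmap: (B, 1, H, W) BEV likelihood of instance presence.
    """
    # Keep a cell only if it is the maximum within its k x k neighborhood.
    peaks = F.max_pool2d(heatmap, k, stride=1, padding=k // 2)
    suppressed = heatmap * (peaks == heatmap).float()
    # Take the n_ins strongest surviving peaks as reference locations.
    scores, idx = suppressed.flatten(1).topk(n_ins, dim=1)
    return scores, idx

bev = torch.randn(1, 128, 64, 64)                       # (B, C, H, W), shapes assumed
_, idx = select_instance_refs(torch.rand(1, 1, 64, 64))
# Sample BEV features at the selected flat indices to initialize instance queries.
q_ins = bev.flatten(2).gather(2, idx.unsqueeze(1).expand(-1, 128, -1))  # (B, C, n_ins)
```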
Scene queries: A patch design captures continuous spatial distributions. BEV features are divided into equal-sized patches, with each patch center serving as a reference point; a strided convolution compresses each patch to \(1 \times 1\) to initialize the scene queries.
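A minimal sketch of the patch-based scene-query initialization (patch size and channel count are assumptions):

```python
import torch
import torch.nn as nn

p, c = 8, 128                       # assumed patch size and channel count
bev = torch.randn(1, c, 64, 64)     # (B, C, H, W)

# A strided convolution compresses each p x p patch to a single 1 x 1 vector;
# the patch centers serve as the scene queries' reference points.
compress = nn.Conv2d(c, c, kernel_size=p, stride=p)
q_scene = compress(bev).flatten(2).transpose(1, 2)  # (B, n_patches, C)
print(q_scene.shape)  # torch.Size([1, 64, 128])
```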
### Key Design 2: Adaptive Instance Layer (AIL)
Addresses the loss of height information and occlusion issues for instance categories:
- Adaptive height sampling: For each instance query \(q_{\text{ins}}\), the \(N\) most probable heights are predicted to form 3D reference points \(P_j = (x_{\text{ins}}, h_j)\); see the sketch after this list.
- Image cross-attention: Reference points are projected into image space, and multi-scale image features are sampled via deformable cross-attention.
- Scene context fusion: Instance queries extract region-of-interest information from scene features (e.g., "a cylindrical object on a sidewalk is more likely a traffic sign than a tree trunk").
- UNet propagation: Instance features are propagated across the entire BEV plane via a UNet network.
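A minimal sketch of the adaptive height sampling from the first bullet (the bin discretization, height range, and linear head are assumptions):

```python
import torch
import torch.nn as nn

n_bins, n_heights, c = 16, 4, 128
height_head = nn.Linear(c, n_bins)                 # hypothetical per-query height head
bin_centers = torch.linspace(-1.0, 3.0, n_bins)    # assumed height range in meters

q_ins = torch.randn(1, 100, c)                     # (B, N_ins, C) instance queries
xy = torch.rand(1, 100, 2)                         # BEV reference locations x_ins

# Predict a distribution over discretized heights and keep the N most probable bins.
probs = height_head(q_ins).softmax(-1)             # (B, N_ins, n_bins)
top_h = bin_centers[probs.topk(n_heights, dim=-1).indices]  # (B, N_ins, N)

# 3D reference points P_j = (x_ins, h_j), later projected into image space
# for deformable cross-attention.
ref_3d = torch.cat(
    [xy.unsqueeze(2).expand(-1, -1, n_heights, -1), top_h.unsqueeze(-1)],
    dim=-1,
)  # (B, N_ins, N, 3)
```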
### Key Design 3: Global Scene Layer (GSL)
Addresses insufficient global reasoning for scene categories:
- Global semantic aggregation: \(\mathbf{Q}_{\text{img}}\) is constructed from the smallest-scale image features, and global information is aggregated via cross-attention.
- Random masking: A subset of \(\mathbf{Q}_{\text{img}}\) is dropped to simulate information missing due to occlusion, encouraging the network to reason about scene layout; see the sketch after this list.
- Self-attention: Expands the global receptive field, propagating visible-region features to distant and out-of-view areas.
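A minimal sketch of the masked global aggregation (the mask ratio, head count, and the use of standard rather than deformable attention are assumptions):

```python
import torch
import torch.nn as nn

c, mask_ratio = 128, 0.3
cross_attn = nn.MultiheadAttention(c, num_heads=8, batch_first=True)
self_attn = nn.MultiheadAttention(c, num_heads=8, batch_first=True)

q_scene = torch.randn(1, 64, c)   # scene queries
q_img = torch.randn(1, 256, c)    # Q_img from the smallest-scale image features

# Random masking: drop a subset of Q_img to simulate occluded regions.
keep = torch.rand(q_img.shape[1]) > mask_ratio
q_img = q_img[:, keep]

# Global semantic aggregation, then self-attention to widen the receptive field.
q_scene, _ = cross_attn(q_scene, q_img, q_img)
q_scene, _ = self_attn(q_scene, q_scene, q_scene)
```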
### Feature Fusion and Loss Function
Feature fusion: The instance and scene streams are merged via category-disentangled height prediction, with each stream contributing its own height estimates when its BEV features are lifted back to the 3D volume.
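A highly schematic sketch of one way such a fusion could work, assuming each stream lifts its BEV features to 3D with its own per-cell height distribution (the paper's exact fusion rule may differ):

```python
import torch

B, C, H, W, Z = 1, 128, 64, 64, 16
bev_ins = torch.randn(B, C, H, W)            # instance-stream BEV features
bev_scn = torch.randn(B, C, H, W)            # scene-stream BEV features
h_ins = torch.randn(B, Z, H, W).softmax(1)   # per-cell height distribution (instance)
h_scn = torch.randn(B, Z, H, W).softmax(1)   # per-cell height distribution (scene)

# Lift each stream's BEV features into a voxel volume with its own heights,
# then merge the two volumes for the SSC prediction head.
vox_ins = bev_ins.unsqueeze(2) * h_ins.unsqueeze(1)  # (B, C, Z, H, W)
vox_scn = bev_scn.unsqueeze(2) * h_scn.unsqueeze(1)
voxels = vox_ins + vox_scn
```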
The total loss comprises the SSC loss (Scene-Class Affinity + Cross-Entropy), an auxiliary BEV segmentation/height loss, and a depth loss, as written out below.
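Written out, assuming a simple weighted sum (the coefficients \(\lambda_i\) are placeholders, not values from the paper):

\[
\mathcal{L}_{\text{total}} = \underbrace{\mathcal{L}_{\text{scal}} + \mathcal{L}_{\text{ce}}}_{\text{SSC}} + \lambda_{1}\,\mathcal{L}_{\text{bev}} + \lambda_{2}\,\mathcal{L}_{\text{depth}}
\]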
## Key Experimental Results
### Main Results: SemanticKITTI Test Set
| Method | Input | IoU | InsM (instance mIoU) | ScnM (scene mIoU) | mIoU |
|---|---|---|---|---|---|
| VoxFormer-T | Multi-frame | 43.21 | 4.79 | 22.97 | 13.41 |
| HTCL | Multi-frame | 44.23 | 6.48 | 28.86 | 17.09 |
| VoxFormer-S | Single-frame | 42.95 | 4.39 | 20.89 | 12.20 |
| Symphonize | Single-frame | 42.19 | 6.14 | 24.93 | 15.04 |
| CGFormer | Single-frame | 44.41 | 6.15 | 28.24 | 16.63 |
| DISC (ours) | Single-frame | 45.32 | 7.25 | 28.56 | 17.35 |
- Single-frame DISC outperforms all multi-frame methods in overall mIoU (17.35 vs. 17.09 for HTCL).
- Instance mIoU (InsM) reaches 7.25, a 17.9% gain over single-frame SOTA and 11.9% over multi-frame SOTA.
- On SSCBench-KITTI-360, mIoU reaches 20.55, surpassing all camera-based and LiDAR-based methods.
### Ablation Study (SemanticKITTI val)
| Configuration | IoU | InsM (instance mIoU) | ScnM (scene mIoU) | mIoU |
|---|---|---|---|---|
| Baseline (VoxFormer) | 43.13 | 3.59 | — | 12.18 |
| + DQG (instance queries) | 43.78 | 5.22 | — | 14.31 |
| + DQG (scene queries) | 44.01 | 3.98 | — | 14.76 |
| + DACD (AIL + GSL) | 44.85 | 6.15 | — | 16.10 |
| Full DISC | 45.32 | 7.25 | 28.56 | 17.35 |
Key findings:
- Discriminative queries (replacing voxel queries) account for the majority of the instance-category gains.
- The targeted design of the dual-attention decoder further improves InsM to 7.25.
- Training requires only 20 epochs, fewer than most existing methods, indicating faster convergence.
## Highlights & Insights
- First category-level paradigm: Shifts from voxel-level to category-level interaction, fundamentally reframing the SSC processing paradigm.
- Feasibility of BEV-space disentanglement: Instance-scene decoupling mitigates feature entanglement along the height axis in BEV (e.g., height ambiguity when pedestrians and roads share the same BEV cell).
- Single-frame surpassing multi-frame: Demonstrates that fully exploiting single-frame category information holds greater potential than naive multi-frame fusion.
- Scene context aiding instance reasoning: Validates the effectiveness of spatial priors such as "a cylindrical object on a sidewalk → traffic sign."
## Limitations & Future Work
- Performance depends on the quality of the pretrained MaskDINO instance segmentation; segmentation failures degrade instance query initialization.
- Height prediction in BEV space remains approximate; extreme height differences (e.g., tall buildings vs. ground) may not be captured accurately.
- Validation is limited to KITTI-series datasets; generalization to large-scale benchmarks such as nuScenes or Waymo has not been explored.
## Related Work & Insights
- Relation to Symphonies: Symphonies implicitly leverages instance priors, whereas DISC explicitly disentangles instance and scene streams.
- The instance query design in DQG (image-space detection → BEV projection → neighborhood suppression) is generalizable to other BEV perception tasks.
- The dual-stream architecture conceptually parallels the differentiated handling of target types in the DETR family of methods.
## Rating
⭐⭐⭐⭐⭐ — The work offers deep problem insight (instance vs. scene divergent challenges), an elegant solution (category-level dual-stream), outstanding experimental results (single-frame surpassing multi-frame), and a paradigm-level contribution to the SSC task.