
Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

Paper Information

  • Conference: ICCV 2025
  • arXiv: 2507.08555
  • Code: https://github.com/Enyu-Liu/DISC
  • Area: 3D Perception / Semantic Scene Completion
  • Keywords: semantic scene completion, BEV, instance-scene disentanglement, dual-stream, autonomous driving

TL;DR

This paper proposes DISC, a category-aware dual-stream architecture for 3D semantic scene completion that disentangles instance and scene categories into separate query streams with dedicated decoding modules. Using only single-frame input, DISC surpasses multi-frame state-of-the-art methods on SemanticKITTI, with a 17.9% relative improvement in instance-category mIoU over the prior single-frame state of the art.

Background & Motivation

3D Semantic Scene Completion (SSC) aims to jointly predict scene geometry and semantics from sparse inputs. Existing voxel-based methods (VoxFormer, CGFormer, Symphonies, etc.) suffer from three key limitations:

Poor instance category prediction: Occlusion and projection errors cause category misses and semantic ambiguity (e.g., pedestrians being misidentified).

Incoherent scene category structure: Out-of-view regions introduce topological errors (e.g., roads appearing in terrain areas).

Inherent limitations of voxel-based approaches: Using voxels as the basic interaction unit destroys category-level structural information, and unified modules struggle to address the divergent challenges of instances and scenes.

Core observation: Instance categories (car, person, bicycle) and scene categories (road, building, vegetation) face fundamentally different challenges and require differentiated processing strategies. DISC is the first to propose a category-level dual-stream paradigm to systematically address this problem.

Method

Overall Architecture

DISC operates in BEV space and consists of two core modules:

  1. Discriminative Query Generator (DQG): Generates queries with geometric and semantic priors separately for instances and scenes.
  2. Dual-Attention Category Decoder (DACD): Provides dedicated decoding layers tailored to the distinct challenges of instances and scenes.

Key Design 1: Discriminative Query Generator (DQG)

Coarse-to-fine BEV generation: Coarse voxel features \(V_{\text{coarse}}\) are produced via an LSS-style lift and refined into \(V_{\text{fine}}\) through depth-guided surface voxel refinement; Z-axis max-pooling then yields the BEV features \(C\).
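A minimal PyTorch sketch of the final pooling step (tensor shapes and names are my assumptions, not taken from the released code):

```python
import torch

def voxels_to_bev(v_fine: torch.Tensor) -> torch.Tensor:
    """Collapse refined voxel features (B, C, X, Y, Z) into BEV features
    (B, C, X, Y) by max-pooling along the Z (height) axis."""
    bev, _ = v_fine.max(dim=-1)  # keep the strongest response in each pillar
    return bev

# e.g. V_fine of shape (1, 128, 256, 256, 32) -> C of shape (1, 128, 256, 256)
```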

Instance queries: Potential instances are detected in image space via segmentation and projected to BEV. A \(k \times k\) neighborhood suppression strategy selects \(N_{\text{ins}}\) reference locations, and BEV features are sampled at these locations to initialize instance queries:

\[X_{\text{ins}} = \{CT(\mathbf{g}_n) \mid \mathbf{g}_n \in \text{Top-}N(\{\text{Max}(B_{k \times k}^i)\}_{i=1}^s)\}\]
\[\mathbf{Q}_{\text{ins}} = C[X_{\text{ins}}]\]
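The selection rule above can be sketched as follows; `ins_heatmap` is a hypothetical BEV map of projected instance evidence, and the exact suppression and scoring details are assumptions:

```python
import torch
import torch.nn.functional as F

def init_instance_queries(bev_feat, ins_heatmap, k=5, n_ins=100):
    """Pick N_ins reference locations via k x k neighborhood suppression
    on a BEV instance heatmap, then sample BEV features there.

    bev_feat:    (B, C, H, W)  BEV features C from the coarse-to-fine stage
    ins_heatmap: (B, 1, H, W)  instance evidence projected from image space
    returns queries (B, N_ins, C) and integer BEV locations (B, N_ins, 2)
    """
    B, C, H, W = bev_feat.shape
    # keep only the local maximum of each k x k neighborhood (NMS-style)
    pooled = F.max_pool2d(ins_heatmap, k, stride=1, padding=k // 2)
    peaks = ins_heatmap * (pooled == ins_heatmap)
    # the Top-N_ins surviving peaks become reference locations g_n
    _, idx = peaks.flatten(2).topk(n_ins, dim=-1)          # (B, 1, N_ins)
    ys, xs = idx // W, idx % W
    locs = torch.stack([xs, ys], dim=-1).squeeze(1)        # (B, N_ins, 2)
    # Q_ins = C[X_ins]: gather BEV features at the reference locations
    flat = bev_feat.flatten(2)                             # (B, C, H*W)
    queries = flat.gather(2, idx.expand(-1, C, -1)).transpose(1, 2)
    return queries, locs
```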

Scene queries: A patch-based design captures the continuous spatial distribution of scene categories. BEV features are divided into equal-sized patches, with each patch center serving as a reference point; convolutional downsampling then compresses each patch to \(1 \times 1\) to initialize the scene queries.
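A sketch of the patch-to-query compression under assumed sizes (embedding dim 128, 16×16 patches; the actual patch size and conv stack are not specified here):

```python
import torch.nn as nn

class ScenePatchQueries(nn.Module):
    """Compress each equal-sized BEV patch into a single scene query vector."""

    def __init__(self, dim=128, patch=16):
        super().__init__()
        self.patch = patch
        # two stride-4 convolutions take a 16x16 patch down to 1x1
        self.compress = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )

    def forward(self, bev):                  # bev: (B, C, H, W)
        assert bev.size(-1) % self.patch == 0 and bev.size(-2) % self.patch == 0
        q = self.compress(bev)               # (B, C, H/16, W/16), one cell per patch
        return q.flatten(2).transpose(1, 2)  # (B, N_scn, C)
```

Each output position corresponds to one patch center, which serves as that scene query's reference point.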

Key Design 2: Adaptive Instance Layer (AIL)

Addresses the loss of height information and occlusion issues for instance categories:

  1. Adaptive height sampling: For each instance query \(q_{\text{ins}}\), the \(N\) most probable heights are predicted to form 3D reference points \(P_j = (x_{\text{ins}}, h_j)\) (see the sketch after this list).
  2. Image cross-attention: Reference points are projected into image space, and multi-scale features are sampled via deformable cross-attention:
\[q_{\text{ins}} = \sum_{j=1}^N w_j \, \text{DA}(q_{\text{ins}}, F^{2D}, \mathcal{T}^{WI}(x_{\text{ins}}, h_j))\]
  3. Scene context fusion: Instance queries extract region-of-interest information from scene features (e.g., "a cylindrical object on a sidewalk is more likely a traffic sign than a tree trunk").
  4. UNet propagation: Instance features are propagated across the entire BEV plane via a UNet.
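A sketch of the height-sampling step (item 1); the height range, number of candidates \(N\), and head layout are assumptions:

```python
import torch
import torch.nn as nn

class AdaptiveHeightSampling(nn.Module):
    """Predict N candidate heights and mixing weights per instance query,
    turning each 2D BEV reference (x, y) into N 3D points (x, y, h_j)."""

    def __init__(self, dim=128, n_heights=4, z_range=(-2.0, 4.0)):
        super().__init__()
        self.head = nn.Linear(dim, 2 * n_heights)  # heights and weights w_j
        self.n, self.z_min, self.z_max = n_heights, *z_range

    def forward(self, q_ins, ref_xy):              # (B, N_ins, C), (B, N_ins, 2)
        h, w = self.head(q_ins).chunk(2, dim=-1)   # each (B, N_ins, n)
        heights = self.z_min + torch.sigmoid(h) * (self.z_max - self.z_min)
        weights = w.softmax(dim=-1)                # the w_j in the equation above
        xy = ref_xy.unsqueeze(2).expand(-1, -1, self.n, -1)
        points = torch.cat([xy, heights.unsqueeze(-1)], dim=-1)  # (B, N_ins, n, 3)
        return points, weights
```

The 3D points are then projected to image space via \(\mathcal{T}^{WI}\), deformable cross-attention samples multi-scale features at each projection, and the \(N\) results are blended with the predicted weights \(w_j\) as in the equation above.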

Key Design 3: Global Scene Layer (GSL)

Addresses insufficient global reasoning for scene categories:

  1. Global semantic aggregation: \(\mathbf{Q}_{\text{img}}\) is constructed from the smallest-scale image features, and global information is aggregated via cross-attention.
  2. Random masking: A subset of \(\mathbf{Q}_{\text{img}}\) is dropped during training to simulate information lost to occlusion, encouraging the network to reason about scene layout (see the sketch after this list).
  3. Self-attention: Expands the global receptive field, propagating visible-region features to distant and out-of-view areas.
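The masking step (item 2) is essentially query dropout; a minimal sketch, with the drop ratio as an assumption:

```python
import torch

def mask_image_queries(q_img: torch.Tensor, drop_ratio: float = 0.3):
    """Randomly drop a subset of image queries Q_img during training to
    simulate regions lost to occlusion: (B, N, C) -> (B, N_keep, C)."""
    B, N, C = q_img.shape
    n_keep = max(1, int(N * (1.0 - drop_ratio)))
    # independent random permutation per sample; keep the first n_keep queries
    keep = torch.rand(B, N, device=q_img.device).argsort(dim=1)[:, :n_keep]
    return q_img.gather(1, keep.unsqueeze(-1).expand(-1, -1, C))

# at inference time the masking is skipped and all queries are kept
```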

Feature Fusion and Loss Function

Category-disentangled height prediction fusion:

\[V = (C_{\text{ins}} \otimes H_{\text{ins}}) + (C_{\text{scn}} \otimes H_{\text{scn}})\]
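Reading \(\otimes\) as an outer product between each BEV cell's feature and its predicted height distribution, the fusion can be sketched as follows (shapes assumed):

```python
import torch

def fuse_category_volumes(c_ins, h_ins, c_scn, h_scn):
    """Lift disentangled BEV features back to a voxel volume:
    V = C_ins (outer) H_ins + C_scn (outer) H_scn.

    c_*: (B, C, X, Y)  category-specific BEV features
    h_*: (B, Z, X, Y)  per-cell height distributions
    returns V: (B, C, X, Y, Z)
    """
    v_ins = torch.einsum('bcxy,bzxy->bcxyz', c_ins, h_ins)
    v_scn = torch.einsum('bcxy,bzxy->bcxyz', c_scn, h_scn)
    return v_ins + v_scn
```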

The total loss comprises the SSC loss (Scene-Class Affinity + Cross-Entropy), BEV segmentation/height augmentation loss, and depth loss:

\[\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{ssc}} + \lambda_2 \mathcal{L}_{\text{aug}} + \lambda_d \mathcal{L}_d\]

Key Experimental Results

Main Results: SemanticKITTI Test Set

Method         Input          IoU     InsM    ScnM    mIoU
VoxFormer-T    Multi-frame    43.21   4.79    22.97   13.41
HTCL           Multi-frame    44.23   6.48    28.86   17.09
VoxFormer-S    Single-frame   42.95   4.39    20.89   12.20
Symphonize     Single-frame   42.19   6.14    24.93   15.04
CGFormer       Single-frame   44.41   6.15    28.24   16.63
DISC (ours)    Single-frame   45.32   7.25    28.56   17.35

(All values in %; IoU is geometric completion, InsM/ScnM are mean IoU over instance/scene categories.)
  • Single-frame DISC outperforms all multi-frame methods in mIoU (17.35 vs. HTCL 17.09).
  • Instance mIoU (InsM) reaches 7.25, a 17.9% gain over single-frame SOTA and 11.9% over multi-frame SOTA.
  • On SSCBench-KITTI-360, mIoU reaches 20.55, surpassing all camera-based and LiDAR-based methods.

Ablation Study (SemanticKITTI val)

Configuration              IoU     InsM    ScnM    mIoU
Baseline (VoxFormer)       43.13   3.59    –       12.18
+ DQG (instance queries)   43.78   5.22    –       14.31
+ DQG (scene queries)      44.01   3.98    –       14.76
+ DACD (AIL + GSL)         44.85   6.15    –       16.10
Full DISC                  45.32   7.25    28.56   17.35

(ScnM is not reported for the intermediate configurations.)

Key Findings:

  • Discriminative queries (replacing voxel queries) account for the majority of the instance-category gains.
  • The targeted design of the dual-attention decoder further improves InsM to 7.25.
  • Training requires only 20 epochs, fewer than most existing methods, indicating faster convergence.

Highlights & Insights

  1. First category-level paradigm: Shifts from voxel-level to category-level interaction, fundamentally reframing the SSC processing paradigm.
  2. Feasibility of BEV-space disentanglement: Instance-scene decoupling mitigates feature entanglement along the height axis in BEV (e.g., height ambiguity when pedestrians and roads share the same BEV cell).
  3. Single-frame surpassing multi-frame: Demonstrates that fully exploiting single-frame category information holds greater potential than naive multi-frame fusion.
  4. Scene context aiding instance reasoning: Validates the effectiveness of spatial priors such as "a cylindrical object on a sidewalk → traffic sign."

Limitations & Future Work

  • Performance depends on the quality of the pretrained MaskDINO instance segmentation; segmentation failures degrade instance query initialization.
  • Height prediction in BEV space remains approximate; extreme height differences (e.g., tall buildings vs. ground) may not be captured accurately.
  • Validation is limited to KITTI-series datasets; generalization to large-scale benchmarks such as nuScenes or Waymo has not been explored.
  • Relation to Symphonies: Symphonies implicitly leverages instance priors, whereas DISC explicitly disentangles instance and scene streams.
  • The instance query design in DQG (image-space detection → BEV projection → neighborhood suppression) is generalizable to other BEV perception tasks.
  • The dual-stream architecture conceptually parallels the differentiated handling of target types in the DETR family of methods.

Rating

⭐⭐⭐⭐⭐ — The work offers deep problem insight (instance vs. scene divergent challenges), an elegant solution (category-level dual-stream), outstanding experimental results (single-frame surpassing multi-frame), and a paradigm-level contribution to the SSC task.