# Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
**Conference**: ICCV 2025 · **arXiv**: 2507.18331 · **Code**: https://github.com/RM-Zhang/SGCDet · **Area**: 3D Vision · **Keywords**: multi-view 3D object detection, indoor scene, sparse voxel construction, deformable attention, occupancy prediction
## TL;DR
SGCDet achieves state-of-the-art multi-view indoor 3D object detection without any ground-truth geometric supervision. Its two ingredients are a geometry- and context-aware aggregation module (3D deformable attention plus multi-view attention fusion) and an occupancy-probability-based sparse voxel construction strategy, which together also substantially reduce computational overhead.
## Background & Motivation
Indoor 3D object detection is a core capability for embodied AI and AR/VR applications. Traditional methods rely on expensive 3D sensors to acquire point clouds; recent work has shifted toward multi-view image-based 3D detection. The central challenge lies in constructing high-quality 3D voxel representations from 2D images.
Prior methods exhibit two critical bottlenecks:
- Limited feature sampling: Methods such as ImVoxelNet project each voxel onto the image for single-point sampling, resulting in an extremely limited receptive field and an inability to handle occlusion. Subsequent methods (CN-RMA, MVSDet) introduce explicit geometric constraints but either rely on GT geometry or incur prohibitive computational cost.
- Dense voxel redundancy: Existing methods construct complete dense 3D voxel grids, yet indoor scenes consist largely of free space, leading to severe computational waste.
## Core Problem
How can the quality of 2D-to-3D feature projection be improved (addressing single-point sampling and occlusion) while reducing redundant computation in the 3D volume representation, without relying on ground-truth scene geometry?
## Method
### Overall Architecture
SGCDet comprises three components: (1) an image backbone (ResNet-50 + FPN) for 2D feature extraction; (2) a view transformation module that lifts 2D features into 3D voxels; and (3) a detection head for 3D bounding box prediction. The core innovations reside in the view transformation module and consist of two key designs.
### Key Designs
**Design 1: Geometry- and Context-Aware Aggregation (GCA)**
Rather than projecting voxel centers onto the image for single-point sampling, SGCDet performs adaptive aggregation in two steps:
- Intra-view Feature Sampling: A DepthNet first estimates a per-pixel depth distribution; 2D features are lifted into a 3D pixel-space representation via an outer product, \(\mathbf{F}_n^{3D} = \mathbf{F}_n^{2D} \otimes \mathbf{D}_n\). Instead of sampling directly at the projected location, the projected-point feature serves as a query, and 3D deformable attention aggregates geometric and contextual information within a local neighborhood (see the first sketch below). Ablation studies show that 3D deformable attention substantially outperforms its 2D counterpart, which suffers from depth ambiguity.
- Inter-view Feature Fusion: Appearance and scale vary considerably across viewpoints, so simple averaging is suboptimal. SGCDet instead uses the mean-pooled feature over all views as the query and each view's feature as key/value, dynamically weighting each view's contribution via standard attention (see the second sketch below).
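A minimal PyTorch sketch of the intra-view operations. Shapes and function names (`lift_features`, `sample_neighborhood`) are illustrative assumptions, not the authors' API, and the networks that predict offsets and attention weights in full deformable attention are omitted:

```python
import torch
import torch.nn.functional as F

def lift_features(feat_2d, depth_prob):
    """Lift 2D features to a 3D pixel-space volume: F_n^3D = F_n^2D (outer product) D_n.

    feat_2d:    (N, C, H, W) per-view 2D features
    depth_prob: (N, D, H, W) per-pixel depth distribution over D bins
    returns:    (N, C, D, H, W) 3D feature volume
    """
    return feat_2d.unsqueeze(2) * depth_prob.unsqueeze(1)

def sample_neighborhood(volume, ref_points, offsets, weights):
    """Aggregate features at learned offsets around each projected reference point.

    volume:     (N, C, D, H, W) lifted 3D volume
    ref_points: (N, M, 3) normalized (x, y, depth) coords in [-1, 1]
    offsets:    (N, M, K, 3) predicted offsets for K sampling locations
    weights:    (N, M, K) attention weights (softmax over K)
    returns:    (N, M, C) aggregated per-voxel features
    """
    locs = ref_points.unsqueeze(2) + offsets                    # (N, M, K, 3)
    grid = locs.unsqueeze(1)                                    # (N, 1, M, K, 3)
    sampled = F.grid_sample(volume, grid, align_corners=False)  # (N, C, 1, M, K)
    sampled = sampled.squeeze(2).permute(0, 2, 3, 1)            # (N, M, K, C)
    return (weights.unsqueeze(-1) * sampled).sum(dim=2)         # (N, M, C)
```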
Compared to the view-agnostic queries in DFA3D, SGCDet uses view-specific queries for intra-view aggregation, which is better suited to the large camera pose variation characteristic of indoor scenes.
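A sketch of the view-weighting idea, with the learned query/key/value projections of full attention dropped for brevity; shapes and the function name are assumptions:

```python
import torch

def fuse_views(view_feats):
    """Fuse per-view voxel features with attention over the view axis.

    view_feats: (V, M, C) features sampled for M voxels from V views
    returns:    (M, C) fused features
    """
    c = view_feats.shape[-1]
    query = view_feats.mean(dim=0, keepdim=True)            # (1, M, C) mean-pooled query
    scores = (query * view_feats).sum(dim=-1) / c ** 0.5    # (V, M) scaled dot-product
    weights = scores.softmax(dim=0)                         # attention weights over views
    return (weights.unsqueeze(-1) * view_feats).sum(dim=0)  # (M, C)
```

Compared with plain averaging, the softmax over the view axis lets a voxel downweight views where it is occluded or poorly observed.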
**Design 2: Sparse Volume Construction**
A coarse-to-fine strategy is adopted to progressively upsample voxels:
- A low-resolution (e.g., \(10\times10\times4\)) coarse voxel grid is constructed first.
- Over \(L\) stages, the resolution is doubled at each stage:
  - A lightweight occupancy prediction head estimates the occupancy probability of each voxel.
  - Only the top-\(k\)% (default 25%) of voxels by occupancy probability are selected for GCA feature refinement (see the sketch below).
  - A residual connection is applied: \(\mathbf{V}_l = \mathbf{V}_l^{\mathrm{init}} + \mathcal{P}(\mathbf{P}_l, \{\mathbf{F}_n^{2D}\}, \{\mathbf{D}_n\})\)
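A sketch of the per-stage sparsification step, assuming one logit per voxel; names and shapes are illustrative:

```python
import torch

def select_topk_voxels(occ_logits, ratio=0.25):
    """Keep the top-k% of voxels by predicted occupancy probability.

    occ_logits: (M,) one occupancy logit per voxel in the current stage's grid
    returns:    indices of voxels to refine with GCA; the rest keep V_l^init
    """
    probs = occ_logits.sigmoid()
    k = max(1, int(ratio * probs.numel()))
    return probs.topk(k).indices
```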
Occupancy supervision design: Rather than requiring GT scene geometry, pseudo-labels are generated from 3D bounding boxes — voxels inside boxes are labeled 1, otherwise 0. Although these pseudo-labels are noisy (not all space inside a box is occupied), the top-25% selection strategy at inference is sufficient to cover regions with actual objects.
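A sketch of the pseudo-label generation, simplified to axis-aligned boxes (the dataset boxes may be oriented, in which case voxel centers would first be transformed into each box frame):

```python
import torch

def bbox_occupancy_labels(voxel_centers, boxes):
    """Pseudo-occupancy labels: 1 if a voxel center lies inside any GT box, else 0.

    voxel_centers: (M, 3) xyz voxel centers
    boxes:         (B, 6) axis-aligned boxes [x0, y0, z0, x1, y1, z1]
    returns:       (M,) float labels
    """
    lo, hi = boxes[:, :3], boxes[:, 3:]
    inside = ((voxel_centers[:, None] >= lo) &
              (voxel_centers[:, None] <= hi)).all(dim=-1)  # (M, B)
    return inside.any(dim=-1).float()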
DepthNet: Multi-view depth features (built via plane sweep cost volume) and monocular depth features (capturing image-level detail) are concatenated and decoded to produce the depth distribution, with the two branches complementing each other.
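A sketch of the dual-branch decode, assuming both branches arrive as spatially aligned feature maps; the class name, channel widths, and layer choices are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DualBranchDepthHead(nn.Module):
    """Concatenate multi-view (plane-sweep) and monocular depth features,
    then decode a per-pixel distribution over depth bins."""

    def __init__(self, c_mv: int, c_mono: int, num_bins: int):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(c_mv + c_mono, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_bins, kernel_size=1),
        )

    def forward(self, feat_mv, feat_mono):
        # feat_mv:   (N, c_mv, H, W) cost-volume features from plane sweep
        # feat_mono: (N, c_mono, H, W) monocular image features
        logits = self.decoder(torch.cat([feat_mv, feat_mono], dim=1))
        return logits.softmax(dim=1)  # (N, num_bins, H, W) depth distribution
```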
## Loss & Training
- \(\mathcal{L}_{det}\): anchor-free detection loss, the sum of a cross-entropy centerness loss, an IoU loss, and a focal classification loss
- \(\mathcal{L}_{occ}\): sum of BCE losses on occupancy probabilities across all stages
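A minimal sketch of the occupancy term, assuming per-stage logits and pseudo-labels flattened to matching shapes:

```python
import torch.nn.functional as F

def occupancy_loss(logits_per_stage, labels_per_stage):
    """L_occ: sum of per-stage BCE losses on voxel occupancy predictions."""
    return sum(
        F.binary_cross_entropy_with_logits(logits, labels)
        for logits, labels in zip(logits_per_stage, labels_per_stage)
    )
```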
Training configuration: AdamW optimizer, lr = 0.0002, cosine decay; 12 epochs on ScanNet/ARKitScenes, 30 epochs on ScanNet200; 40 images during training, 100 images during testing.
## Key Experimental Results
| Dataset | Metric | SGCDet | Prev. SOTA | Gain |
|---|---|---|---|---|
| ScanNet | mAP@0.25 | 61.2 | 56.2 (MVSDet) | +5.0 |
| ScanNet | mAP@0.50 | 35.2 | 31.3 (MVSDet) | +3.9 |
| ARKitScenes | mAP@0.25 | 62.3 | 60.7 (MVSDet) | +1.6 |
| ARKitScenes | mAP@0.50 | 44.7 | 40.1 (MVSDet) | +4.6 |
| ScanNet200 (SGCDet-L) | mAP@0.25 | 28.9 | 22.3 (ImGeoNet) | +6.6 |
Computational efficiency vs. MVSDet: training memory ↓42.9% (20 vs. 35 GB), training time ↓47.2% (19 vs. 36 h), inference memory ↓50% (14 vs. 28 GB), FPS 1.46 vs. 0.87 (↑67.8%).
SGCDet-L achieves 70.4/57.0 on ARKitScenes, surpassing even CN-RMA (67.6/56.5), which uses GT geometric supervision.
## Ablation Study
- 3D vs. 2D deformable attention: The 3D variant yields +3.5/+4.3 mAP gain, while the 2D variant contributes only +0.2/+0.7, confirming that performing deformable attention in 3D pixel space jointly captures geometry and context.
- Multi-view attention: Adds +1.7/+1.1 on top of 3D deformable attention, validating the value of dynamic view weighting.
- Selection ratio: 25% vs. 100% yields nearly identical performance (61.2 vs. 61.0) while reducing memory from 31 to 20 GB. 10% is too aggressive and causes a large performance drop (−4.2 mAP@0.25).
- Occupancy loss is essential: Removing it causes a sharp drop of −6.7/−6.2, demonstrating that explicit occupancy supervision is critical for sparse construction.
- Depth quality upper bound: Adding depth supervision yields 62.2/37.1; GT depth yields 64.3/42.3, suggesting that improved depth estimation can further boost performance.
- Robustness to annotation noise: With 15% random box dropping and 15% random scaling, SGCDet drops only 0.5/1.6, compared to 0.8/2.2 for ImGeoNet.
## Highlights & Insights
- Using bounding boxes as occupancy pseudo-labels is a clever trick: It avoids dependency on GT geometry while providing sufficiently informative occupancy supervision. The top-\(k\) selection strategy at inference further compensates for the imprecision of pseudo-labels.
- 3D deformable attention substantially outperforms its 2D counterpart: While performing deformable sampling in 2D feature maps may seem more natural, experiments demonstrate that adaptive sampling along the depth dimension is necessary to genuinely resolve depth ambiguity.
- Sparsification yields across-the-board gains: A 25% selection ratio dramatically reduces resource consumption while preserving accuracy — highly valuable for real-world deployment.
- DepthNet's dual-branch design: The multi-view branch provides geometrically consistent cross-view constraints, while the monocular branch preserves fine image-level structural detail; the two branches are mutually complementary.
## Limitations & Future Work
- Depth estimation remains a significant bottleneck: The GT depth upper bound (64.3/42.3 vs. 61.2/35.2) indicates that the model is limited by depth estimation accuracy; incorporating stronger depth estimation modules or pretrained depth foundation models is a promising direction.
- Coarseness of pseudo-labels: Much of the space inside bounding boxes is actually free; generating more accurate pseudo-occupancy labels via shape priors or self-supervised signals could yield further improvements.
- Fixed top-\(k\) selection: The 25% ratio is fixed, yet object density varies considerably across scenes; an adaptive selection strategy may be more appropriate.
- Larger backbones and pretrained weights unexplored: The current model uses ResNet-50; Vision Transformers or larger pretrained models may provide additional gains in feature representation capacity.
## Related Work & Insights
| Method | GT Geometry Required | Feature Lifting Strategy | Efficiency |
|---|---|---|---|
| ImVoxelNet | ✗ | Ray-based average | High, but low accuracy |
| ImGeoNet | ✓ | Ray + opacity post-processing | Medium |
| NeRF-Det | ✗ | NeRF ray + opacity post-processing | Medium |
| CN-RMA | ✓ | TSDF-guided + multi-stage | Low (243 h training) |
| MVSDet | ✗ | MVS depth + 3DGS self-supervision | Low (35 GB memory) |
| SGCDet | ✗ | 3D deform. attn + sparse construction | High (20 GB, 1.46 FPS) |
SGCDet comprehensively outperforms all methods that do not use GT geometry, and its efficiency approaches or surpasses methods that do.
## Connection to My Research
- The occupancy prediction and sparsification paradigm is transferable to autonomous driving 3D occupancy prediction (e.g., open-vocabulary 3D occupancy prediction), and the pseudo-label generation strategy offers a reference for scenarios lacking dense GT annotations.
- The design philosophy of 3D deformable attention — performing adaptive sampling in the lifted 3D space rather than on the 2D plane — is applicable to any task requiring modeling of depth uncertainty.
- The multi-view + monocular dual-branch fusion pattern in DepthNet constitutes a general depth estimation paradigm reusable in other multi-view 3D understanding tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of 3D deformable attention and bbox-based pseudo-occupancy labels is sufficiently novel; the removal of GT geometry dependency has clear engineering and academic value.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, detailed ablations (five groups), computational cost analysis, robustness experiments, and visualizations — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, well-motivated, with intuitive figures and rigorous mathematical notation.
- Value: ⭐⭐⭐ The occupancy prediction and sparsification strategies are transferable, though the connection to the primary current research direction is moderate.