Zero-Shot Multi-Object Scene Completion
Conference: ECCV 2024
arXiv: 2403.14628
Code: Project Page
Area: 3D Vision
TL;DR
OctMAE is a hybrid architecture that fuses an Octree U-Net with a latent 3D MAE to achieve high-quality, near-real-time multi-object scene shape completion from a single RGB-D image. An occlusion-masking strategy and 3D Rotary Position Embedding (RoPE) improve both efficiency and generalization.
Background & Motivation
- Existing single-object shape completion methods perform poorly in complex, multi-object real-world scenes.
- Object-centric methods require class-specific priors, limiting them to a few categories.
- VoxFormer extends MAE to 3D but uses dense voxels (memory-constrained to low resolutions).
- Lack of large-scale, multi-class 3D scene completion datasets.
- Core Problem: How to achieve zero-shot multi-object scene completion across a wide range of object categories.
Method
Overall Architecture
- A pre-trained ResNeXt-50 extracts features from the RGB image; these are back-projected to 3D using the depth map, yielding per-point features and coordinates.
- The 3D point features are converted into an octree representation (LoD-9, \(512^3\) resolution).
- An Octree U-Net encodes the features into an LoD-5 latent space.
- A latent 3D MAE processes the encoded features and occlusion mask tokens.
- An Octree U-Net decoder restores the features to LoD-9, predicting occupancy hierarchically and SDF/normals at the final layer.
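The first two steps of the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, depth in metres, and a cubic scene volume described by a hypothetical `origin` and `extent`; the function names are made up for this example.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) to camera-frame 3D points (N, 3).

    K is a 3x3 pinhole intrinsics matrix. Pixels with depth <= 0 are dropped.
    Also returns the (row, col) index of every kept pixel, so that 2D image
    features can be gathered for exactly the same pixels.
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)
    pixels = np.stack([v[valid], u[valid]], axis=1)
    return points, pixels

def voxelize(points, origin, extent, lod=9):
    """Quantise points into integer coordinates of a (2**lod)^3 grid.

    origin is the minimum corner of the cubic volume and extent its side
    length (both assumptions of this sketch). Returns the integer
    coordinates and a boolean mask of points that fall inside the volume.
    """
    res = 2 ** lod
    idx = np.floor((points - origin) / extent * res).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < res), axis=1)
    return idx, inside
```

Per-point features would then be gathered as `features[pixels[:, 0], pixels[:, 1]]`, and the coarse LoD-5 cell containing a LoD-9 voxel is obtained with an integer shift, `idx >> (9 - 5)`.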
Key Designs
OctMAE Architecture (Core Innovation):
- Octree U-Net handles efficient local feature encoding/decoding (from LoD-9 to LoD-5).
- 3D MAE performs global reasoning in the LoD-5 latent space (reducing the token count to hundreds or thousands).
- It combines the local perception of CNNs with the global understanding of Transformers.
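A quick calculation shows why attention is moved to the LoD-5 latent space. A dense grid at LoD-\(\ell\) has \((2^{\ell})^{3}\) cells, and full attention over \(N\) tokens compares \(N^{2}\) pairs. The value \(N = 2{,}000\) below is a hypothetical count of occupied latent cells, chosen only to sit within the "hundreds or thousands" range quoted above.

\[
(2^{9})^{3} = 512^{3} = 134{,}217{,}728, \qquad (2^{5})^{3} = 32^{3} = 32{,}768
\]

\[
N = 32{,}768 \;\Rightarrow\; N^{2} = 2^{30} \approx 1.07 \times 10^{9}, \qquad N = 2{,}000 \;\Rightarrow\; N^{2} = 4 \times 10^{6}
\]

Even a fully dense LoD-5 grid is 4,096 times smaller than the LoD-9 grid, and keeping only occupied and occluded cells shrinks the attention cost by a further two orders of magnitude in this example.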
Occlusion-Masking Strategy:
- Placing mask tokens at every empty voxel (dense masking) makes memory use explode, so mask tokens are placed only at occluded voxels.
- Occluded voxels are found by depth testing, which determines which voxels lie behind objects.
- This drastically reduces the number of mask tokens, enabling the use of full attention instead of deformable attention.
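The depth test can be sketched as below. This is an illustrative version under assumptions, not the paper's code: voxel centres are given in the camera frame, the camera is a pinhole with intrinsics `K`, and the test is run against the raw depth map (the paper may restrict it to foreground objects). The tolerance `eps` and the function name are hypothetical.

```python
import numpy as np

def occluded_mask(centers, depth, K, eps=1e-3):
    """Flag voxel centres (N, 3) that lie behind the observed surface.

    A voxel is treated as occluded when it projects inside the image, the
    pixel it lands on has a valid depth, and the voxel is farther from the
    camera than that depth by more than eps.
    """
    H, W = depth.shape
    z = centers[:, 2]
    in_front = z > 0
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negatives
    u = np.round(K[0, 0] * centers[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * centers[:, 1] / safe_z + K[1, 2]).astype(int)
    in_image = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    observed = np.zeros_like(z)
    observed[in_image] = depth[v[in_image], u[in_image]]
    return in_image & (observed > 0) & (z > observed + eps)
```

Mask tokens would be created only for the coarse cells where this mask is true; voxels in front of the surface are known to be free space and need no token.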
3D RoPE (Rotary Position Embedding):
- The 3D coordinates are encoded separately into rotation matrices \(R(p^x)\), \(R(p^y)\), and \(R(p^z)\).
- These matrices form a block-diagonal matrix applied to the query/key of each attention layer.
- This is more efficient than learnable relative position embedding (which requires \(N' \times N'\) calculations) and more generalizable than absolute position embedding.
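A minimal NumPy sketch of the block-diagonal rotation follows. It assumes the channel dimension is split into three equal chunks, one per axis, and uses the frequency base 10000 common in RoPE implementations; neither detail is confirmed by the source, and the function names are illustrative.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (N, d) by angles pos * freq."""
    d = x.shape[1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) one frequency per pair
    ang = pos[:, None] * freqs[None, :]         # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, coords):
    """Apply R(p^x), R(p^y), R(p^z) to three equal channel chunks of x (N, D).

    coords is (N, 3). Concatenating the three rotated chunks is equivalent
    to multiplying x by one block-diagonal rotation matrix.
    """
    D = x.shape[1]
    assert D % 6 == 0, "need an even number of channels per axis"
    c = D // 3
    chunks = [rope_1d(x[:, i * c:(i + 1) * c], coords[:, i]) for i in range(3)]
    return np.concatenate(chunks, axis=1)
```

Because each chunk is a pure rotation, the dot product between a rotated query and a rotated key depends only on the coordinate difference between the two tokens. Relative position information therefore comes from \(O(N')\) per-token rotations rather than an explicit \(N' \times N'\) bias table.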
Large-Scale Dataset Construction:
- Over 12k 3D models across 601 classes are selected from Objaverse, supplemented by GSO data.
- BlenderProc is used to perform physical placement and realistic lighting to render 1M images.
- This covers hand-sized objects (4–40 cm), filling the gap in zero-shot multi-object scene completion datasets.
Loss & Training
The training objective sums an L2 loss on normals, an L2 loss on SDF values, and a binary cross-entropy occupancy loss applied at each LoD. Empty voxels are pruned hierarchically to avoid unnecessary computation.
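The objective can be sketched as below. This is only an illustration under assumptions: the loss weights (all 1.0 here), the mean reduction, and the reading of "L2" as squared error are not specified in the source, and the function names are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from logits in a stable form."""
    return np.mean(
        np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    )

def total_loss(occ_logits, occ_targets, sdf_pred, sdf_gt, nrm_pred, nrm_gt,
               w_occ=1.0, w_sdf=1.0, w_nrm=1.0):
    """Occupancy BCE summed over LoDs, plus squared-error SDF and normal terms.

    occ_logits / occ_targets are lists with one array per LoD. The SDF
    (M,) and normal (M, 3) arrays belong to the finest LoD only.
    """
    occ = sum(bce_with_logits(l, t) for l, t in zip(occ_logits, occ_targets))
    sdf = np.mean((sdf_pred - sdf_gt) ** 2)
    nrm = np.mean(np.sum((nrm_pred - nrm_gt) ** 2, axis=1))
    return w_occ * occ + w_sdf * sdf + w_nrm * nrm
```

Hierarchical pruning fits this structure naturally: voxels predicted empty at one LoD are not subdivided, so the occupancy arrays at finer LoDs only contain children of voxels that were kept.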
Key Experimental Results
Main Results
| Method | 3D Representation | Synthetic CD↓ | YCB-V CD↓ | HB CD↓ | HOPE CD↓ |
|---|---|---|---|---|---|
| VoxFormer | Dense | 44.54 | 30.32 | 34.84 | 47.75 |
| ConvONet | Dense | 23.68 | 32.87 | 26.71 | 20.95 |
| MCC | Implicit | 43.37 | 35.85 | 19.59 | 17.53 |
| AICNet | Dense | 15.64 | 12.26 | 11.87 | 11.40 |
| Minkowski | Sparse | 11.47 | 8.04 | 8.81 | 8.56 |
| OCNN | Sparse | 9.05 | 7.10 | 7.02 | 8.05 |
| OctMAE | Sparse | 6.48 | 6.40 | 6.14 | 6.97 |
Ablation Study
Position encoding comparison (synthetic dataset):
| Type | CD↓ | F1↑ | NC↑ |
|---|---|---|---|
| No PE | 11.32 | 0.778 | 0.808 |
| CPE | 9.91 | 0.785 | 0.811 |
| APE | 8.61 | 0.782 | 0.825 |
| RPE | 7.81 | 0.804 | 0.830 |
| RoPE | 6.48 | 0.839 | 0.848 |
3D attention mechanism comparison (HOPE dataset):
| Method | Occlusion Masking | CD↓ | F1↑ |
|---|---|---|---|
| 3D Deformable | ✗ | 12.14 | 0.703 |
| Neighbor Attn | ✗ | 9.26 | 0.727 |
| Octree Attn | ✗ | 7.99 | 0.752 |
| Octree Attn | ✓ | 7.54 | 0.772 |
| Full Attn | ✓ | 6.97 | 0.803 |
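For readers unfamiliar with the CD and F1 columns in the tables above, the sketch below shows one common way to compute them between a predicted and a ground-truth point set. The paper's exact definition (units, squared versus unsquared distances, and the F1 distance threshold) is not given in this summary, so the symmetric mean of unsquared distances and the threshold `tau` are assumptions.

```python
import numpy as np

def nearest_dists(a, b):
    """For each point in a (N, 3), Euclidean distance to its nearest point in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def chamfer_and_f1(pred, gt, tau):
    """Symmetric Chamfer distance and F1 score at distance threshold tau."""
    d_pred_to_gt = nearest_dists(pred, gt)   # drives precision
    d_gt_to_pred = nearest_dists(gt, pred)   # drives recall
    cd = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    if precision + recall == 0:
        return cd, 0.0
    return cd, 2 * precision * recall / (precision + recall)
```

The brute-force pairwise distance matrix is fine for a few thousand points; larger point sets would normally use a spatial index such as a k-d tree.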
Key Findings
- OctMAE achieves state-of-the-art performance across all four datasets and can generalize zero-shot to real scenes even when trained only on synthetic data.
- 3D RoPE contributes substantially: on the synthetic dataset, Chamfer Distance (CD) drops from 11.32 with no position encoding to 6.48 with RoPE.
- The occlusion-masking strategy makes full attention feasible; on HOPE, full attention with occlusion masking reaches a CD of 6.97, versus 12.14 for 3D deformable attention without it.
- Sparse representations (Minkowski, OCNN, OctMAE) outperform the dense and implicit baselines on all four datasets.
- Latent-space 3D MAE is the key to generalization; adding MAE to the same U-Net architecture (Minkowski/OCNN) yields substantial improvements.
Highlights & Insights
- Performing MAE in the latent space is an effective strategy to scale up 3D Transformers, as the token count remains controllable at LoD-5.
- Occlusion masking is an elegant design tailored for scene completion tasks: tokens are placed only where generation is most needed.
- 3D RoPE is a notable choice for 3D vision: it is more efficient than learnable relative position embeddings and generalizes better than absolute ones.
- The creation of the large-scale Objaverse-derived dataset is an independent and valuable contribution to the community.
Limitations & Future Work
- Requires foreground segmentation masks as input.
- Scene completion remains challenging under extreme occlusion (where only a tiny fraction of the object is visible).
- The octree representation may lose details on extremely thin structures like cables.
- Achieves near-real-time performance but is not yet strictly real-time.
Rating
- Novelty: ⭐⭐⭐⭐ — Hybrid Octree + MAE architecture and occlusion-masking strategy.
- Effectiveness: ⭐⭐⭐⭐⭐ — Comprehensive state-of-the-art performance and zero-shot generalization.
- Practicality: ⭐⭐⭐⭐ — Addresses practical needs of robotic scene understanding.
- Recommendation: ⭐⭐⭐⭐⭐