FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos
Conference: ECCV 2024
arXiv: 2403.15161
Code: No public code
Area: 3D Vision
Keywords: CAD model retrieval, 3D alignment, embedding distillation, real-time 3D reconstruction, contrastive learning
TL;DR
FastCAD retrieves and aligns CAD models for all objects in a scene within 50ms. It distills a contrastively learned embedding space into a single-stage detector that directly predicts alignment parameters, making it about 50 times faster than the fastest prior method (SceneCAD, 2.6s) at slightly higher accuracy.
Background & Motivation
Background: Representing 3D environments with aligned CAD models is crucial for downstream tasks like AR and robotics; compared to noisy point clouds/meshes, CAD representations offer advantages such as being hole-free, having clean geometry, and providing object-level annotations.
Limitations of Prior Work: Current SOTA methods are computationally heavy, requiring sequential encoding of detected objects followed by optimization-based CAD alignment in a second stage, with running times ranging from 2.6 seconds to 20 minutes.
Key Challenge: High-precision CAD retrieval and alignment require intensive computation, failing to meet the real-time requirements of downstream applications (AR/robotics).
Goal: To elevate CAD retrieval and alignment speeds to real-time levels while maintaining or even improving retrieval accuracy.
Key Insight: Single-stage direct prediction—simultaneously outputting alignment parameters and shape embeddings to bypass the latency of two-stage sequential pipelines, combined with embedding distillation to avoid using an encoder during inference.
Core Idea: Learning a high-quality CAD embedding space under a contrastive learning framework and distilling it into a single-stage detection network to achieve real-time CAD retrieval and alignment.
Method
Overall Architecture
FastCAD takes a point cloud (from an RGB-D scan or online 3D reconstruction output) as input, encodes it into a feature volume via sparse 3D convolutions, and then predicts for each sampled location \((\hat{x}, \hat{y}, \hat{z})\): classification probability \(\hat{\boldsymbol{p}}\), oriented bounding box (OBB) parameters \(\hat{\boldsymbol{b}}\), front-facing side classification \(\hat{\boldsymbol{f}}\), and shape embedding vector \(\hat{\boldsymbol{w}}\). During inference, \(\hat{\boldsymbol{w}}\) is used to retrieve the nearest neighbor CAD model, while \(\hat{\boldsymbol{b}}\) and \(\hat{\boldsymbol{f}}\) are utilized for alignment.
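The retrieval step described above can be sketched as a nearest-neighbor lookup over precomputed CAD embeddings. This is a minimal illustration, not the authors' implementation: the squared Euclidean metric, the `retrieve_cad` helper, the CAD IDs, and the 3-dimensional toy embeddings are all assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrieve_cad(pred_embedding, cad_embeddings):
    """Return the ID of the CAD model whose embedding is closest to the prediction.

    cad_embeddings maps a (hypothetical) CAD ID to its precomputed embedding.
    """
    return min(
        cad_embeddings,
        key=lambda cad_id: squared_distance(pred_embedding, cad_embeddings[cad_id]),
    )


# Toy database with made-up IDs and values, for illustration only.
cad_db = {
    "chair_a": [0.9, 0.1, 0.0],
    "chair_b": [0.7, 0.6, 0.1],
    "table_a": [0.0, 0.2, 0.9],
}

w_hat = [0.8, 0.2, 0.1]  # predicted shape embedding for one detected object
best = retrieve_cad(w_hat, cad_db)
print(best)  # chair_a (squared distance 0.03, the smallest of the three)
```

Because the CAD embeddings can be computed once offline, inference needs only this lookup per detected object rather than a separate encoder pass.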
Key Designs
- Embedding Distillation:
    - Function: Distills the high-quality shape embeddings obtained from contrastive learning into the detection network, avoiding the need for an independent encoder during inference.
    - Mechanism: First, an independent encoder network is trained using a contrastive learning framework to construct a unified scan-to-CAD embedding space. Then, during the training of FastCAD, the embedding vectors of ground-truth (GT) CAD models are used as supervision signals: \(\mathcal{L}_{\text{emb}}(\hat{\boldsymbol{w}}_i, \boldsymbol{w}_i)\) is formulated as an MSE loss.
    - Design Motivation: Two-stage retrieval (detect a bbox, then encode its crop) suffers from distribution shift: the encoder is trained on crops from GT bboxes but evaluated on crops from predicted bboxes. In the ablation, two-stage retrieval reaches only 51.0% shape accuracy with predicted bboxes (78.1% with GT bboxes), versus 83.1% with distillation. Direct distillation also lets the network use surrounding context and neighborhood information.
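As a rough sketch of the distillation target, the loss for one matched prediction is the mean squared error between the detector's predicted embedding and the frozen encoder's embedding of the GT CAD model. The `mse_loss` helper and the 3-dimensional toy vectors are assumptions made for illustration; a real implementation would operate on batched tensors.

```python
def mse_loss(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical values: w_gt comes from the separately trained (frozen) encoder,
# w_hat from the detection head that is being trained.
w_gt = [1.0, 0.0, -1.0]
w_hat = [0.5, 0.0, -0.5]

loss_emb = mse_loss(w_hat, w_gt)
print(loss_emb)  # (0.25 + 0.0 + 0.25) / 3
```

Only the detection head receives gradients from this term; the encoder is needed to produce targets during training and is not run at inference.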
- Contrastive Embedding Space:
    - Function: Learns a unified embedding space where embedding vectors of noisy scanned objects and clean CAD models can be directly compared.
    - Mechanism: Employs a triplet loss: \(\mathcal{L}_{\text{Contrastive}} = \max(0, d^2(\mathbf{A}, \mathbf{P}) + m - d^2(\mathbf{A}, \mathbf{N}))\), where \(\mathbf{A}\) is the anchor (scanned object), \(\mathbf{P}\) is the positive sample (its corresponding CAD), \(\mathbf{N}\) is the negative sample (a different CAD in the same class), \(d\) is the distance in embedding space, and \(m\) is the margin.
    - Two Auxiliary Tasks:
        - Foreground/Background Segmentation: Supervised with binary cross-entropy, forcing the encoder to learn to distinguish the target object from background noise.
        - Chamfer Distance Prediction: A shallow MLP \(d_\theta\) is trained to predict the Chamfer distance between the positive and negative CAD models from their concatenated embeddings: \(\mathcal{L}_{\text{Chamfer}} = \|d_\theta(\text{cat}(\mathbf{P}, \mathbf{N})) - d_{\text{Chamfer}}(X_{\text{pos}}, X_{\text{neg}})\|_1\), which helps the network learn embeddings that encapsulate shape similarity information.
    - Design Motivation: Contrastive loss alone is insufficient, as negative samples can sometimes be highly similar to positive ones. The auxiliary tasks force the embedding to retain fine-grained shape information. Ablation studies show that the two auxiliary tasks improve shape accuracy from 81.1% to 83.1%.
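The two embedding losses above can be worked through on toy inputs. This is a sketch under stated assumptions: the margin of 0.5, the 2-D embeddings and point sets, the symmetric mean-of-squared-distances form of the Chamfer distance, and the stand-in value for the MLP output are all illustrative choices, not values from the paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin):
    """max(0, d^2(A, P) + m - d^2(A, N)) for a single triplet."""
    return max(0.0, sq_dist(anchor, positive) + margin - sq_dist(anchor, negative))


def chamfer_distance(xs, ys):
    """Symmetric Chamfer distance (assumed form): mean nearest-neighbor
    squared distance from xs to ys plus the same from ys to xs."""
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(xs, ys) + one_way(ys, xs)


anchor = [0.0, 0.0]         # embedding of the scanned object
positive = [0.1, 0.0]       # embedding of its matching CAD model
easy_negative = [1.0, 0.0]  # far away: margin already satisfied
hard_negative = [0.3, 0.0]  # similar shape: still violates the margin

loss_easy = triplet_loss(anchor, positive, easy_negative, margin=0.5)  # 0.0
loss_hard = triplet_loss(anchor, positive, hard_negative, margin=0.5)  # about 0.42

# Auxiliary target: Chamfer distance between the positive and negative CAD point sets.
x_pos = [[0.0, 0.0], [1.0, 0.0]]
x_neg = [[0.0, 0.0], [1.0, 1.0]]
target = chamfer_distance(x_pos, x_neg)  # 0.5 + 0.5 = 1.0

predicted = 0.8  # stand-in for the MLP output d_theta(cat(P, N))
loss_chamfer = abs(predicted - target)  # L1 regression loss
```

The hard negative shows why the auxiliary regression is useful: the triplet term only pushes shapes apart by a fixed margin, while the Chamfer target tells the embedding how different two shapes actually are.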
- Front-Facing Side Prediction:
    - Function: Predicts the orientation of the CAD model inside the bounding box (which of the four sides is the front).
    - Mechanism: Predicts a classification probability \(\hat{\boldsymbol{f}} \in \mathbb{R}^4\) for each oriented bounding box, using the cross-entropy loss \(\mathcal{L}_{\text{ff}}\). Target labels are modified for symmetric objects: for 2-fold symmetry the target becomes \((\frac{1}{2}, 0, \frac{1}{2}, 0)\), and for 4-fold symmetry \((\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})\).
    - Design Motivation: Compared to encoding orientation inside the embedding (which would require storing 4 embeddings per CAD), predicting the front side separately is more accurate (61.7% vs 56.2%) and cuts the number of stored and searched embeddings by a factor of four. Leveraging symmetry annotations yields an additional 1.6% improvement.
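The symmetry-aware targets can be generated mechanically and paired with a cross-entropy that accepts soft labels. The sketch below is an illustration only: the helper names, the convention that side indices advance in 90-degree steps, and the `symmetry_order` argument (1, 2, or 4) are assumptions.

```python
import math


def front_target(front_index, symmetry_order):
    """Target distribution over the 4 box sides.

    symmetry_order is 1 (no symmetry), 2, or 4; probability mass is spread
    evenly over all sides that are equivalent under that symmetry.
    """
    step = 4 // symmetry_order
    target = [0.0] * 4
    for k in range(symmetry_order):
        target[(front_index + k * step) % 4] = 1.0 / symmetry_order
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


print(front_target(0, 1))  # [1.0, 0.0, 0.0, 0.0]
print(front_target(0, 2))  # [0.5, 0.0, 0.5, 0.0]
print(front_target(0, 4))  # [0.25, 0.25, 0.25, 0.25]

# A prediction split evenly between sides 0 and 2 is penalized less for a
# 2-fold symmetric object than for an asymmetric one.
logits = [2.0, 0.0, 2.0, 0.0]
loss_symmetric = soft_cross_entropy(logits, front_target(0, 2))
loss_asymmetric = soft_cross_entropy(logits, front_target(1, 1))
```

With soft targets, the network is no longer forced to pick one of several equally valid fronts, which is the ambiguity the symmetry annotations are meant to remove.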
Loss & Training
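Total loss:

\[\mathcal{L}_{\text{tot}} = \frac{1}{N_{\text{mat}}} \sum_{i=1}^{N_{\text{det}}} \left[ \mathcal{L}_{\text{cls}}(\hat{\boldsymbol{p}}_i, \boldsymbol{p}_i) + \mathbb{1}_i \left( \mathcal{L}_{\text{bb}}(\hat{\boldsymbol{b}}_i, \boldsymbol{b}_i) + \mathcal{L}_{\text{ff}}(\hat{\boldsymbol{f}}_i, \boldsymbol{f}_i) + \mathcal{L}_{\text{emb}}(\hat{\boldsymbol{w}}_i, \boldsymbol{w}_i) \right) \right]\]

Here \(N_{\text{det}}\) counts the predicted locations, \(N_{\text{mat}}\) counts those matched to a GT object, and \(\mathbb{1}_i\) is 1 only for matched locations, so the box, front-facing, and embedding terms apply to matched predictions only.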
- Classification loss \(\mathcal{L}_{\text{cls}}\): focal loss
- Bounding box loss \(\mathcal{L}_{\text{bb}}\): DIoU loss
- Front-facing side \(\mathcal{L}_{\text{ff}}\): cross-entropy
- Shape embedding \(\mathcal{L}_{\text{emb}}\): MSE loss
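How the four terms combine can be sketched as follows, assuming the sum runs over all predicted locations and the indicator selects the matched ones. The per-location loss values are made up, and `total_loss` is a hypothetical helper; the individual terms (focal, DIoU, cross-entropy, MSE) are taken as already computed.

```python
def total_loss(per_location, n_matched):
    """Combine per-location loss terms into the total loss.

    per_location: list of dicts with keys 'cls', 'bb', 'ff', 'emb', 'matched'.
    The classification term applies everywhere; the other three terms apply
    only where 'matched' is True. The sum is normalized by n_matched.
    """
    total = 0.0
    for loc in per_location:
        total += loc["cls"]
        if loc["matched"]:
            total += loc["bb"] + loc["ff"] + loc["emb"]
    return total / max(n_matched, 1)  # guard against scenes with no matches


# Three predicted locations, one of which is matched to a GT object.
locations = [
    {"cls": 0.2, "bb": 0.4, "ff": 0.5, "emb": 0.1, "matched": True},
    {"cls": 0.1, "bb": 0.0, "ff": 0.0, "emb": 0.0, "matched": False},
    {"cls": 0.3, "bb": 0.0, "ff": 0.0, "emb": 0.0, "matched": False},
]
n_matched = sum(loc["matched"] for loc in locations)

loss = total_loss(locations, n_matched)  # (0.6 + 1.0) / 1, about 1.6
```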
Training details: FastCAD is optimized with AdamW (learning rate 1e-3) for 225 epochs. The separately trained contrastive encoder uses the Perceiver architecture with a 256-dimensional embedding and is trained for 750 epochs. For the video setting, a separate version of FastCAD is trained on reconstruction outputs.
Key Experimental Results
Main Results
| Method | Input | Alignment Acc (class avg.) | Alignment Acc (instance avg.) | Inference Time |
|---|---|---|---|---|
| Scan2CAD | RGB-D | 35.6% | 31.7% | 740s |
| SceneCAD | RGB-D | 52.3% | 61.2% | 2.6s |
| FastCAD (Scan) | RGB-D | 52.8% | 61.7% | 50ms |
| RayTran | Video | 36.2% | 43.0% | - |
| FastCAD (Video) | Video | 39.3% | 48.2% | 100ms |
Ablation Study
| Configuration | Align Acc | Recon Acc | Shape Acc | Explanation |
|---|---|---|---|---|
| Two-stage retrieval (pred bbox) | 61.7% | 15.6% | 51.0% | Severe distribution shift |
| Two-stage retrieval (GT bbox) | 61.7% | 30.6% | 78.1% | Even GT bbox is worse than distillation |
| Embedding distillation (final) | 61.7% | 41.7% | 83.1% | Distillation significantly outperforms two-stage |
| Contrastive learning only | 62.3% | 38.3% | 81.1% | Baseline |
| +Chamfer + Segmentation | 61.7% | 41.7% | 83.1% | Auxiliary tasks add 3.4 / 2.0 points (Recon / Shape) |
| PointNet++ encoder | 61.5% | 29.6% | 74.0% | Weak encoder |
| Perceiver encoder | 62.3% | 38.3% | 81.1% | Strong encoder is significantly better |
Key Findings
- With scan input, FastCAD achieves a roughly 50x speedup over SceneCAD (50ms vs. 2.6s) with slightly better alignment accuracy (61.7% vs. 61.2%).
- With video input, alignment accuracy rises from 43.0% (RayTran) to 48.2%, while latency drops from ~3200ms to 100ms (10 FPS, enough for real-time use).
- Reconstruction accuracy improves from 22.9% to 29.6% (compared against Vid2CAD using the same retrieval settings).
- High-quality embedding space: Even when retrieving the 10th nearest neighbor CAD model, shape accuracy remains strong.
- Color information contributes minimally; key information is encoded in the geometric structures.
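The headline ratios follow directly from the latencies quoted above. The short check below only restates that arithmetic; the variable names are illustrative.

```python
scenecad_latency_s = 2.6       # SceneCAD, scan input
fastcad_scan_latency_s = 0.050   # FastCAD, scan input
fastcad_video_latency_s = 0.100  # FastCAD, video input

speedup = scenecad_latency_s / fastcad_scan_latency_s  # about 52, i.e. "50x"
video_fps = 1.0 / fastcad_video_latency_s              # 10 frames per second

print(round(speedup), round(video_fps))  # 52 10
```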
Highlights & Insights
- Embedding distillation is an elegant design choice—it avoids dual-network overhead during inference while solving the distribution shift issue in two-stage pipelines.
- Symmetry-aware front-facing side prediction is an overlooked but highly effective technique that leverages object symmetry annotations to reduce ambiguity.
- The auxiliary tasks (segmentation + Chamfer distance prediction) are well motivated: they force the embedding space not only to separate positive from negative samples but also to encode fine-grained distances between shapes.
- The plug-and-play integration with online 3D reconstruction methods (e.g., DG Recon) highlights the advantages of modular design.
- The newly proposed Scan2CAD reconstruction accuracy and shape accuracy metrics successfully fill an evaluation gap.
Limitations & Future Work
- Poor performance on the "display" class (only 24.1%), caused by inconsistent CAD model orientations in ShapeNet (28 out of 149 have opposite orientations).
- No temporal-consistency mechanism in the video setting, so CAD predictions may be inconsistent across consecutive frames.
- Dependence on the CAD model database—retrieval fails if no structurally similar models are present in the library.
- Evaluated only on ScanNet/Scan2CAD; scene scale and diversity remain limited.
- Generalization performance under open-vocabulary or zero-shot scenarios is left unexplored.
Related Work & Insights
- vs SceneCAD: SceneCAD utilizes scene graphs and support relations in post-processing, yielding comparable accuracy but running 50x slower. FastCAD achieves a superior speed-accuracy trade-off via an end-to-end single-stage design.
- vs ScanNotate: ScanNotate exhaustively renders all CADs for matching; its shape accuracy is close to FastCAD (83.5% vs. 83.1%), but it is four orders of magnitude slower.
- vs RayTran: RayTran directly predicts on 3D volumetric features, resulting in extremely high computational intensity that prevents online deployment. FastCAD enables plug-and-play operation by choosing explicit point clouds as intermediate representations.
- vs Vid2CAD: Vid2CAD's frame-by-frame detection-plus-tracking pipeline is fragile and error-prone, whereas FastCAD is much more robust by adopting a reconstruct-then-detect strategy.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of embedding distillation, auxiliary tasks, and symmetry-aware orientation prediction is cleverly designed, though individual components are not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely complete, featuring new metrics, dual scan/video settings, rigorous ablations, and online incremental evaluations.
- Writing Quality: ⭐⭐⭐⭐ Structured and clear, with rich diagrams and well-justified motivations and design decisions.
- Value: ⭐⭐⭐⭐ Achieving 50x speedup with superior accuracy holds strong potential for real-time CAD reconstruction, and the newly proposed evaluation metrics offer long-term value.