FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos¶

Conference: ECCV 2024
arXiv: 2403.15161
Code: No public code
Area: 3D Vision
Keywords: CAD model retrieval, 3D alignment, embedding distillation, real-time 3D reconstruction, contrastive learning

TL;DR¶

FastCAD retrieves and aligns CAD models for all objects in a scene within 50ms. It does so by distilling a contrastively learned embedding space into a single-stage detector that predicts alignment parameters directly. It is about 50 times faster than existing methods while slightly improving accuracy.

Background & Motivation¶

Background: Representing 3D environments with aligned CAD models is crucial for downstream tasks like AR and robotics; compared to noisy point clouds/meshes, CAD representations offer advantages such as being hole-free, having clean geometry, and providing object-level annotations.

Limitations of Prior Work: Current SOTA methods are computationally heavy, requiring sequential encoding of detected objects followed by optimization-based CAD alignment in a second stage, with running times ranging from 2.6 seconds to 20 minutes.

Key Challenge: High-precision CAD retrieval and alignment require intensive computation, failing to meet the real-time requirements of downstream applications (AR/robotics).

Goal: To elevate CAD retrieval and alignment speeds to real-time levels while maintaining or even improving retrieval accuracy.

Key Insight: Single-stage direct prediction—simultaneously outputting alignment parameters and shape embeddings to bypass the latency of two-stage sequential pipelines, combined with embedding distillation to avoid using an encoder during inference.

Core Idea: Learning a high-quality CAD embedding space under a contrastive learning framework and distilling it into a single-stage detection network to achieve real-time CAD retrieval and alignment.

Method¶

Overall Architecture¶

FastCAD takes a point cloud (from an RGB-D scan or online 3D reconstruction output) as input, encodes it into a feature volume via sparse 3D convolutions, and then predicts for each sampled location $(\hat{x}, \hat{y}, \hat{z})$: classification probability $\hat{\boldsymbol{p}}$, oriented bounding box (OBB) parameters $\hat{\boldsymbol{b}}$, front-facing side classification $\hat{\boldsymbol{f}}$, and shape embedding vector $\hat{\boldsymbol{w}}$. During inference, $\hat{\boldsymbol{w}}$ is used to retrieve the nearest neighbor CAD model, while $\hat{\boldsymbol{b}}$ and $\hat{\boldsymbol{f}}$ are utilized for alignment.
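
The retrieval step at inference can be sketched as a nearest-neighbor lookup over precomputed CAD embeddings. This is a minimal illustration, not the authors' implementation: the function name `retrieve_cad`, the toy 4-D embeddings, the CAD ids, and the use of squared Euclidean distance (chosen to match the $d^2$ in the triplet loss) are all assumptions.

```python
import numpy as np

def retrieve_cad(pred_embeddings, cad_embeddings, cad_ids):
    """Return the id of the nearest CAD embedding for each predicted embedding."""
    # Pairwise squared Euclidean distances, shape (num_detections, num_cads).
    diff = pred_embeddings[:, None, :] - cad_embeddings[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    nearest = np.argmin(dist2, axis=1)
    return [cad_ids[i] for i in nearest]

# Hypothetical database of three CAD embeddings (4-D only for readability;
# the summary reports 256-D embeddings).
cad_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
cad_ids = ["chair_a", "table_b", "sofa_c"]

# Two predicted shape embeddings, one per detected object.
pred = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.8, 0.1],
])
result = retrieve_cad(pred, cad_embeddings, cad_ids)
print(result)  # ['chair_a', 'sofa_c']
```

Because the CAD embeddings are computed once offline, the per-scene cost is a single distance matrix and an `argmin`, with no encoder forward pass per detected object.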

Key Designs¶

Embedding Distillation:
- Function: Distills the high-quality shape embeddings obtained from contrastive learning into the detection network, avoiding the need for an independent encoder during inference.
- Mechanism: First, an independent encoder network is trained using a contrastive learning framework to construct a unified scan-to-CAD embedding space. Then, during the training of FastCAD, the embedding vectors of ground-truth (GT) CAD models are used as supervision signals: $\mathcal{L}_{\text{emb}}(\hat{\boldsymbol{w}}_i, \boldsymbol{w}_i)$ is formulated as an MSE loss.
- Design Motivation: Two-stage retrieval (detecting a bbox, then encoding its crop) suffers from distribution shift: the encoder is trained on crops from GT bboxes but evaluated on crops from predicted bboxes. In the ablation, two-stage shape accuracy is 78.1% with GT bboxes but only 51.0% with predicted bboxes, whereas distillation reaches 83.1%. Direct distillation also lets the network use surrounding context and neighborhood information.
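
The distillation target described above can be sketched as a table lookup followed by an MSE. This is an illustration under assumptions: `teacher_table` stands for embeddings precomputed by the frozen contrastive encoder, and the function name, array shapes, and toy values are hypothetical.

```python
import numpy as np

def distillation_loss(pred_w, gt_cad_index, teacher_table):
    """MSE between predicted embeddings and the teacher embeddings of the GT CADs."""
    # Look up the frozen teacher embedding of each matched GT CAD model.
    target_w = teacher_table[gt_cad_index]  # shape (num_matched, dim)
    return float(np.mean((pred_w - target_w) ** 2))

# Hypothetical teacher embeddings for two CAD models (2-D for readability).
teacher_table = np.array([[1.0, 0.0], [0.0, 1.0]])

# Detector outputs for two matched detections and their GT CAD indices.
pred_w = np.array([[1.0, 0.0], [0.0, 0.0]])
gt_cad_index = np.array([0, 1])

loss = distillation_loss(pred_w, gt_cad_index, teacher_table)
print(loss)  # 0.25
```

The teacher is only needed to fill `teacher_table` before training; at inference the detector emits embeddings on its own.
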

Contrastive Embedding Space:
- Function: Learns a unified embedding space where embedding vectors of noisy scanned objects and clean CAD models can be directly compared.
- Mechanism: Employs a triplet loss: $\mathcal{L}_{\text{Contrastive}} = \max(0, d^2(\mathbf{A}, \mathbf{P}) + m - d^2(\mathbf{A}, \mathbf{N}))$, where $\mathbf{A}$ is the anchor (scanned object), $\mathbf{P}$ is the positive sample (corresponding CAD), and $\mathbf{N}$ is the negative sample (different CAD in the same class).
- Two Auxiliary Tasks:
  - Foreground/Background Segmentation: Supervised with binary cross-entropy, forcing the encoder to learn to distinguish the target object from background noise.
  - Chamfer Distance Prediction: A shallow MLP is trained to predict the Chamfer distance between positive and negative CAD models from their embeddings: $\mathcal{L}_{\text{Chamfer}} = \|d_\theta(\text{cat}(\mathbf{P}, \mathbf{N})) - d_{\text{Chamfer}}(X_{\text{pos}}, X_{\text{neg}})\|_1$, which helps the network learn embeddings that encapsulate shape similarity information.
- Design Motivation: Contrastive loss alone is insufficient, as negative samples can sometimes be highly similar to positive ones. The auxiliary tasks enforce the embedding to retain fine-grained shape information. Ablation studies show that the two auxiliary tasks improve shape accuracy from 81.1% to 83.1%.
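
The triplet loss and the Chamfer regression target above can be sketched in NumPy. The margin value is not given in this summary, so `margin=0.5` is a placeholder. The Chamfer variant used here (squared distances, averaged per direction and summed) is also an assumption, since several conventions exist.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d^2(A, P) + m - d^2(A, N)) with squared Euclidean distance."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_ap + margin - d_an))

def chamfer_distance(x, y):
    """Symmetric Chamfer distance between point sets x (n, 3) and y (m, 3)."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # (n, m)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def chamfer_regression_loss(predicted_distance, x_pos, x_neg):
    """L1 gap between a predicted distance and the true Chamfer distance."""
    return abs(predicted_distance - chamfer_distance(x_pos, x_neg))

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
easy_negative = np.array([2.0, 0.0])  # far from the anchor: no penalty
hard_negative = np.array([1.0, 0.0])  # as close as the positive: penalized

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, hard)  # 0.0 0.5

x_pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
x_neg = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(x_pos, x_neg)
print(cd)  # 1.0
```

The hard-negative case shows why the auxiliary Chamfer target helps: the triplet term only ranks the pair, while the regression term asks the embeddings to carry how different the two shapes actually are.
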

Front-Facing Side Prediction:
- Function: Predicts the orientation of the CAD model inside the bounding box (which of the four sides is the front).
- Mechanism: Predicts a classification probability $\hat{\boldsymbol{f}} \in \mathbb{R}^4$ for each oriented bounding box, using the cross-entropy loss $\mathcal{L}_{\text{ff}}$. Target labels are modified for symmetric objects; for example, a 2-fold symmetry is modified to $(\frac{1}{2}, 0, \frac{1}{2}, 0)$, and a 4-fold symmetry to $(\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})$.
- Design Motivation: The alternative is to encode orientation inside the embedding, which would require storing 4 embeddings per CAD. Predicting the front-facing side separately is more accurate (61.7% vs 56.2%) and cuts the number of stored and searched embeddings by a factor of four. Leveraging symmetry annotations yields an additional 1.6% improvement.
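
The symmetry-aware targets can be sketched as soft labels fed to a cross-entropy. The helper names are hypothetical. The ordering of the four bins (consecutive sides 90° apart, so opposite sides are two indices apart) is an assumption that matches the $(\frac{1}{2}, 0, \frac{1}{2}, 0)$ example above.

```python
import numpy as np

def front_facing_target(front_index, symmetry_order):
    """Soft 4-way target; symmetry_order is 1 (none), 2, or 4."""
    target = np.zeros(4)
    step = 4 // symmetry_order  # index spacing between equivalent sides
    for k in range(symmetry_order):
        target[(front_index + k * step) % 4] = 1.0 / symmetry_order
    return target

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-np.sum(target * log_probs))

two_fold = front_facing_target(0, 2)
four_fold = front_facing_target(1, 4)
print(two_fold)   # [0.5 0.  0.5 0. ]
print(four_fold)  # [0.25 0.25 0.25 0.25]

uniform_loss = soft_cross_entropy(np.zeros(4), four_fold)  # equals log(4)
```

With these targets, a prediction that picks either equivalent side of a 2-fold symmetric object is not pushed toward an arbitrary one of the two.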

Loss & Training¶

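Total loss: $$\mathcal{L}_{\text{tot}} = \frac{1}{N_{\text{mat}}} \sum_{i=1}^{N_{\text{det}}} \left[ \mathcal{L}_{\text{cls}}(\hat{\boldsymbol{p}}_i, \boldsymbol{p}_i) + \mathbb{1}_i \left( \mathcal{L}_{\text{bb}}(\hat{\boldsymbol{b}}_i, \boldsymbol{b}_i) + \mathcal{L}_{\text{ff}}(\hat{\boldsymbol{f}}_i, \boldsymbol{f}_i) + \mathcal{L}_{\text{emb}}(\hat{\boldsymbol{w}}_i, \boldsymbol{w}_i) \right) \right]$$

The sum runs over all $N_{\text{det}}$ predictions. The indicator $\mathbb{1}_i$ restricts the box, front-facing, and embedding terms to predictions matched to a ground-truth object, and $N_{\text{mat}}$ is the number of such matches.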

Classification loss $\mathcal{L}_{\text{cls}}$: focal loss
Bounding box loss $\mathcal{L}_{\text{bb}}$: DIoU loss
Front-facing side $\mathcal{L}_{\text{ff}}$: cross-entropy
Shape embedding $\mathcal{L}_{\text{emb}}$: MSE loss
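
How the per-detection terms combine can be sketched as follows. This covers only the aggregation: the component losses (focal, DIoU, cross-entropy, MSE) are passed in as precomputed arrays. The function name, the toy values, and the guard against zero matches are assumptions.

```python
import numpy as np

def total_loss(cls_loss, bb_loss, ff_loss, emb_loss, matched):
    """Sum classification loss over all detections and the other terms over
    matched detections only, then normalize by the number of matches."""
    matched = matched.astype(float)
    n_mat = max(matched.sum(), 1.0)  # assumption: avoid division by zero
    per_detection = cls_loss + matched * (bb_loss + ff_loss + emb_loss)
    return float(per_detection.sum() / n_mat)

# Three detections; the second is unmatched, so its box, front-facing,
# and embedding losses (set to 9.0 here) must not contribute.
cls_loss = np.array([0.2, 0.1, 0.3])
bb_loss = np.array([0.5, 9.0, 0.1])
ff_loss = np.array([0.4, 9.0, 0.2])
emb_loss = np.array([0.1, 9.0, 0.3])
matched = np.array([1, 0, 1])

loss = total_loss(cls_loss, bb_loss, ff_loss, emb_loss, matched)
print(round(loss, 6))  # 1.1
```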

Training details: FastCAD is optimized with AdamW at a learning rate of 1e-3 for 225 epochs. The embedding encoder uses the Perceiver architecture with a 256-dimensional embedding and is trained for 750 epochs. For video settings, a separate version of FastCAD is trained on reconstruction outputs.

Key Experimental Results¶

Main Results¶

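| Method | Input | Alignment Acc (class avg.) | Alignment Acc (instance avg.) | Inference Time |
|---|---|---|---|---|
| Scan2CAD | RGB-D | 35.6% | 31.7% | 740s |
| SceneCAD | RGB-D | 52.3% | 61.2% | 2.6s |
| FastCAD (Scan) | RGB-D | 52.8% | 61.7% | 50ms |
| RayTran | Video | 36.2% | 43.0% | - |
| FastCAD (Video) | Video | 39.3% | 48.2% | 100ms |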

Ablation Study¶

| Configuration | Align Acc | Recon Acc | Shape Acc | Explanation |
|---|---|---|---|---|
| Two-stage retrieval (pred bbox) | 61.7% | 15.6% | 51.0% | Severe distribution shift |
| Two-stage retrieval (GT bbox) | 61.7% | 30.6% | 78.1% | Even with GT bboxes, worse than distillation |
| Embedding distillation (final) | 61.7% | 41.7% | 83.1% | Distillation significantly outperforms two-stage |
| Contrastive learning only | 62.3% | 38.3% | 81.1% | Baseline |
| +Chamfer + Segmentation | 61.7% | 41.7% | 83.1% | Auxiliary tasks add 3.4 / 2.0 points (Recon / Shape Acc) |
| PointNet++ encoder | 61.5% | 29.6% | 74.0% | Weak encoder |
| Perceiver encoder | 62.3% | 38.3% | 81.1% | Strong encoder is significantly better |

Key Findings¶

FastCAD achieves a roughly 50x speedup (50ms vs. 2.6s for SceneCAD) under scan input with slightly better alignment accuracy (61.7% vs. 61.2%).
Under video input, alignment accuracy improves from 43.0% (RayTran) to 48.2%, while latency drops from ~3200ms to 100ms (10 FPS real-time support).
Reconstruction accuracy improves from 22.9% to 29.6% (compared against Vid2CAD using the same retrieval settings).
High-quality embedding space: Even when retrieving the 10th nearest neighbor CAD model, shape accuracy remains strong.
Color information contributes minimally; key information is encoded in the geometric structures.

Highlights & Insights¶

Embedding distillation is an elegant design choice—it avoids dual-network overhead during inference while solving the distribution shift issue in two-stage pipelines.
Symmetry-aware front-facing side prediction is an overlooked but highly effective technique that leverages object symmetry annotations to reduce ambiguity.
The auxiliary tasks (segmentation + Chamfer distance prediction) are well motivated: they force the embedding space not only to separate positive and negative samples but also to encode fine-grained distances between shapes.
The plug-and-play integration with online 3D reconstruction methods (e.g., DG Recon) highlights the advantages of modular design.
The newly proposed Scan2CAD reconstruction accuracy and shape accuracy metrics successfully fill an evaluation gap.

Limitations & Future Work¶

Poor performance on the "display" class (only 24.1%), caused by inconsistent CAD model orientations in ShapeNet (28 out of 149 have opposite orientations).
Lack of temporal consistency mechanism under video settings, which may cause discontinuous CAD predictions across subsequent frames.
Dependence on the CAD model database—retrieval fails if no structurally similar models are present in the library.
Evaluated only on ScanNet/Scan2CAD; scene scale and diversity remain limited.
Generalization performance under open-vocabulary or zero-shot scenarios is left unexplored.

vs SceneCAD: SceneCAD utilizes scene graphs and support relations in post-processing, yielding comparable accuracy but running 50x slower. FastCAD achieves a superior speed-accuracy trade-off via an end-to-end single-stage design.
vs ScanNotate: ScanNotate exhaustively renders all CADs for matching; its shape accuracy is close to FastCAD (83.5% vs. 83.1%), but it is four orders of magnitude slower.
vs RayTran: RayTran directly predicts on 3D volumetric features, resulting in extremely high computational intensity that prevents online deployment. FastCAD enables plug-and-play operation by choosing explicit point clouds as intermediate representations.
vs Vid2CAD: Vid2CAD's frame-by-frame detection-plus-tracking pipeline is fragile and error-prone, whereas FastCAD is much more robust by adopting a reconstruct-then-detect strategy.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of embedding distillation, auxiliary tasks, and symmetry-aware orientation prediction is cleverly designed, though individual components are not entirely novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely complete, featuring new metrics, dual scan/video settings, rigorous ablations, and online incremental evaluations.
Writing Quality: ⭐⭐⭐⭐ Structured and clear, with rich diagrams and well-justified motivations and design decisions.
Value: ⭐⭐⭐⭐ Achieving 50x speedup with superior accuracy holds strong potential for real-time CAD reconstruction, and the newly proposed evaluation metrics offer long-term value.