Zero-Shot Inexact CAD Model Alignment from a Single Image¶
Conference: ICCV 2025 arXiv: 2507.03292 Code: https://zerocad9d.github.io/ Area: 3D Vision Keywords: CAD alignment, zero-shot, 9-DoF pose estimation, foundation models, NOC
TL;DR¶
A weakly supervised 9-DoF CAD model alignment method that enhances DINOv2 features with geometry awareness and performs dense alignment optimization in Normalized Object Coordinate (NOC) space, enabling zero-shot 3D alignment that requires no pose annotations and generalizes to unseen categories.
Background & Motivation¶
Recovering 3D scene structure from a single image is highly ill-posed (depth ambiguity + heavy occlusion). A practical solution is to retrieve an approximate 3D model from a database and align it with the target object in the image (9-DoF: 6D rigid transformation + 3D anisotropic scaling).
Limitations of existing methods:

- Supervised methods (ROCA, SPARC): require annotated quadruples of RGB + depth + CAD model + 9-DoF pose, and are limited to a fixed set of trained categories.
- Synthetic-data methods (DiffCAD): rely on photorealistic synthetic scenes (3D-FRONT), with restricted categories and domain gaps.
- Foundation-model methods (FoundationPose): designed for 6-DoF, exact-match settings (consistent model texture/shape), and perform poorly on inexact retrieved models.
- Inherent limitations of DINOv2 features: (1) symmetric parts (e.g., left and right chair legs) produce highly similar features and cannot be distinguished; (2) sensitivity to texture variation makes texture-free models hard to handle.
Core insight: Although DINOv2 features cannot directly distinguish symmetric parts, they may implicitly encode latent information sufficient to predict part positions — which can be reorganized via a lightweight adapter.
Method¶
Overall Architecture¶
A coarse-to-fine pose estimation pipeline:

1. Coarse alignment: encode the image and 3D model into a shared feature space, establish 2D–3D correspondences via nearest-neighbor matching, and solve for an initial pose with RANSAC.
2. Fine alignment: perform dense image-level alignment optimization in NOC space, backpropagating through differentiable rendering to refine the pose parameters.
Key Designs¶
- Geometry-aware feature adapter: A lightweight MLP \(E_\theta\) is trained to transform DINOv2 features into geometry-aware features. Training data consists of 9 ShapeNet CAD categories rendered and augmented with a diffusion model (300K images total), optimized with two objectives:
NOC prediction loss, encouraging features to encode 3D positional information:

$$\mathcal{L}_{\text{NOC}} = \frac{1}{n \cdot h \cdot w}\sum_{i=1}^{n}\|D_\phi(E_\theta(\text{DINO}(\mathbf{R}_i))) - \mathbf{N}_i\|_2^2$$
Geometric consistency triplet loss, enforcing cross-view feature consistency for the same part and feature dissimilarity for geometrically distant parts:

$$\mathcal{L}_{\text{triplet}} = \frac{1}{|\mathcal{T}|}\sum_{(\mathbf{a},\mathbf{p},\mathbf{n})\in\mathcal{T}}[d(\mathbf{a},\mathbf{p}) - d(\mathbf{a},\mathbf{n}) + \alpha]_+$$

Positives: points with 3D distance \(\leq \tau_{\text{dist}}^+=0.02\); negatives: points with 3D distance \(\geq \tau_{\text{dist}}^-=0.4\) but feature cosine similarity \(> \tau_{\text{feat}}^-=0.75\) (hard negative mining, specifically targeting the symmetric parts DINOv2 cannot distinguish).
Final features fuse DINOv2 and adapter outputs: \(E_f(\mathbf{I}) = (1-\omega)\cdot\hat{\text{DINO}}(\mathbf{I}) \oplus \omega\cdot\hat{E}_\theta(\text{DINO}(\mathbf{I}))\) (\(\omega=0.5\)).
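As a concrete illustration, here is a minimal numpy sketch of the mining rule and triplet hinge described above (thresholds taken from the paper; function names and the tiny feature/point arrays are illustrative, not the authors' code):

```python
import numpy as np

def mine_triplets(feats, points, t_pos=0.02, t_neg_dist=0.4, t_neg_feat=0.75):
    """Hard negative mining: negatives are far apart in 3D yet similar in feature space."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)       # cosine-normalize
    sim = f @ f.T                                                  # pairwise cosine similarity
    d3d = np.linalg.norm(points[:, None] - points[None], axis=-1)  # pairwise 3D distance
    n = len(feats)
    triplets = []
    for a in range(n):
        pos = np.where((d3d[a] <= t_pos) & (np.arange(n) != a))[0]
        neg = np.where((d3d[a] >= t_neg_dist) & (sim[a] > t_neg_feat))[0]
        triplets += [(a, p, g) for p in pos for g in neg]
    return triplets

def triplet_loss(feats, triplets, margin=0.2):
    """Hinge on feature distances: pull positives closer than negatives by a margin."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    terms = [max(np.linalg.norm(f[a] - f[p]) - np.linalg.norm(f[a] - f[g]) + margin, 0.0)
             for a, p, g in triplets]
    return np.mean(terms) if terms else 0.0

# Toy example: points 0 and 1 are the same part seen twice; point 2 is a distant
# (e.g., symmetric) part whose feature nearly duplicates point 0's.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]])
points = np.array([[0, 0, 0], [0.01, 0, 0], [0.5, 0, 0]], dtype=float)
tri = mine_triplets(feats, points)
```

Note how point 2 is selected as a negative only because its feature mimics point 0's despite the large 3D gap; ordinary random negatives would rarely hit this case.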
- 3D model feature voxel grid: Each CAD model is rendered from 36 viewpoints with 7× augmentation (288 images total); features extracted by \(E_f\) are back-projected into a \(100^3\) voxel grid and averaged across views to form a unified 3D feature representation. Multi-scale downsampling–upsampling smoothing is applied to the grid.
- Dense image alignment optimization (NOC space): The input image is converted to a NOC map \(\mathbf{N}^\mathbf{I}\) via nearest-neighbor matching (each pixel feature is matched to the closest 3D voxel, whose position serves as the NOC value). Three losses are optimized:
    - NOC alignment loss: \(\mathcal{L}_{\text{NOC-A}} = \frac{1}{m}\|\mathbf{M} \odot (\mathbf{N}^\mathbf{I} - \mathbf{N}^t)\|_1\)
    - Silhouette loss: \(\mathcal{L}_{\text{mask}} = \frac{1}{HW}\|\mathbf{S}^\mathbf{I} - \mathbf{S}^t\|_1\) (SAM segmentation + SoftRasterizer differentiable rendering)
    - Depth loss: \(\mathcal{L}_{\text{depth}} = \frac{1}{m}\|\mathbf{M} \odot (\mathbf{D}^\mathbf{I} - \mathbf{D}^t)\|_1\) (DepthAnything metric depth prediction)
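The multi-view voxel-grid averaging can be sketched roughly as follows (numpy; a toy 8³ grid instead of the paper's 100³, and it assumes each feature already comes paired with its normalized 3D point, i.e., back-projection has been done):

```python
import numpy as np

def build_feature_voxel_grid(points, feats, res=8):
    """Average per-view features into a voxel grid over the normalized [0, 1]^3 cube."""
    grid = np.zeros((res, res, res, feats.shape[1]))
    counts = np.zeros((res, res, res, 1))
    idx = np.clip((points * res).astype(int), 0, res - 1)  # voxel index of each 3D point
    for (i, j, k), f in zip(idx, feats):
        grid[i, j, k] += f       # accumulate features landing in the same voxel
        counts[i, j, k] += 1
    return grid / np.maximum(counts, 1)  # per-voxel mean (empty voxels stay zero)
```

Observations of the same voxel from different views are simply averaged; the paper's additional multi-scale down/upsampling smoothing is omitted here.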
Key advantage: The NOC map derived from nearest-neighbor matching is naturally invariant to global translation and scaling in feature space, making it more robust to domain shifts than direct neural network NOC regression.
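This invariance is easy to verify: applying the same global shift and scale to all features leaves nearest-neighbor assignments, and hence the NOC map, unchanged. A small numpy demo with random toy features (sizes and names are illustrative):

```python
import numpy as np

def noc_from_nn(pixel_feats, voxel_feats, voxel_coords):
    """Assign each pixel the NOC coordinate of its nearest voxel in feature space."""
    d = np.linalg.norm(pixel_feats[:, None] - voxel_feats[None], axis=-1)
    return voxel_coords[d.argmin(axis=1)]

rng = np.random.default_rng(0)
pix = rng.normal(size=(5, 4))        # toy pixel features
vox_f = rng.normal(size=(10, 4))     # toy voxel features
vox_c = rng.uniform(size=(10, 3))    # NOC coordinate of each voxel
base = noc_from_nn(pix, vox_f, vox_c)
shifted = noc_from_nn(2.0 * pix + 3.0, 2.0 * vox_f + 3.0, vox_c)  # global scale + shift
assert np.allclose(base, shifted)    # matching is unchanged, so the NOC map is too
```

A regression decoder that maps features directly to NOC values carries no such guarantee under the same feature-space shift.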
Loss & Training¶
- Adapter training: \(\mathcal{L}_{\text{adapter}} = (1-\beta)\mathcal{L}_{\text{NOC}} + \beta\mathcal{L}_{\text{triplet}}\), \(\beta=0.1\)
- 2-layer MLP adapter + 1-layer MLP decoder (decoder discarded after training)
- AdamW optimizer, lr=3e-4, batch=140
- Coarse alignment: RANSAC + 3D–3D correspondence solving via metric depth back-projection
- Fine alignment: Adam, lr=0.005, PyTorch3D differentiable rendering
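The 3D–3D solve inside the RANSAC loop can be sketched with the closed-form Umeyama similarity fit below (numpy; simplified to isotropic scale, whereas the paper estimates 3D anisotropic scale, and the RANSAC sampling loop itself is omitted):

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity fit: dst ~ s * R @ src + t (Umeyama 1991, isotropic scale)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(xd.T @ xs / len(src))   # SVD of the cross-covariance
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                    # guard against reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return R, s, t
```

In a RANSAC wrapper one would repeatedly fit this on minimal correspondence subsets and keep the transform with the most inliers.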
Key Experimental Results¶
Main Results (ScanNet25k, 9-DoF NMS Alignment Accuracy)¶
| Method | Supervision | Bathtub | Chair | Display | Sofa | Table | Avg Cat.↑ | Avg Inst.↑ |
|---|---|---|---|---|---|---|---|---|
| ROCA | Fully supervised | 22.5 | 41.0 | 30.4 | 15.9 | 14.6 | 21.5 | 27.4 |
| SPARC | Fully supervised | 26.7 | 52.6 | 22.5 | 32.7 | 17.7 | 27.3 | 33.9 |
| FoundationPose(9D) | Weakly supervised | 20.0 | 41.8 | 23.6 | 15.0 | 17.5 | 19.2 | 25.7 |
| Ours | Weakly supervised | 16.7 | 49.3 | 24.1 | 38.1 | 16.5 | 23.1 | 30.1 |
The only weakly supervised method to surpass fully supervised ROCA (+1.6% avg. category, +2.7% avg. instance).
Ablation Study (ScanNet25k, Coarse–Fine Alignment Combinations)¶
| Coarse | Fine | Avg Cat.↑ | Avg Inst.↑ |
|---|---|---|---|
| DINOv2 | — | 13.1 | 18.8 |
| DINOv2 | Ours (NOC) | 17.3 | 24.2 |
| Ours | — | 18.3 | 26.0 |
| Ours | FM (feature matching) | 18.3 | 26.1 |
| Ours | Ours (NOC) | 23.1 | 30.1 |
Geometry-aware features improve over DINOv2 by +5.2%/+7.2%; NOC dense optimization improves over no fine alignment by +4.8%/+4.1% and over feature-matching fine alignment (FM) by +4.8%/+4.0%.
Generalization to Unseen Categories on SUN2CAD (20 categories, single-view accuracy)¶
| Method | Supervision | piano | printer | lamp | mug | oven | Avg Cat.↑ | Avg Inst.↑ |
|---|---|---|---|---|---|---|---|---|
| SPARC | Fully supervised | 27.8 | 14.1 | 3.0 | 0.0 | 0.0 | 6.9 | 4.9 |
| DINOv2 | None | 44.4 | 7.6 | 5.4 | 0.0 | 7.1 | 11.6 | 7.3 |
| Ours | Weakly supervised | 50.0 | 25.0 | 6.1 | 10.2 | 57.1 | 24.5 | 17.6 |
Outperforms the strongest baseline by a large margin of +12.7% across 20 unseen categories, surpassing SPARC on 18 of 20 categories.
Key Findings¶
- \(\beta=0.1\) (triplet loss weight at 10%) yields the lowest NOC prediction error — excessive contrastive learning sacrifices positional prediction capacity.
- Feature fusion at \(\omega=0.5\) is optimal — DINOv2 semantic information and adapter geometric information are complementary.
- The three dense alignment losses are complementary: NOC improves scale and rotation, depth improves translation and rotation, and silhouette further boosts performance.
- Nearest-neighbor-based NOC prediction is more robust to domain shifts than direct neural network regression (invariant to global feature space offsets).
Highlights & Insights¶
- Elegant hard negative mining design: Triplet negatives are specifically selected as points with high DINOv2 cosine similarity but large 3D distance, precisely targeting the weaknesses of foundation features on symmetric parts.
- NOC space is better suited for dense alignment than feature space: NOC maps are inherently smooth, renderable via simple rasterization (no network inference required), and nearest-neighbor matching is invariant to global shifts in feature space.
- Strong generalization capability: Training on only 9 categories generalizes to 20 entirely unseen categories, surpassing supervised methods that rely on category priors.
- The proposed SUN2CAD benchmark fills the gap in 9-DoF alignment evaluation for unseen categories.
Limitations & Future Work¶
- Coarse alignment is not robust for heavily occluded or cropped objects (e.g., tables, bathtubs, beds), affecting scale and rotation estimation.
- Performance depends on the quality of SAM segmentation and DepthAnything depth predictions.
- The adapter still requires training on 9 ShapeNet categories — training on larger-scale CAD renderings may further improve performance.
- Building feature voxel grids for each object requires inference over 288 images, limiting real-time applicability.
- SUN2CAD annotations are derived from coarse 3D bounding box alignment followed by manual refinement, resulting in limited annotation precision.
Related Work & Insights¶
- The paradigm of lightweight adapters for adapting foundation model features to specific geometric tasks is transferable to other 3D perception problems.
- The implicit hard negative mining strategy in the triplet loss is applicable to any task requiring discrimination of semantically similar but positionally distinct parts.
- Dense alignment in NOC space avoids the difficulties of RGB texture matching and is applicable to all inexact matching scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Effective combination of feature adaptation and NOC-space optimization)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Multiple baselines, SUN2CAD benchmark, comprehensive ablations, hyperparameter analysis)
- Writing Quality: ⭐⭐⭐⭐ (Clear problem formulation, intuitive pipeline diagrams)
- Value: ⭐⭐⭐⭐⭐ (Zero-shot generalization substantially increases practical deployability)