CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images¶
Authors: Cheng Chen, Xiaohui Zeng, Yuwei Li, Hippolyte Music, Tobias Ritschel, Sanja Fidler, Florian Shkurti
Affiliations: NTU / A*STAR / UT Austin
Conference: CVPR 2025
Background & Motivation¶
Generating editable CAD models from a single image is a long-standing goal in computer vision and computer graphics. Existing methods typically rely on parametric surface fitting or template-based retrieval, but these approaches face severe limitations:
- Discreteness of CAD programs: CAD models consist of a sequence of discrete modeling operations (e.g., extrusion, chamfering, Boolean operations), which traditional continuous optimization methods struggle to handle directly.
- Huge semantic gap from image to CAD: RGB images contain non-geometric information such as textures and illumination, whereas CAD programs focus purely on geometric operations, making the mapping between them extremely challenging.
- Data scarcity: High-quality paired image-CAD data is extremely scarce; existing datasets (such as DeepCAD) contain only CAD programs without corresponding images.
- Hard-to-guarantee generation quality: Generated CAD programs may contain syntax errors or geometric inconsistencies that cause CAD compilers to reject them.
These issues motivate the authors to propose CADCrafter, an Image-to-CAD system based on a latent diffusion model. It bridges the semantic gap between images and CAD programs via a geometry encoder and ensures generation quality using Direct Preference Optimization (DPO) fine-tuning.
Method¶
Overall Architecture¶
CADCrafter adopts a two-stage training strategy:
- Stage 1: Train the VAE and latent diffusion model for CAD sequences.
- Stage 2: Introduce geometric conditions and DPO fine-tuning.
Geometry Encoder¶
To bridge the semantic gap between RGB images and CAD programs, a multimodal geometry encoder is designed:
| Feature Type | Extraction Method | Function |
|---|---|---|
| Depth Map (Depth) | Pre-trained monocular depth estimation | Provides global 3D shape information |
| Normal Map (Normal) | Pre-trained normal estimation | Captures local surface orientation |
| DINO-v2 Semantic Features | Pre-trained DINO-v2 | Provides high-level semantic understanding |
The three types of features are integrated through a Feature Fusion Module and then injected into the cross-attention layers of the diffusion model as conditions.
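The conditioning path can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes the Feature Fusion Module simply concatenates the three token sets, and it uses a single attention head with identity projections. The names `fuse_features`, `cross_attention`, and the token counts and dimension are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_features(depth_tokens, normal_tokens, dino_tokens):
    """Assumed fusion: concatenate the three token sets along the token axis."""
    return depth_tokens + normal_tokens + dino_tokens


def cross_attention(queries, cond_tokens):
    """Single-head cross-attention with identity Q/K/V projections (a simplification).

    Each query (a latent token of the diffusion model) attends over all
    condition tokens and receives a weighted average of them.
    """
    dim = len(cond_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) * scale for k in cond_tokens])
        out = [sum(w_j * tok[d] for w_j, tok in zip(w, cond_tokens)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


rng = random.Random(0)


def rand_tokens(n, dim):
    """Random stand-ins for features from the pre-trained extractors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


DIM = 8  # hypothetical shared feature dimension
depth = rand_tokens(4, DIM)
normal = rand_tokens(4, DIM)
dino = rand_tokens(6, DIM)

c_geo = fuse_features(depth, normal, dino)   # geometric condition
latent_tokens = rand_tokens(3, DIM)          # noisy CAD latent tokens
attended, attn = cross_attention(latent_tokens, c_geo)
print(len(c_geo), len(attended), len(attended[0]))  # 14 3 8
```

A real model would add learned projections, multiple heads, and most likely a learned fusion layer; the sketch only shows how the fused condition enters through attention.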
Latent Diffusion Model¶
A CAD program is represented as a sequence of commands \(S = \{c_1, c_2, \ldots, c_N\}\), where each command \(c_i\) contains an operation type and parameters. The detailed pipeline is as follows:
- VQ-VAE Encoding: Encodes the CAD command sequence into a latent vector \(z = E(S)\).
- Forward Diffusion: \(q(z_t | z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t} z_{t-1}, \beta_t I)\)
- Reverse Denoising: Conditional diffusion model \(p_\theta(z_{t-1} | z_t, c_{geo})\), where \(c_{geo}\) represents the geometric condition.
- Decoding: \(\hat{S} = D(\hat{z}_0)\)
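The forward process above can be sketched in a few lines. Composing the per-step Gaussian gives the standard closed form \(q(z_t | z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t} z_0, (1-\bar{\alpha}_t) I)\) with \(\bar{\alpha}_t = \prod_{s \le t} (1-\beta_s)\), which lets training jump directly to any step. The linear schedule, its endpoints, the step count, and the 4-dimensional latent are assumptions for illustration; the summary does not state the values used in the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng):
    """Draw z_t ~ q(z_t | z_0) using the closed form; also return the noise used."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, noise)]
    return zt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

z0 = [0.5, -1.0, 0.25, 2.0]          # toy stand-in for the latent z = E(S)
rng = random.Random(0)
z_t, eps = q_sample(z0, 999, alpha_bars, rng)

# At the final step almost no signal from z0 remains, so z_t is close to pure noise.
print(alpha_bars[0], alpha_bars[-1])
```

The denoiser \(p_\theta\) is then trained to undo this corruption given \(c_{geo}\), and the decoder \(D\) maps the recovered latent back to a command sequence.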
Direct Preference Optimization (DPO)¶
One of the key innovations is using a CAD compiler as an automatic quality evaluator:
- Positive samples: Generated programs that can be successfully compiled by the CAD compiler.
- Negative samples: Generated programs that fail to compile or exhibit large geometric errors.
- DPO Loss: \(\mathcal{L}_{DPO} = -\log \sigma(\beta (\log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}))\)
- Hyperparameter \(\beta = 20\) controls the strength of preference learning (it is distinct from the noise schedule \(\beta_t\) used in the forward diffusion).
DPO fine-tuning improves the compilation success rate and geometric accuracy of the generated programs; in the ablation study, adding it raises Acc_cmd from 82.1% to 83.23% and lowers Med CD from 0.053 to 0.049.
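The preference labeling and the loss can be sketched as follows. This is an illustration under stated assumptions: `toy_compiles` is a hypothetical stand-in for a real CAD compiler, programs are plain lists of command names, and the log-probabilities are passed in as numbers (how they are obtained for a diffusion model is not described in this summary).

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=20.0):
    """Per-pair DPO loss: -log sigma(beta * (log-ratio of winner - log-ratio of loser))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def toy_compiles(program):
    """Hypothetical validity check; a real pipeline would call a CAD kernel here."""
    return len(program) > 0 and program[-1] == "extrude"


def build_preference_pairs(samples, compiles):
    """Pair every compiling sample (y_w) with every failing one (y_l) for the same input."""
    winners = [s for s in samples if compiles(s)]
    losers = [s for s in samples if not compiles(s)]
    return [(w, l) for w in winners for l in losers]


# Three programs sampled for the same image (toy data).
samples = [
    ["sketch", "extrude"],
    ["sketch"],                                  # never extruded: rejected by the toy check
    ["sketch", "extrude", "sketch", "extrude"],
]
pairs = build_preference_pairs(samples, toy_compiles)

# If the policy matches the reference, the margin is 0 and the loss is log 2.
neutral = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# If the policy favors the winner more than the reference does, the loss shrinks.
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
print(len(pairs), neutral, improved)
```

With \(\beta = 20\), even a small positive margin drives the loss close to zero, so the value of \(\beta\) determines how sharply the model is pushed toward compilable outputs.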
Multi-view to Single-view Distillation¶
Multi-view renderings are used as conditions during training (providing richer geometric information), and knowledge distillation transfers this knowledge to a single-view student so that the model needs only a single-view input at inference:
- Teacher model: Conditioned on multi-view geometric features.
- Student model: Conditioned on single-view geometric features.
- Distillation loss: Aligns the denoising predictions of both models in the latent space.
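A minimal sketch of such a distillation objective is shown below. It assumes both models predict noise for the same noisy latent and that alignment is measured with mean squared error; the combined objective and its weight `lam` are hypothetical, since the summary only states that the two predictions are aligned.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_eps, student_eps):
    """Align the student's denoising prediction with the (frozen) teacher's."""
    return mse(student_eps, teacher_eps)


def student_objective(true_eps, teacher_eps, student_eps, lam=1.0):
    """Assumed total loss: standard denoising term plus weighted distillation term."""
    return mse(student_eps, true_eps) + lam * distillation_loss(teacher_eps, student_eps)


# Toy predictions for one noisy latent z_t.
true_eps = [0.0, 1.0, -1.0, 0.5]
teacher_eps = [0.1, 0.9, -1.1, 0.5]     # multi-view teacher
student_eps = [0.3, 0.7, -0.8, 0.2]     # single-view student

print(distillation_loss(teacher_eps, student_eps))
print(student_objective(true_eps, teacher_eps, student_eps))
```

Only the student is updated with this loss; at inference the teacher and the extra views are no longer needed.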
Key Experimental Results¶
DeepCAD Dataset¶
| Method | Acc_cmd ↑ | Acc_param ↑ | Med CD ↓ | Invalid ↓ |
|---|---|---|---|---|
| DeepCAD | 78.41% | 72.15% | 0.082 | 12.3% |
| SkexGen | 80.56% | 74.33% | 0.067 | 9.8% |
| CADCrafter (Ours) | 83.23% | 77.89% | 0.049 | 4.2% |
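The table reports these metrics without defining them. The sketch below shows one common way to compute two of them, assuming Acc_cmd is the fraction of command types predicted correctly and Med CD is the median Chamfer distance over test shapes. The exact conventions (squared versus unsquared distances, point sampling, normalization) are assumptions and may differ from the paper; Acc_param is omitted because it depends on a tolerance convention not stated here.

```python
import statistics


def command_accuracy(pred_cmds, gt_cmds):
    """Fraction of positions where the predicted command type matches the ground truth."""
    correct = sum(1 for p, g in zip(pred_cmds, gt_cmds) if p == g)
    return correct / len(gt_cmds)


def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbor terms."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)


def median_cd(shape_pairs):
    """Median Chamfer distance over (predicted, ground-truth) point-cloud pairs."""
    return statistics.median(chamfer_distance(a, b) for a, b in shape_pairs)


# Toy data: command sequences and tiny point clouds.
acc = command_accuracy(["line", "arc", "extrude"], ["line", "line", "extrude"])
cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
med = median_cd([
    ([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]),
    ([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]),
    ([(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]),
])
print(acc, cd, med)
```

Reporting the median rather than the mean keeps the number from being dominated by a few badly failed reconstructions, which is why the invalid rate is listed separately.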
Ablation Study¶
| Configuration | Acc_cmd | Med CD |
|---|---|---|
| w/o Geometry Encoder | 76.8% | 0.091 |
| Depth Only | 79.4% | 0.072 |
| Depth + Normal | 81.5% | 0.058 |
| All Features (D+N+DINO) | 82.1% | 0.053 |
| + DPO Fine-tuning | 83.23% | 0.049 |
RealCAD Dataset¶
The authors construct the RealCAD dataset to evaluate performance in real-world scenarios:
- It contains multi-angle captured images of 3D-printed physical models.
- CADCrafter demonstrates robust CAD reconstruction capabilities on real-world images as well.
- Qualitative results show that the generated CAD models maintain sound topological structures and editability.
Highlights & Insights¶
- Geometric conditioning bridge: Effectively bridges the semantic gap between images and CAD programs using a multimodal geometry encoder composed of Depth + Normal + DINO-v2.
- DPO fine-tuning strategy: Uses a CAD compiler as an automatic preference labeler, improving generation quality through DPO without manual annotation.
- Multi-view distillation: Leverages multi-view information during training while requiring only a single-view input during inference.
- RealCAD dataset: The first Image-to-CAD evaluation dataset featuring 3D-printed physical objects.
Limitations & Future Work¶
- Currently, only CAD models formed by extrusion operations are supported, while complex operations such as revolution and sweeping are not yet supported.
- For images with heavy occlusions or complex textures, depth and normal estimations may be inaccurate.
- DPO fine-tuning relies on binary feedback (success/failure) from the CAD compiler, lacking fine-grained quality evaluation.
Related Work & Insights¶
- DeepCAD: Pioneering work in Transformer-based CAD sequence generation.
- SkexGen: CAD generation based on sketch-extrusion separation.
- Point2CAD: CAD reconstruction from point clouds.
- DPO: Preference-alignment method applied to LLMs and diffusion models.