SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

Conference: CVPR 2026 | arXiv: 2603.13098 | Code: None | Area: Others (3D CAD Generation / Multimodal Dataset) | Keywords: CAD dataset, language-driven 3D design, multimodal alignment, parametric modeling, text-to-CAD, encoder-decoder, SolidWorks

TL;DR

This paper presents SldprtNet, a large-scale multimodal CAD dataset comprising 242,000+ industrial parts, where each sample contains four fully aligned modalities: .sldprt/.step 3D models, seven-view composite images, parametric modeling scripts, and natural language descriptions. The authors develop a lossless encoder/decoder toolchain supporting 13 CAD commands, and baseline experiments demonstrate the significant advantage of multimodal input over text-only input for CAD generation tasks.

Background & Motivation

Scarcity of CAD Datasets: Compared to image and text datasets, CAD datasets are extremely small in scale — each sample must be manually created by domain experts using professional software at high cost. This data scarcity directly constrains research progress in semantics-driven CAD modeling.

Missing Modalities in Existing Datasets: Mainstream 3D datasets suffer from severe modality incompleteness. Non-parametric datasets (ModelNet, ShapeNet, Thingi10K) provide only meshes or point clouds, discarding design history and parametric information, and thus cannot support editable modeling. Parametric datasets (ABC, Fusion 360) preserve B-Rep or modeling sequences but lack text annotations and visual information, making them unsuitable for language-driven or cross-modal tasks.

Key Bottleneck in Text-to-CAD: DeepCAD pioneered the formulation of CAD modeling as a sequence generation problem but supports only 2D sketching and extrusion; Text2CAD introduced text guidance but relies on synthetic text, resulting in semantic–geometry alignment gaps and a complete absence of the image modality.

Insights from Multimodal Learning: Works such as CLIP, Flamingo, and BLIP-2 have demonstrated that cross-modal alignment learning is critical for improving generalization and zero-shot reasoning. The CAD domain equally requires a unified multimodal dataset that integrates geometry, visual information, parametric sequences, and natural language to provide richer supervision signals.

Limitations of Existing CAD-LLMs: Recent works including CAD-GPT, CAD-MLLM, CAD-Coder, and CAD-Llama have demonstrated the potential of multimodal and code-driven approaches, yet each has shortcomings — small data scale, limited command coverage, high proportion of synthetic data, lack of executable modeling sequences, or deviation from industrial CAD workflows.

Method

Overall Architecture

The SldprtNet construction pipeline consists of four stages: (1) collecting approximately 680,000 .sldprt industrial part models from three public platforms (GrabCAD, McMaster-Carr, and FreeCAD); (2) filtering to retain 242,000+ high-quality samples whose features fall within 13 representative feature types; (3) generating multi-view images, parametric text representations, and standard-format conversions for each sample via an automated pipeline; and (4) generating natural language descriptions with a multimodal language model, followed by manual verification and alignment correction.

Each sample ultimately contains five aligned artifacts spanning four modalities:

  • .sldprt file: SolidWorks native format encoding the complete feature-tree history
  • .step file: Standard exchange format supporting cross-platform validation
  • Multi-view composite image: Six orthographic views (front/back/left/right/top/bottom) plus one isometric view composited into a single PNG
  • Parametric modeling script (Encoder_txt): Structured text representation containing the feature tree and detailed parameters for each feature
  • Natural language description (Des_txt): Appearance and functional description generated by Qwen2.5-VL-7B
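
For concreteness, here is a minimal sketch of how one aligned sample could be represented in code; the class and field names are hypothetical, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class SldprtNetSample:
    """One aligned SldprtNet sample (hypothetical field names)."""
    sldprt_path: str   # native SolidWorks part, full feature-tree history
    step_path: str     # STEP export for cross-platform validation
    views_png: str     # composite of 6 orthographic views + 1 isometric view
    encoder_txt: str   # parametric modeling script (feature tree + parameters)
    des_txt: str       # natural language description (model-generated, verified)
```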

Key Designs: Encoder and Decoder

Two core tools developed around the SolidWorks COM interface form a lossless bidirectional conversion system:

Encoder (CAD → Text):

  1. Automatically traverses the Feature Tree of a .sldprt file and extracts feature types, names, and parent–child relationships in modeling-history order
  2. Invokes type-specific modules to extract detailed parameters (dimensions, constraints, sketch entities, dependencies, etc.)
  3. Outputs a unified structured text format that is both human-readable and machine-parseable
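
As a rough illustration of step 1, the sketch below walks a part's feature tree over the SolidWorks COM interface using pywin32. OpenDoc, FirstFeature, GetNextFeature, and GetTypeName2 are documented SolidWorks API calls, but this is only a sketch of the traversal idea, not the authors' encoder.

```python
import win32com.client  # pywin32; requires a local SolidWorks installation

def dump_feature_tree(part_path: str) -> list[tuple[str, str]]:
    """Collect (name, type) for every feature, in modeling-history order."""
    sw = win32com.client.Dispatch("SldWorks.Application")
    model = sw.OpenDoc(part_path, 1)  # 1 = swDocPART
    features = []
    feat = model.FirstFeature()
    while feat is not None:
        # e.g. ("Boss-Extrude1", "Extrusion") or ("Sketch1", "ProfileFeature")
        features.append((feat.Name, feat.GetTypeName2()))
        feat = feat.GetNextFeature()
    return features
```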

Decoder (Text → CAD):

  1. Creates a blank .sldprt document to initialize the modeling environment
  2. Parses the Feature Tree from Encoder_txt and incrementally calls the SolidWorks API to reconstruct each feature according to feature order and hierarchy
  3. Ensures geometric and topological consistency with the source model
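
A matching skeleton of the decoder loop, under the assumption that Encoder_txt has already been parsed into an ordered list of feature dictionaries; the per-feature builder calls are elided (FeatureExtrusion3 and FeatureFillet3 are real IFeatureManager methods, but their long argument lists are omitted here).

```python
def rebuild_part(sw, parsed_features: list[dict]):
    """Replay a parsed Encoder_txt feature list onto a blank part document."""
    model = sw.NewPart()  # blank .sldprt from the default part template
    fm = model.FeatureManager
    for feat in parsed_features:  # already in feature-tree order
        if feat["type"] == "Extrusion":
            ...  # fm.FeatureExtrusion3(...) with feat's depth/direction params
        elif feat["type"] == "Fillet":
            ...  # fm.FeatureFillet3(...) with feat's radius/edge selections
        # ... remaining cases for the 13 supported operations
    return model
```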

The encoder/decoder supports 13 CAD operations, including 2D Sketch, Extrusion, Chamfer, Fillet, Linear Pattern, Mirror Pattern, and others — far exceeding the 2 operation types in DeepCAD and substantially expanding the diversity of representable parts.

Natural Language Description Generation

Qwen2.5-VL-7B is used with composite images and parametric scripts as input to generate appearance and functional descriptions of each part. Inference runs on 12 NVIDIA A100 GPUs for 368 GPU-hours to generate descriptions for 242,000+ samples, which are subsequently validated and alignment-corrected by human annotators. The strategy of compositing seven rendered views into a single image effectively reduces input token length and accelerates inference.
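
As an illustration of the compositing step, the sketch below tiles seven rendered views onto one canvas with Pillow; the 4×2 grid layout and the input file list are assumptions, since the paper's exact arrangement is not described here.

```python
from PIL import Image

def composite_views(view_paths: list[str], tile: int = 256) -> Image.Image:
    """Tile 7 views (6 orthographic + 1 isometric) into a single image."""
    assert len(view_paths) == 7
    canvas = Image.new("RGB", (4 * tile, 2 * tile), "white")
    for i, path in enumerate(view_paths):
        view = Image.open(path).convert("RGB").resize((tile, tile))
        canvas.paste(view, ((i % 4) * tile, (i // 4) * tile))
    return canvas  # one image instead of 7 -> far fewer vision tokens
```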

Dataset Complexity Stratification

Models are stratified into four complexity levels based on the number of CAD commands in each part's Feature Tree:

| Complexity Level | Feature Count | Samples | Proportion |
|---|---|---|---|
| Level 1 (Simple) | 1–5 | 93,188 | 38.4% |
| Level 2 (Intermediate) | 6–10 | 78,926 | 32.5% |
| Level 3 (Advanced) | 11–100 | 69,259 | 28.5% |
| Level 4 (Expert) | >100 | 1,234 | 0.5% |

The three primary levels are relatively balanced in sample proportion, ensuring training coverage; Expert-level samples, though few, are essential for evaluating reasoning depth. This stratification also supports curriculum learning strategies — establishing geometric understanding from simple samples before progressively introducing high-complexity samples to improve generalization.
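
The stratification rule is a direct transcription of the table's thresholds:

```python
def complexity_level(feature_count: int) -> int:
    """Map a part's CAD-command count to the paper's four complexity levels."""
    if feature_count <= 5:
        return 1  # Simple
    if feature_count <= 10:
        return 2  # Intermediate
    if feature_count <= 100:
        return 3  # Advanced
    return 4      # Expert
```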

Key Experimental Results

Main Results: Single-Modal vs. Multimodal Baseline Comparison

Qwen2.5-7B (text-only) and Qwen2.5-VL-7B (image + text) are fine-tuned on a 50K-sample subset and evaluated on a test set of 3,644 samples:

| Metric | Qwen2.5-7B (Text-only) | Qwen2.5-VL-7B (Multimodal) | Gain |
|---|---|---|---|
| Exact Match Score | 0.0058 | 0.0099 | +70.7% |
| BLEU Score | 97.1827 | 97.9309 | +0.77% |
| Command-Level F1 | 0.3247 | 0.3670 | +13.0% |
| Tolerance Accuracy | 0.5016 | 0.4630 | −7.7% |
| Partial Match Rate | 0.5554 | 0.6162 | +10.9% |
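
This review does not reproduce the paper's exact metric definitions; one plausible reading of Command-Level F1 is an F1 over multisets of command types, sketched below.

```python
from collections import Counter

def command_f1(pred: list[str], gold: list[str]) -> float:
    """F1 over multisets of CAD command types (an assumed definition)."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0 or not pred or not gold:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```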

Dataset Comparison

| Dataset | # Models | Format | Parametric | Multi-view | Reconstructable | Text Desc. |
|---|---|---|---|---|---|---|
| SldprtNet | 242,606 | Sldprt | ✓ | ✓ | ✓ | ✓ |
| ABC | 1,000,000+ | B-Rep | ✓ | ✗ | ✗ | ✗ |
| ShapeNet | 3,000,000+ | Mesh | ✗ | ✗ | ✗ | ✗ |
| ModelNet | 151,128 | Mesh | ✗ | ✗ | ✗ | ✗ |
| Fusion 360 | 50,000+ | B-Rep | ✓ | ✗ | ✓ | ✗ |
| Thingi10K | 10,000 | Mesh | ✗ | ✗ | ✗ | ✗ |

Key Findings

  1. Multimodal significantly outperforms single-modal: The multimodal model achieves clear improvements on the three core metrics — Exact Match (+70.7%), Command-Level F1 (+13.0%), and Partial Match Rate (+10.9%) — indicating that visual information plays a key role in geometric semantic understanding and modeling logic reasoning.
  2. Counter-intuitive result on Tolerance Accuracy: The text-only model achieves slightly higher parameter tolerance accuracy (0.5016 vs. 0.4630); the authors hypothesize this reflects overfitting to numerical values rather than genuine structured semantic understanding.
  3. Uniqueness of SldprtNet: Among the six datasets compared, SldprtNet is the only one that simultaneously satisfies all four properties: parametric, multi-view, reconstructable, and text description.
  4. 2D Sketch dominates feature distribution: As the foundation for nearly all 3D geometry, 2D Sketch is the most frequently used feature; the high frequency of Chamfer and Fillet reflects the industrial part nature of the dataset.

Highlights & Insights

  1. Closed-loop validation design: The lossless bidirectional encoder–decoder conversion not only addresses data scalability but also provides a structured means of validating model outputs — generated CAD sequences can be reconstructed via the decoder and compared against the original model, enabling automated quality control.
  2. Seven-view compositing strategy: Compositing seven rendered views into a single image is highly practical, substantially reducing the input token count and inference time for multimodal models without sacrificing visual completeness.
  3. Coverage of 13 operations: Compared to DeepCAD's 2 operations, SldprtNet's support for 13 CAD commands greatly increases the diversity of representable parts and brings the dataset closer to real industrial design scenarios.
  4. Native support for curriculum learning: The four-level complexity stratification provides out-of-the-box support for curriculum learning, which may yield significant training efficiency gains for long-sequence CAD generation tasks.

Limitations & Future Work

  1. Overly simple baselines: Only single-modal vs. multimodal variants of the same model are compared; direct comparisons with existing methods such as DeepCAD, Text2CAD, and CAD-GPT are absent, making it difficult to quantify the dataset's actual gains on state-of-the-art methods.
  2. Extremely low Exact Match: The best multimodal model achieves an Exact Match of only 0.0099, indicating that faithfully reproducing complete CAD modeling sequences remains highly challenging, yet failure modes are not analyzed in depth.
  3. Description quality depends on the generative model: Natural language descriptions are generated by Qwen2.5-VL-7B and manually verified, but the coverage rate and standards of verification are not detailed, potentially introducing systematic biases.
  4. Domain limited to industrial parts: The dataset is sourced from industrial parts on platforms such as GrabCAD, limiting coverage of other CAD domains such as architecture and consumer products.
  5. SolidWorks dependency: The encoder/decoder relies on the SolidWorks COM interface, restricting portability to other CAD platforms.
  6. No geometric reconstruction quality evaluation: The paper does not provide quantitative analysis of geometric accuracy for the encoder–decoder round-trip conversion.

Related Work & Takeaways

  • DeepCAD (ICCV 2021): A pioneering work formalizing CAD modeling as sequence generation, but limited to sketch-and-extrude operations; SldprtNet greatly expands command coverage by supporting 13 operations.
  • Text2CAD (NeurIPS 2024): The first text-guided CAD generation work, but it relies on synthetic text and lacks the image modality; SldprtNet addresses this gap through multimodal alignment.
  • ABC Dataset (CVPR 2019): The benchmark million-scale B-Rep CAD dataset, but completely lacking text and visual modalities.
  • CAD-Coder: The first open-source vision–language model for image-to-CadQuery code generation, but its reliance on a lightweight DSL deviates from industrial workflows.
  • The central insight from this work is that multimodal aligned datasets have a direct positive impact on CAD generation quality, with clear gains observable even from relatively simple baselines.

Rating

  • Novelty: ⭐⭐⭐⭐ — As the first large-scale CAD dataset providing complete four-modality alignment (parametric model, multi-view images, modeling script, and natural language description), it addresses a clear gap in the field.
  • Experimental Thoroughness: ⭐⭐⭐ — The multimodal vs. single-modal baseline comparison effectively validates the core claim, but broad comparisons with existing methods and in-depth ablation studies are lacking.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-motivated contributions, and systematic dataset design principles; experimental analysis is insufficiently deep in places.
  • Value: ⭐⭐⭐⭐ — The 242K-scale multimodal CAD dataset and accompanying toolchain offer high practical value to the Text-to-CAD community, though the SolidWorks dependency may limit adoption once the dataset and tools are actually released as open source.