# SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design
- Conference: ICRA 2026 (listed under CVPR 2026 on arXiv)
- arXiv: 2603.13098
- Code: Anonymous repository (referenced in the paper)
- Area: 3D Generation / CAD Modeling / Multimodal Datasets
- Keywords: CAD generation, multimodal dataset, parametric modeling, SolidWorks, language-driven 3D design
## TL;DR
This paper introduces SldprtNet — a large-scale multimodal dataset comprising 242K+ industrial CAD parts, where each sample includes a .sldprt/.step model, a 7-view composite image, a parametric modeling script (supporting lossless encoding/decoding of 13 command types), and a natural language description generated by Qwen2.5-VL. Baseline experiments demonstrate that multimodal input (image + text) outperforms text-only input for CAD generation.
## Background & Motivation
Language-driven CAD modeling (Text-to-CAD) remains in its early stages. Existing CAD datasets face multiple limitations: (1) ShapeNet/ModelNet provide only meshes/point clouds without preserving parametric modeling history; (2) the ABC dataset contains B-Rep representations but lacks text annotations and modeling sequences; (3) Fusion 360 Gallery includes modeling history but is restricted to sketch+extrude operations with no language annotations; (4) DeepCAD/Text2CAD support only 2 command types and rely on synthetically generated text that may not align with the actual geometry. The core problem is the absence of a large-scale multimodal dataset that integrates 3D models, multi-view images, parametric instruction sequences, and natural language descriptions to support Text-to-CAD research.
## Core Problem
How can a CAD dataset be constructed that simultaneously satisfies multimodality, bidirectional representability, semantic annotation, editability, and human readability? How can lossless bidirectional conversion between CAD models and text be achieved to support dataset scalability? And how much does multimodal input actually improve CAD generation performance?
## Method
### Overall Architecture
The data construction pipeline proceeds as follows: (1) collect ~680K .sldprt files from GrabCAD, McMaster-Carr, and FreeCAD; (2) filter down to the 242K+ models whose features fall within the 13 supported feature types; (3) render 7-view images (6 orthographic + 1 isometric) via a SolidWorks macro and composite them into a single PNG; (4) extract parametric text (Encoder_txt) using the Encoder tool; (5) convert each model to .step format; (6) generate natural language descriptions (Des_txt) with Qwen2.5-VL-7B from the composite image and parametric text. Each final sample contains five aligned modalities: .sldprt, .step, .png, Encoder_txt, and Des_txt.
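As a mental model of the result, each sample can be thought of as a record of five aligned files. The sketch below is illustrative only: the field names, file naming, and directory layout are assumptions, not the dataset's documented structure.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SldprtNetSample:
    """One aligned sample (field names are illustrative assumptions)."""
    sldprt: Path       # native SolidWorks part
    step: Path         # neutral B-Rep export
    views_png: Path    # 7-view composite image (6 orthographic + 1 isometric)
    encoder_txt: Path  # parametric modeling script produced by the Encoder
    des_txt: Path      # natural language description from Qwen2.5-VL-7B

def load_sample(root: Path, part_id: str) -> SldprtNetSample:
    # Assumes all five modalities share a common stem under one directory.
    return SldprtNetSample(
        sldprt=root / f"{part_id}.sldprt",
        step=root / f"{part_id}.step",
        views_png=root / f"{part_id}.png",
        encoder_txt=root / f"{part_id}_encoder.txt",
        des_txt=root / f"{part_id}_des.txt",
    )
```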
### Key Designs
- Encoder/Decoder Tool (13 CAD Commands): Implemented via the SolidWorks COM API. The Encoder traverses the Feature Tree of a .sldprt file and extracts, in modeling-history order, each feature's type, name, parent–child relationships, and detailed parameters (dimensions, constraints, sketch entities, etc.), outputting structured, human-readable text. The Decoder reads this text and reconstructs the .sldprt file, closing a lossless round-trip loop (see the traversal sketch after this list). The 13 supported commands, including 2D Sketch, Extrusion, Chamfer, Fillet, Linear Pattern, and Mirror Pattern, far exceed the 2 types (sketch + extrude) supported by DeepCAD and cover the majority of real industrial design operations.
- 7-View Composite Image: Six orthographic views (front, back, left, right, top, bottom) plus one isometric view are merged into a single image, fully capturing the part's 3D geometry. Compared to seven separate images, a single composite substantially reduces the number of input tokens during VLM inference, accelerating processing (see the compositing sketch after this list).
- Natural Language Description via Qwen2.5-VL-7B: The model receives the composite image and parametric text as input and generates a natural language description of the part's appearance and function. Description generation for the full 242K+ samples required 368 GPU-hours on 12 A100 GPUs. The visual encoder lets the model capture geometric details such as hole patterns, contour lines, and aspect ratios, producing descriptions closely aligned with the actual geometry. Manual verification was conducted to ensure quality.
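To make the Encoder's traversal concrete, below is a minimal sketch of walking a part's Feature Tree in modeling-history order through the SolidWorks COM API from Python (via pywin32). This is an illustration under stated assumptions, not the paper's tool; the actual Encoder additionally serializes dimensions, constraints, and sketch entities per feature.

```python
# Minimal sketch: history-order Feature Tree walk via the SolidWorks COM API.
# Requires Windows, a running SolidWorks instance, and pywin32.
import win32com.client

def encode_active_part(out_path: str) -> None:
    # Early binding so SolidWorks API methods use consistent call syntax.
    sw = win32com.client.gencache.EnsureDispatch("SldWorks.Application")
    model = sw.ActiveDoc  # assumes the .sldprt is already open
    lines = []
    feat = model.FirstFeature()  # IModelDoc2::FirstFeature
    while feat is not None:
        # GetTypeName2 returns the internal feature type (e.g. "Extrusion",
        # "Fillet"); Name is the user-visible label in the Feature Tree.
        lines.append(f"{feat.GetTypeName2()}\t{feat.Name}")
        sub = feat.GetFirstSubFeature()  # children, e.g. the driving sketch
        while sub is not None:
            lines.append(f"  {sub.GetTypeName2()}\t{sub.Name}")
            sub = sub.GetNextSubFeature()
        feat = feat.GetNextFeature()  # next feature in modeling-history order
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```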
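The compositing step itself is plain image manipulation. Below is a minimal Pillow sketch, assuming the seven views have already been rendered to equal-sized PNGs by the SolidWorks macro; the 4-column grid layout is an assumption, since the section does not specify the arrangement.

```python
from PIL import Image

def composite_views(view_paths: list[str], out_path: str, cols: int = 4) -> None:
    """Tile seven equal-sized view renders into one PNG (layout assumed)."""
    views = [Image.open(p) for p in view_paths]   # 6 orthographic + 1 isometric
    w, h = views[0].size
    rows = -(-len(views) // cols)                 # ceiling division
    canvas = Image.new("RGB", (cols * w, rows * h), "white")
    for i, im in enumerate(views):
        canvas.paste(im, ((i % cols) * w, (i // cols) * h))
    canvas.save(out_path)
```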
### Loss & Training
Baselines are built by fine-tuning Qwen2.5-7B (text-only) and Qwen2.5-VL-7B (image + text) on a 50K subset. Evaluation metrics are Exact Match Score, BLEU Score, Command-Level F1, Tolerance Accuracy, and Partial Match Rate.
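The section does not reproduce the paper's exact metric definitions, so the sketch below encodes one plausible reading of two of them: Command-Level F1 as a multiset F1 over predicted vs. ground-truth command names, and Tolerance Accuracy as the fraction of aligned numeric parameters within a relative tolerance. Both definitions are assumptions, not the paper's specification.

```python
from collections import Counter

def command_f1(pred_cmds: list[str], gold_cmds: list[str]) -> float:
    """Multiset F1 over command names (assumed definition)."""
    overlap = sum((Counter(pred_cmds) & Counter(gold_cmds)).values())
    if not pred_cmds or not gold_cmds or overlap == 0:
        return 0.0
    p, r = overlap / len(pred_cmds), overlap / len(gold_cmds)
    return 2 * p * r / (p + r)

def tolerance_accuracy(pred: list[float], gold: list[float],
                       tol: float = 0.01) -> float:
    """Fraction of position-aligned numeric parameters within a relative
    tolerance (assumed definition; zip truncates to the shorter sequence)."""
    pairs = list(zip(pred, gold))
    if not pairs:
        return 0.0
    ok = sum(abs(p - g) <= tol * max(abs(g), 1e-9) for p, g in pairs)
    return ok / len(pairs)
```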
## Key Experimental Results
| Metric | Qwen2.5-7B (Text-only) | Qwen2.5-VL-7B (Image + Text) |
|---|---|---|
| Exact Match Score | 0.0058 | 0.0099 |
| BLEU Score | 97.18 | 97.93 |
| Command-Level F1 | 0.3247 | 0.3670 |
| Partial Match Rate | 0.5554 | 0.6162 |
| Tolerance Accuracy | 0.5016 | 0.4630 |
### Ablation Study
- Multimodal > Text-only: Exact Match improves by 71% relative (0.0058 → 0.0099), Command F1 by 13%, and Partial Match by 11%, validating the contribution of the visual modality to understanding geometric semantics and modeling logic.
- Tolerance Accuracy slightly favors text-only: this may reflect the text-only model overfitting to numerical parameters rather than learning structural semantics.
- Reasonable complexity distribution: Level 1 (1–5 features): 93K; Level 2 (6–10): 79K; Level 3 (11–100): 69K; Level 4 (100+): 1.2K. The distribution is suitable for curriculum learning (see the bucketing sketch after this list).
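Since the four levels map directly to feature-count thresholds, bucketing parts (e.g. for a curriculum schedule that trains on simple parts first) is a one-liner. A minimal sketch, using the thresholds as stated above:

```python
def complexity_level(num_features: int) -> int:
    """Map a part's feature count to the four stated complexity levels."""
    if num_features <= 5:
        return 1   # Level 1: 1-5 features (93K parts)
    if num_features <= 10:
        return 2   # Level 2: 6-10 features (79K parts)
    if num_features <= 100:
        return 3   # Level 3: 11-100 features (69K parts)
    return 4       # Level 4: 100+ features (1.2K parts)
```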
## Comparison with Existing Datasets
| Property | SldprtNet | ABC | ShapeNet | DeepCAD | Text2CAD |
|---|---|---|---|---|---|
| Scale | 242K | 1M+ | 3M+ | 170K | 170K |
| Parametric | ✓ | ✓ | × | ✓ | ✓ |
| Multi-view | ✓ | × | ✓ | × | × |
| Reconstructable | ✓ | × | × | × | × |
| Natural Language | ✓ | × | × | × | ✓ (synthetic) |
| Command Types | 13 | — | — | 2 | 2 |
## Highlights & Insights
- The Encoder/Decoder closed loop is the key contribution: Lossless bidirectional conversion between CAD and text makes the dataset extensible and supports verification of generated outputs. The paradigm of using "code as intermediate representation" is likely to become increasingly important in the Text-to-3D field.
- Engineering trick of compositing 7 views to reduce tokens: Simple yet effective — merging multiple views into a single image avoids token explosion from multi-image inputs during VLM inference.
- Near-zero Exact Match scores (0.006–0.01) highlight the difficulty of CAD generation: Even on structured, deterministic CAD command generation tasks, current LLMs achieve very low exact match rates, indicating substantial room for improvement in this area.
- Real industrial parts vs. synthetic data: Compared to datasets such as Omni-CAD that rely on synthetic data, SldprtNet is sourced from real engineering components on GrabCAD and better reflects actual design requirements.
## Limitations & Future Work
- Insufficient evaluation metrics: No 3D geometry-level evaluation (e.g., Chamfer Distance, IoU) is included; generation quality is assessed solely through text matching, which does not directly measure the geometric fidelity of generated models.
- Only a 50K subset is used for training: The effect of full 242K-scale training remains unknown.
- SolidWorks dependency: The Encoder/Decoder is implemented via the SolidWorks COM API, binding it to proprietary software and hindering reproducibility and extension by the open-source community.
- VLM-generated natural language descriptions: Although human verification is claimed, the feasibility of manually verifying 242K samples is questionable.
- Conference classification ambiguity: The paper is labeled as ICRA 2026 but appears in the CVPR 2026 list, which may be a categorization error.
## Related Work & Insights
- vs. DeepCAD/Text2CAD: The core gap lies in command diversity (13 vs. 2) and multimodal alignment. DeepCAD supports only sketch+extrude, whereas SldprtNet covers fillet, chamfer, pattern, and other common industrial operations, more closely reflecting real design practice.
- vs. CAD-GPT/CAD-MLLM/CAD-Coder: These are model-side contributions; SldprtNet is a dataset-side contribution — the two are complementary. SldprtNet's encoder/decoder can provide richer training data for these models.
- vs. ABC: ABC contains 1M+ B-Rep models but lacks text annotations and modeling sequences, making it unsuitable for language-driven generation.
## Rating
- Novelty: ⭐⭐⭐ The core contribution is the dataset and tooling; technical innovation is limited, but the work fills an important gap.
- Experimental Thoroughness: ⭐⭐⭐ Only a baseline comparison (text-only vs. multimodal) is presented; comparisons with other CAD generation methods and 3D-level evaluation are absent.
- Writing Quality: ⭐⭐⭐⭐ Dataset description is comprehensive and analysis is systematic, though some cited works (e.g., NeRF, Swin Transformer) are only tangentially related to CAD datasets.
- Value: ⭐⭐ CAD generation is not a core focus area, but the methodology for multimodal dataset construction is a useful reference.