ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph¶
Conference: CVPR 2026 arXiv: 2603.09266 Code: GitHub Area: 3D Vision Keywords: Text-to-3D, Industrial 3D Generation, LoRA Distillation, Hypergraph Geometric Consistency, 3D Gaussian Splatting
TL;DR¶
This paper proposes ForgeDreamer, a framework that adapts text-to-3D generation to industrial semantics via multi-expert LoRA teacher-student distillation and enforces high-order geometric consistency through cross-view hypergraph enhancement, outperforming existing methods on industrial text-to-3D generation tasks.
Background & Motivation¶
Text-to-3D generation methods (e.g., DreamFusion, ProlificDreamer) have achieved notable progress on natural scenes, but face two critical bottlenecks in industrial applications:
Domain Adaptation Challenge: Pre-trained diffusion models are trained on natural scenes and lack sufficient semantic understanding of industrial components (screws, nuts, electronic parts, etc.). Conventional LoRA fusion strategies introduce knowledge interference when merging multiple category-specific adapters.
Insufficient Geometric Reasoning: Existing methods rely on pairwise consistency constraints and cannot capture high-order structural dependencies required by precision industrial manufacturing, leading to artifacts in details such as thread textures and connector interfaces.
Existing industrial 3D datasets (e.g., MVTec 3D-AD, Real-IAD) have limited viewpoints and inconsistent imaging conditions, making them unsuitable for text-to-3D generation. The authors therefore construct a controlled multi-view industrial dataset.
Method¶
Overall Architecture¶
ForgeDreamer is built upon 3D Gaussian Splatting and features two jointly optimized core modules: the Multi-Expert LoRA Ensemble, which improves semantic understanding, followed by the Cross-View Hypergraph module, which enhances geometric precision. The total training objective combines the ISM loss with the MVHG loss described below.
Key Designs¶
- Multi-Expert LoRA Teacher-Student Distillation Framework: Separate LoRA experts (Teachers) are trained for each category of industrial components, and their knowledge is integrated into a unified student model via two-stage teacher-student distillation.
- Stage 1: Only the student text encoder is trained while the UNet is frozen, preventing catastrophic forgetting. The loss includes text feature alignment \(\mathcal{L}_{\text{text}} = \sum_l \alpha_l \cdot \text{MSE}(\text{Pool}(\boldsymbol{f}_T^l), \text{Pool}(\boldsymbol{f}_S^l))\) and a noise prediction loss.
- Stage 2: Both the text encoder and UNet are optimized jointly, alternating between noise prediction and feature alignment, with UNet feature distillation \(\mathcal{L}_{\text{unet}} = \sum_m \beta_m \cdot \text{MSE}(\boldsymbol{u}_T^m, \boldsymbol{u}_S^m)\).
- A round-robin strategy ensures balanced knowledge transfer from all Teachers.
- Design Motivation: Naive additive fusion \(\boldsymbol{W}_{\text{combined}} = \boldsymbol{W}_{\text{base}} + \sum_i \boldsymbol{W}_{\text{LoRA}}^{(i)}\) causes knowledge interference; distillation instead learns a common feature space compatible with all expert knowledge.
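The Stage-1 text-feature alignment term can be sketched as follows. This is a minimal illustration, assuming mean pooling over the token dimension as the `Pool` operator and a `(batch, tokens, dim)` feature layout; neither is specified in the paper:

```python
import torch
import torch.nn.functional as F

def text_alignment_loss(teacher_feats, student_feats, alphas):
    """Weighted per-layer MSE between pooled teacher and student text
    features, mirroring L_text. Mean pooling over the token axis is an
    assumed choice for the paper's Pool operator."""
    loss = torch.zeros(())
    for alpha, f_t, f_s in zip(alphas, teacher_feats, student_feats):
        # Pool each (batch, tokens, dim) feature map over the token axis,
        # then compare teacher and student with MSE, weighted by alpha_l
        loss = loss + alpha * F.mse_loss(f_t.mean(dim=1), f_s.mean(dim=1))
    return loss
```

During Stage 1 this loss would be backpropagated only into the student text encoder, with the UNet frozen as the bullets above describe.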
- Cross-View Hypergraph Geometric Enhancement (CVGCM): Geometric consistency is formulated as a hypergraph learning problem to capture high-order structural dependencies across multiple views.
- Multi-view latent representations \(\boldsymbol{Z} = \{\boldsymbol{z}^{(i)} \in \mathbb{R}^{H \times W \times C}\}_{i=1}^N\) are flattened and concatenated into a node feature matrix \(\boldsymbol{F} \in \mathbb{R}^{(N \cdot H \cdot W) \times C}\).
- A hypergraph \(\mathcal{H} = (\mathcal{V}, \mathcal{E})\) is constructed based on cosine feature similarity, where each hyperedge connects the TopK most similar nodes: \(e_i = \{v_j : v_j \in \text{TopK}(\text{sim}(\boldsymbol{f}_i, \boldsymbol{f}_j), k)\}\).
- A Hypergraph Neural Network performs message passing aggregation: \(\boldsymbol{h}_v^{(l+1)} = \sigma(\boldsymbol{W}^{(l)} \sum_{e \in \mathcal{E}(v)} \frac{1}{|\mathcal{E}(v)|} \text{AGG}(\{\boldsymbol{h}_u^{(l)} : u \in e\}))\).
- Design Motivation: Traditional pairwise constraints (e.g., interval score matching in ISM) only model two-way relations and cannot capture the high-order structural relationships requiring simultaneous multi-view consistency for industrial components.
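The hyperedge construction and one message-passing step above can be sketched as follows. The value of k, the ReLU nonlinearity, and the simplification that each node aggregates only over the single hyperedge it anchors are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def build_hyperedges(nodes, k=8):
    """TopK cosine-similarity hyperedges over flattened multi-view features.
    nodes: (N*H*W, C) node feature matrix F; returns a (num_nodes, k) index
    tensor where row i lists the members of hyperedge e_i."""
    normed = F.normalize(nodes, dim=-1)
    sim = normed @ normed.T              # pairwise cosine similarity
    _, members = sim.topk(k, dim=-1)     # each node anchors one hyperedge
    return members

def hgnn_step(h, members, weight):
    """One simplified HGNN layer: mean-aggregate features within each node's
    anchored hyperedge, then apply a linear map and a nonlinearity."""
    agg = h[members].mean(dim=1)         # (num_nodes, C) per-hyperedge mean
    return torch.relu(agg @ weight)
```

Because hyperedges span pixels from different rendered views, a single aggregation step mixes evidence from many viewpoints at once, which is what distinguishes this from pairwise consistency terms.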
- HSV Mask-Guided MVHG Loss: After hypergraph processing, an HSV mask restricts the loss to the target object region, computed in the cross-view feature space: \(\mathcal{L}_{\text{MVHG}} = \frac{1}{|\mathcal{M}|} \sum_{(h,w) \in \mathcal{M}} \|\boldsymbol{F}_z^{\text{masked}}[h,w,:] - \boldsymbol{F}_{\text{pred}}^{\text{masked}}[h,w,:]\|_2^2\)
Loss & Training¶
- Distillation training follows a two-stage strategy: Stage 1 stabilizes the semantic foundation; Stage 2 performs joint optimization.
- The 3D generation stage employs joint optimization with both ISM and MVHG losses.
- At inference time, the pipeline iterates: multi-view rendering → CVGCM processing → update of 3DGS parameters.
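The generation loop just described might look like the following skeleton, where `renderer`, `cvgcm_loss`, and the camera set are hypothetical placeholders for the paper's components:

```python
import torch

def generate(gaussian_params, renderer, cvgcm_loss, cameras,
             num_iters=100, lr=1e-2):
    """Iterative 3DGS optimization: multi-view rendering -> CVGCM
    processing -> parameter update. All callables are stand-ins."""
    optimizer = torch.optim.Adam(gaussian_params, lr=lr)
    for _ in range(num_iters):
        # Render the current Gaussians from every camera in the batch
        views = [renderer(gaussian_params, cam) for cam in cameras]
        loss = cvgcm_loss(views)       # combined ISM + MVHG objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return gaussian_params
```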
Key Experimental Results¶
Main Results¶
The custom industrial dataset comprises 10 categories (6 mechanical parts + 4 electronic components), with 20 multi-view high-resolution images per category.
| Method | Avg. Time (min) | Avg. T3Bench Quality Score |
|---|---|---|
| ProlificDreamer (w/o LoRA) | ~600 | 25.13 |
| DreamFusion (w/o LoRA) | 360 | 41.91 |
| DreamFusion (w/ LoRA) | 360 | 44.83 |
| RichDreamer (w/o LoRA) | 120 | 28.27 |
| LucidDreamer (w/o LoRA) | 110 | 47.10 |
| LucidDreamer (w/ LoRA) | 110 | 46.75 |
| ForgeDreamer (Ours) | 190 | 50.88 |
Ablation Study¶
| Configuration | 2 LoRAs | 4 LoRAs | 6 LoRAs | Note |
|---|---|---|---|---|
| Addition Fusion | 0.938 | 0.814 | 0.633 | CLIP cosine similarity drops sharply as LoRA count increases |
| Distillation Fusion | 0.965 | 0.949 | 0.952 | Distillation maintains stable concept retention |
Key Findings¶
- Distillation fusion maintains concept retention scores above 0.95 as the number of LoRAs increases, while additive fusion degrades to 0.633.
- The MVHG loss significantly improves geometric fidelity and spatial consistency, eliminating cross-view topological inconsistencies and fine-structure distortions.
- The combination of distilled LoRA and MVHG loss yields the best results, with the two components working synergistically.
Highlights & Insights¶
- From Pairwise to Higher-Order: Elevating geometric consistency from pairwise constraints to higher-order hypergraph constraints represents an elegant paradigm shift.
- Distillation over Addition: The teacher-student distillation strategy for multiple LoRAs resolves knowledge interference more effectively than simple additive fusion.
- Semantics First: The progressive design logic—first improving semantic understanding, then refining geometric precision—is conceptually clear and well-motivated.
Limitations & Future Work¶
- The custom dataset is relatively small (only 20 images per category), and generalization remains to be validated.
- The 190-minute generation time is still substantial and requires further acceleration for practical industrial deployment.
- Hypergraph construction based on TopK feature similarity may fail for views with highly disparate appearances.
- Validation is limited to industrial scenes; the impact on natural scene generation has not been explored.
Related Work & Insights¶
- The SDS/ISM objectives from DreamFusion/LucidDreamer form the foundational methodology for text-to-3D, upon which this paper builds systematic improvements for industrial settings.
- The application of hypergraph neural networks to 3D generation is noteworthy; Hyper-3DG explores a similar direction.
- The multi-LoRA distillation framework may also be applicable to other generative tasks requiring adaptation across multiple domains.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of hypergraph geometric consistency and multi-expert LoRA distillation is original.
- Experimental Thoroughness: ⭐⭐⭐ Dataset scale is limited, and comparison with more baselines is lacking.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear with well-structured logical progression.
- Value: ⭐⭐⭐ Industrial 3D generation is a worthwhile direction, though the application scope is relatively narrow.