Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
Conference: CVPR 2025
arXiv: 2503.01370
Code: Project Page
Area: 3D Vision
Keywords: 3D generation, 3D Bundle Image, Flux, LoRA, ControlNet, text-to-3D, mesh enhancement
TL;DR
Formulates 3D asset generation as a 2D image generation task: the Flux DiT model is fine-tuned to generate a "3D Bundle Image" (a collage of four-view RGB and normal maps), a 3D mesh is reconstructed from that image via ISOMER, and ControlNet extends the pipeline to 3D enhancement and editing.
Background & Motivation
Background: 3D content generation methods are divided into optimization-based approaches (such as the DreamFusion family, which are slow but general) and direct generation approaches (such as InstantMesh and CraftsMan, which are fast but rely on large-scale 3D data).
Limitations of Prior Work:
- Optimization-based methods are time-consuming and prone to the Janus problem.
- Direct generation methods rely heavily on 3D training data, yet 70% of the 10M samples in Objaverse-XL are of poor quality.
- The scale of 2D data (billions of images in LAION-5B) vastly exceeds that of 3D data, yet the 3D priors of 2D diffusion models remain underutilized.
- Existing 2D-to-3D methods (such as the Switcher-style separate RGB/normal generation) modify the input/output structure of pre-trained models, thereby weakening their generalization capability.
Key Challenge: The scarcity of high-quality 3D data vs. the rich image priors possessed by 2D diffusion models. How can the knowledge of pre-trained 2D models be maximally repurposed for 3D generation?
Goal: To redirect 2D diffusion models to 3D generation in the simplest manner, while preserving their native generalization capabilities and compatibility with technologies like ControlNet.
Method
Overall Architecture
- Data Preparation: Renders 3D objects into four-view RGB and normal maps, combining them into a single "3D Bundle Image".
- Kiss3DGen-Base: Fine-tunes the Flux model using LoRA to generate the 3D Bundle Image.
- 3D Reconstruction: Reconstructs a textured mesh from the 3D Bundle Image using ISOMER.
- Kiss3DGen-ControlNet: Extends ControlNet to support 3D enhancement, editing, and image-to-3D.
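The data-preparation step above can be illustrated with a small sketch. It assumes a 2-row by 4-column layout (RGB views on top, normal maps below) and 512×512 tiles; the layout and the helper names `make_bundle_image` and `split_bundle_image` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def make_bundle_image(rgb_views, normal_views):
    """Tile N RGB views over N normal views into one (2H, N*W, 3) image.

    Assumed layout: top row = RGB views, bottom row = normal maps.
    """
    if len(rgb_views) != len(normal_views):
        raise ValueError("need one normal map per RGB view")
    rgb_row = np.concatenate(rgb_views, axis=1)        # views side by side
    normal_row = np.concatenate(normal_views, axis=1)
    return np.concatenate([rgb_row, normal_row], axis=0)

def split_bundle_image(bundle, num_views=4):
    """Inverse of make_bundle_image: recover the per-view tiles."""
    rgb_row, normal_row = np.split(bundle, 2, axis=0)
    return (np.split(rgb_row, num_views, axis=1),
            np.split(normal_row, num_views, axis=1))

# Placeholder tiles standing in for rendered views
rgb = [np.full((512, 512, 3), i, dtype=np.uint8) for i in range(4)]
normals = [np.full((512, 512, 3), 100 + i, dtype=np.uint8) for i in range(4)]

bundle = make_bundle_image(rgb, normals)
print(bundle.shape)  # (1024, 2048, 3)
```

Because the result is an ordinary image array, it can be fed to an image diffusion model unchanged, and the split helper recovers the eight tiles that a multi-view reconstruction step would consume.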
Key Designs
1. 3D Bundle Image Representation
   - Function: Renders a 3D object into RGB and normal maps from 4 orthogonal views (spaced 90° apart in azimuth, at 5° elevation), collated into a single 2D image.
   - Mechanism: The 3D Bundle Image is essentially a 2D image, inherently compatible with the input and output structures of pre-trained diffusion models. The attention blocks of DiT naturally excel at capturing long-range dependencies across different views and between RGB and normal maps.
   - Design Motivation: Compared to the Switcher mechanism (generating RGB and normal maps separately), the 3D Bundle Image ensures RGB-normal consistency within a single-pass generation. Ablation studies confirm that the Switcher mechanism fails to maintain consistency between the two modalities.
2. GPT-4V Caption Tagging
   - Function: Uses GPT-4V to generate detailed text descriptions for the RGB portion of each 3D Bundle Image, including color, shape, and surface properties.
   - Mechanism: Rich text descriptions provide additional semantic supervision signals, allowing the model to learn the correspondences between text and 3D geometry/appearance.
   - Design Motivation: Preserves the text-conditional generation capabilities of text-to-image models, which is foundational to text-to-3D, while allowing the model to leverage text-image alignment knowledge learned during Flux pre-training.
3. ControlNet Extension
   - Function: Trains ControlNet-Tile and ControlNet-Normal/Canny for 3D enhancement and editing. It introduces two hyperparameters: \(\lambda_1\) (ControlNet strength) and \(\lambda_2\) (ratio of effective steps).
   - Mechanism: Low-quality mesh \(\rightarrow\) rendering the 3D Bundle Image \(\rightarrow\) ControlNet enhancement \(\rightarrow\) ISOMER reconstruction. Descriptions are generated automatically by Florence-2 during enhancement and are defined by the user during editing.
   - Design Motivation: Because Kiss3DGen is inherently a diffusion model, it is naturally compatible with various diffusion techniques (ControlNet, SDEdit, etc.) without requiring architectural modifications.
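One plausible reading of the two ControlNet hyperparameters is sketched below: \(\lambda_1\) scales the control signal, and \(\lambda_2\) is the fraction of denoising steps during which that signal is applied. Applying control in the early steps and switching it off afterwards is an assumption here, as is the helper name `control_scale_schedule`; the note does not spell out the exact schedule.

```python
def control_scale_schedule(num_steps, lambda1, lambda2):
    """Per-step ControlNet scale.

    Assumption: control is active with strength lambda1 for the first
    lambda2 fraction of the denoising steps and disabled (0.0) afterwards.
    """
    if not 0.0 <= lambda2 <= 1.0:
        raise ValueError("lambda2 must lie in [0, 1]")
    active_steps = round(lambda2 * num_steps)
    return [lambda1 if step < active_steps else 0.0
            for step in range(num_steps)]

schedule = control_scale_schedule(num_steps=10, lambda1=0.6, lambda2=0.5)
print(schedule)
```

Under this reading, a larger \(\lambda_1\) or \(\lambda_2\) keeps the output closer to the rendered input (useful for enhancement), while smaller values leave the model more freedom to follow the text prompt (useful for editing).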
Loss & Training
- Model: Flux.1-dev + LoRA (rank=128)
- Data: 147K high-quality 3D objects (curated from Objaverse with manually corrected orientations), plus an optional set of 4K animated character models
- Training: 8× A800 80GB, 3 days, 16 epochs, batch=4, LR=\(8\times10^{-4}\), bf16 precision
- Rendering: Blender, camera distance 4.5, FoV 30°, resolution 512×512
- Inference: Generates the 3D Bundle Image first, followed by LRM initialization + ISOMER optimization to obtain the mesh
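The rendering setup listed above (four views 90° apart, 5° elevation, camera distance 4.5, FoV 30°) can be turned into camera positions with basic trigonometry. The sketch below assumes a z-up coordinate frame, a first view at azimuth 0°, cameras aimed at the origin, and that the 30° value is the full field-of-view angle; none of these conventions are stated in the note.

```python
import math

def camera_position(azimuth_deg, elevation_deg, distance):
    """Camera location on a sphere around the origin (z up, assumed)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        distance * math.cos(el) * math.cos(az),
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
    )

DISTANCE, ELEVATION, FOV = 4.5, 5.0, 30.0
AZIMUTHS = (0, 90, 180, 270)  # assumed starting azimuth of 0 degrees

cameras = [camera_position(az, ELEVATION, DISTANCE) for az in AZIMUTHS]
for az, pos in zip(AZIMUTHS, cameras):
    print(az, tuple(round(c, 3) for c in pos))

# Width of the region seen at the object's center by a perspective camera
visible_extent = 2 * DISTANCE * math.tan(math.radians(FOV / 2))
print(round(visible_extent, 3))
```

The visible extent of roughly 2.4 units suggests that objects normalized to about a unit-sized bounding volume fit comfortably inside each 512×512 frame.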
Key Experimental Results
Main Results: Text-to-3D
| Method | Data Size | CLIP↑ | Quality↑ | Aesthetic↑ |
|---|---|---|---|---|
| 3DTopia | 320K | 0.694 | 2.145 | 1.538 |
| Direct2.5 | 500K | 0.773 | 2.158 | 1.459 |
| Hunyuan3D-1.0 | N/A | 0.792 | 2.517 | 1.504 |
| Kiss3DGen-Base | 147K | 0.837 | 2.700 | 1.800 |
| Kiss3DGen-50K | 50K | 0.804 | 2.716 | 1.601 |
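Kiss3DGen-Base outperforms every baseline on all three metrics while training on less data (147K vs. 320K-500K objects). The 50K variant also beats every baseline on all three metrics.
Main Results: Image-to-3D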
| Method | CD↓ | F-Score↑ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|
| CraftsMan | 0.178 | 0.739 | N/A | N/A | N/A |
| Unique3D | 0.217 | 0.654 | 19.24 | 0.898 | 0.127 |
| Hunyuan3D-1.0 | 0.153 | 0.768 | 16.65 | 0.885 | 0.123 |
| Kiss3DGen | 0.149 | 0.769 | 20.35 | 0.902 | 0.116 |
Kiss3DGen achieves the best score on every reported metric, covering both 3D geometry (CD, F-Score) and 2D visual quality (PSNR, SSIM, LPIPS).
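For readers unfamiliar with the geometry and image metrics in the table, the following is a minimal sketch of Chamfer distance (CD), F-Score, and PSNR. The exact definitions used in the paper (squared vs. unsquared distances, normalization, number of sampled points, F-Score threshold) are not given in this note, so the choices below are assumptions, and the brute-force nearest-neighbor search is only suitable for small point sets.

```python
import numpy as np

def nearest_distances(src, dst):
    """Euclidean distance from each point in src to its nearest point in dst."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric CD: mean nearest-neighbor distance in both directions (assumed form)."""
    return nearest_distances(pred, gt).mean() + nearest_distances(gt, pred).mean()

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = (nearest_distances(pred, gt) < threshold).mean()
    recall = (nearest_distances(gt, pred) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def psnr(img_a, img_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    mse = np.mean((img_a - img_b) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_value ** 2 / mse)

# Synthetic point clouds standing in for samples from predicted and reference meshes
rng = np.random.default_rng(0)
gt_points = rng.uniform(-1.0, 1.0, size=(500, 3))
pred_points = gt_points + rng.normal(scale=0.01, size=gt_points.shape)

print("CD:", chamfer_distance(pred_points, gt_points))
print("F-Score:", f_score(pred_points, gt_points, threshold=0.05))
```

Lower CD and higher F-Score both indicate that the predicted surface lies closer to the reference surface, which is why the table marks them with ↓ and ↑ respectively.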
Ablation Study
| Settings | Multi-view Consistency | RGB-Normal Consistency |
|---|---|---|
| 3D Bundle Image (Ours) | ✓ High | ✓ High |
| Switcher Mechanism | Medium | ✗ Low (RGB and normals inconsistent) |
Key Findings
- 3D Bundle Image Outperforms Switcher: The attention mechanism of DiT ensures multi-view and RGB-normal consistency within a single-pass generation.
- Extremely High Data Efficiency: A model trained on only 50K objects is already competitive, and the 147K model reaches SOTA performance with far less data than the 320K-500K objects used by competing methods.
- Surpassing "Ground Truth" Quality: Quality and Aesthetic scores even exceed those of real rendered images, which is attributed to the high-quality image priors of pre-trained Flux.
- Natural and Effective ControlNet Extension: 3D enhancement and editing do not require any extra architectural design, directly reusing the 2D software stack.
Highlights & Insights
- The design philosophy of "Keep It Simple and Straightforward" is maintained throughout: reusing 2D models for 3D in the simplest possible manner.
- 3D Bundle Image is an elegant representation choice: it encodes the 3D information in a single 2D image without modifying the pre-trained model structure.
- The data-efficiency advantage stems from knowledge transfer: the image priors learned by Flux carry over to 3D.
- Compatibility with ControlNet unlocks rich applications such as 3D enhancement, editing, and stylization.
Limitations & Future Work
- 4 views + normal maps may be insufficient to represent complex geometries (e.g., highly self-occluded objects).
- Relies on ISOMER/LRM for 3D reconstruction, making the reconstruction quality a potential bottleneck.
- Only supports single-object generation; scene-level 3D generation is not supported.
- The optimal representation of normal maps can be further explored.
- Generation resolution is constrained by the 512×512 rendering framework.
- The hyperparameters \(\lambda_1, \lambda_2\) of ControlNet require manual tuning for different tasks.
Related Work & Insights
- DreamFusion / SDS: Classics in optimization-based 3D generation, which Kiss3DGen replaces with direct generation.
- InstantMesh / LRM: Large reconstruction model series, which can be used complementarily with Kiss3DGen.
- Flux (DiT): The choice of foundation model, whose attention mechanism is crucial for multi-view consistency.
- ISOMER / NeuS: Core tools for reconstructing meshes from multi-view RGB and normal maps.
- Unique3D: A similar two-stage pipeline (generating RGB and normals separately), but Kiss3DGen's joint generation achieves better consistency.
Rating
⭐⭐⭐⭐: A simple yet practical approach that turns Flux into a 3D generator through LoRA fine-tuning alone, with high data efficiency. Compatibility with ControlNet opens up rich applications. There remains room for improvement in 3D representation and reconstruction precision. Overall, it is an engineering-driven, practical work.