Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation

Conference: CVPR 2025
arXiv: 2503.01370
Code: Project Page
Area: 3D Vision
Keywords: 3D generation, 3D Bundle Image, Flux, LoRA, ControlNet, text-to-3D, mesh enhancement

TL;DR

Formulates 3D asset generation as a 2D image generation task: the Flux DiT model is fine-tuned to generate a "3D Bundle Image" (a collage of four-view RGB images and normal maps), ISOMER reconstructs the 3D mesh from that image, and ControlNet extends the approach to 3D enhancement and editing.

Background & Motivation

Background: 3D content generation methods are divided into optimization-based approaches (such as the DreamFusion family, which are slow but general) and direct generation approaches (such as InstantMesh and CraftsMan, which are fast but rely on large-scale 3D data).

Limitations of Prior Work:

  • Optimization-based methods are time-consuming and prone to the Janus problem.
  • Direct generation methods rely heavily on 3D training data, yet 70% of the 10M samples in Objaverse-XL are of poor quality.
  • The scale of 2D data (billions of images in LAION-5B) vastly exceeds that of 3D data, yet the 3D priors of 2D diffusion models remain underutilized.
  • Existing 2D-to-3D methods (such as Switcher-style separate RGB/normal generation) modify the input/output structure of pre-trained models, which weakens their generalization capability.

Key Challenge: The scarcity of high-quality 3D data vs. the rich image priors possessed by 2D diffusion models. How can the knowledge of pre-trained 2D models be maximally repurposed for 3D generation?

Goal: To redirect 2D diffusion models to 3D generation in the simplest manner, while preserving their native generalization capabilities and compatibility with technologies like ControlNet.

Method

Overall Architecture

  1. Data Preparation: Renders 3D objects into four-view RGB and normal maps, combining them into a single "3D Bundle Image".
  2. Kiss3DGen-Base: Fine-tunes the Flux model using LoRA to generate the 3D Bundle Image.
  3. 3D Reconstruction: Reconstructs a textured mesh from the 3D Bundle Image using ISOMER.
  4. Kiss3DGen-ControlNet: Extends ControlNet to support 3D enhancement, editing, and image-to-3D.
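Step 2 relies on LoRA rather than full fine-tuning. The following is a minimal NumPy sketch of a generic LoRA-adapted linear layer, not the authors' training code: the class name `LoRALinear`, the helper `lora_param_fraction`, the initialization scale, and the `alpha` default are all illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (toy NumPy version)."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight  # frozen pre-trained weight W, shape (out, in)
        self.rank = rank
        # Assumption: alpha defaults to rank, so the update is unscaled.
        self.scale = (alpha if alpha is not None else rank) / rank
        # Common LoRA init: A small random, B zero, so the update starts at 0.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_features))
        self.B = np.zeros((out_features, rank))

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        # Only A and B are trained; W stays frozen.
        return self.A.size + self.B.size


def lora_param_fraction(in_features, out_features, rank):
    """Trainable LoRA parameters relative to the frozen dense weight."""
    return rank * (in_features + out_features) / (in_features * out_features)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly. For a hypothetical 3072×3072 projection with rank 128, `lora_param_fraction(3072, 3072, 128)` evaluates to 1/12, i.e. roughly 8% of that layer's weights are trainable.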

Key Designs

1. 3D Bundle Image Representation

  • Function: Renders a 3D object from 4 views spaced 90° apart in azimuth at a 5° elevation, producing an RGB image and a normal map per view, and collates all of them into a single 2D image.
  • Mechanism: The 3D Bundle Image is essentially a 2D image, so it is inherently compatible with the input and output structures of pre-trained diffusion models. The attention blocks of DiT naturally excel at capturing long-range dependencies across different views and between RGB and normal maps.
  • Design Motivation: Compared to the Switcher mechanism (generating RGB and normal maps separately), the 3D Bundle Image ensures RGB-normal consistency within a single generation pass. Ablation studies confirm that the Switcher mechanism fails to maintain consistency between the two modalities.
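To make the representation concrete, here is a minimal sketch of how such a collage could be assembled and split again. The 2×4 grid (RGB views in the top row, normal maps in the bottom row) and the function names `make_bundle_image` and `split_bundle_image` are assumptions made for illustration; the summary above does not specify the exact tile order.

```python
import numpy as np


def make_bundle_image(rgb_views, normal_views):
    """Tile per-view images into one array: RGB row on top, normal row below."""
    if len(rgb_views) != len(normal_views):
        raise ValueError("need exactly one normal map per RGB view")
    rgb_row = np.concatenate(rgb_views, axis=1)        # (H, n*W, 3)
    normal_row = np.concatenate(normal_views, axis=1)  # (H, n*W, 3)
    return np.concatenate([rgb_row, normal_row], axis=0)  # (2H, n*W, 3)


def split_bundle_image(bundle, num_views=4):
    """Inverse of make_bundle_image: recover the per-view RGB and normal tiles."""
    tile_h = bundle.shape[0] // 2
    tile_w = bundle.shape[1] // num_views
    rgb = [bundle[:tile_h, i * tile_w:(i + 1) * tile_w] for i in range(num_views)]
    normals = [bundle[tile_h:, i * tile_w:(i + 1) * tile_w] for i in range(num_views)]
    return rgb, normals
```

Under this assumed layout, four 512×512 views yield a 1024×2048 bundle, and splitting a generated bundle recovers the eight tiles that a multi-view reconstruction step would consume.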

2. GPT-4V Caption Tagging

  • Function: Uses GPT-4V to generate detailed text descriptions for the RGB portion of each 3D Bundle Image, including color, shape, and surface properties.
  • Mechanism: Rich text descriptions provide additional semantic supervision signals, allowing the model to learn the correspondences between text and 3D geometry/appearance.
  • Design Motivation: Preserves the text-conditional generation capabilities of text-to-image models, which is foundational to text-to-3D, while allowing the model to leverage text-image alignment knowledge learned during Flux pre-training.

3. ControlNet Extension

  • Function: Trains ControlNet-Tile and ControlNet-Normal/Canny for 3D enhancement and editing. It introduces two hyperparameters: \(\lambda_1\) (ControlNet strength) and \(\lambda_2\) (ratio of effective steps).
  • Mechanism: Low-quality mesh \(\rightarrow\) rendering the 3D Bundle Image \(\rightarrow\) ControlNet enhancement \(\rightarrow\) ISOMER reconstruction. Descriptions are generated automatically by Florence-2 during enhancement and supplied by the user during editing.
  • Design Motivation: Because Kiss3DGen is inherently a diffusion model, it is naturally compatible with various diffusion techniques (ControlNet, SDEdit, etc.) without requiring architectural modifications.
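One plausible reading of the two hyperparameters is a per-step weighting schedule: the control signal is applied with strength \(\lambda_1\) during the first \(\lambda_2\) fraction of denoising steps and switched off afterwards. The sketch below encodes that reading only; the function name `controlnet_scale_schedule` and the choice to apply control at the start of sampling are assumptions, not details confirmed by the paper summary.

```python
def controlnet_scale_schedule(num_steps, strength, active_ratio):
    """Per-step ControlNet weight: `strength` for the leading share of steps, then 0.

    strength     -- plays the role of lambda_1 (ControlNet strength)
    active_ratio -- plays the role of lambda_2 (ratio of effective steps)
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= active_ratio <= 1.0:
        raise ValueError("active_ratio must lie in [0, 1]")
    active_steps = int(round(active_ratio * num_steps))
    return [strength if step < active_steps else 0.0 for step in range(num_steps)]
```

For example, `controlnet_scale_schedule(10, 0.6, 0.5)` keeps the control weight at 0.6 for the first five steps and drops it to 0.0 for the remaining five. Under this reading, a lower ratio leaves more of the sampling trajectory unconstrained, which would suit editing, while a higher ratio stays closer to the rendered input, which would suit enhancement.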

Loss & Training

  • Model: Flux.1-dev + LoRA (rank=128)
  • Data: 147K high-quality 3D objects (curated from Objaverse, with manually corrected orientations), plus an optional set of 4K animated character models
  • Training: 8× A800 80GB, 3 days, 16 epochs, batch=4, LR=\(8\times10^{-4}\), bf16 precision
  • Rendering: Blender, camera distance 4.5, FoV 30°, resolution 512×512
  • Inference: Generates the 3D Bundle Image first, followed by LRM initialization + ISOMER optimization to obtain the mesh
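The rendering settings translate into a simple camera rig. The sketch below computes camera centres for the four views from the stated distance (4.5), the 5° elevation and the 90° azimuth spacing, plus the width of the visible region implied by the 30° FoV. The z-up convention, the starting azimuth of 0°, and the function names are assumptions; the summary does not state them.

```python
import math


def camera_positions(distance=4.5, elevation_deg=5.0, num_views=4):
    """Camera centres on a sphere around the origin (z-up), evenly spaced in azimuth."""
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        azim = math.radians(i * 360.0 / num_views)  # 0, 90, 180, 270 for 4 views
        x = distance * math.cos(elev) * math.cos(azim)
        y = distance * math.cos(elev) * math.sin(azim)
        z = distance * math.sin(elev)
        positions.append((x, y, z))
    return positions


def visible_extent(distance=4.5, fov_deg=30.0):
    """Width of the region seen at the object's centre plane by a perspective camera."""
    return 2.0 * distance * math.tan(math.radians(fov_deg) / 2.0)
```

With these values each camera sits about 0.39 units above the horizontal plane, and the view frustum covers a region roughly 2.4 units wide at the origin, so an object normalized to fit inside a unit-scale bounding volume would sit comfortably within the frame.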

Key Experimental Results

Main Results—Text-to-3D

| Method | Data Size | CLIP↑ | Quality↑ | Aesthetic↑ |
|---|---|---|---|---|
| 3DTopia | 320K | 0.694 | 2.145 | 1.538 |
| Direct2.5 | 500K | 0.773 | 2.158 | 1.459 |
| Hunyuan3D-1.0 | N/A | 0.792 | 2.517 | 1.504 |
| Kiss3DGen-Base | 147K | 0.837 | 2.700 | 1.800 |
| Kiss3DGen-50K | 50K | 0.804 | 2.716 | 1.601 |

Kiss3DGen-Base outperforms every baseline on all three metrics while using less training data (147K objects vs. 320K-500K for 3DTopia and Direct2.5).

Main Results—Image-to-3D

| Method | CD↓ | F-Score↑ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|
| CraftsMan | 0.178 | 0.739 | N/A | N/A | N/A |
| Unique3D | 0.217 | 0.654 | 19.24 | 0.898 | 0.127 |
| Hunyuan3D-1.0 | 0.153 | 0.768 | 16.65 | 0.885 | 0.123 |
| Kiss3DGen | 0.149 | 0.769 | 20.35 | 0.902 | 0.116 |

Kiss3DGen achieves the best score on every reported metric, covering both 3D geometry (CD, F-Score) and 2D visual quality (PSNR, SSIM, LPIPS).
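For readers unfamiliar with the geometry metrics, the following sketch shows one common way to compute Chamfer Distance (CD) and F-Score between two sampled point clouds. The exact variant used in the paper (squared vs. unsquared distances, sum vs. mean, the F-Score threshold, the number of sampled points) is not given in this summary, so the choices below are assumptions. The brute-force pairwise computation is only suitable for small clouds.

```python
import numpy as np


def _nearest_distances(src, dst):
    """For each point in src, Euclidean distance to its nearest neighbour in dst."""
    diff = src[:, None, :] - dst[None, :, :]  # (N, M, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: mean of the two directed mean distances."""
    return 0.5 * (_nearest_distances(pred, gt).mean()
                  + _nearest_distances(gt, pred).mean())


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a given distance threshold."""
    precision = (_nearest_distances(pred, gt) < threshold).mean()
    recall = (_nearest_distances(gt, pred) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return float(2.0 * precision * recall / (precision + recall))
```

Lower CD means the predicted surface lies closer to the reference on average, while a higher F-Score means a larger share of points on each surface has a counterpart within the threshold on the other.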

Ablation Study

| Settings | Multi-view Consistency | RGB-Normal Consistency |
|---|---|---|
| 3D Bundle Image (Ours) | ✓ High | ✓ High |
| Switcher Mechanism | Medium | ✗ Low (RGB and normals inconsistent) |

Key Findings

  1. 3D Bundle Image Outperforms Switcher: The attention mechanism of DiT ensures multi-view and RGB-normal consistency within a single-pass generation.
  2. Extremely High Data Efficiency: A model trained on 50K objects is already competitive, and 147K objects yield SOTA results, far fewer than the 320K-500K used by competing methods.
  3. Surpassing "Ground Truth" Quality: Quality and Aesthetic scores even exceed those of the real rendered images, which is attributed to the high-quality image priors of pre-trained Flux.
  4. Natural and Effective ControlNet Extension: 3D enhancement and editing do not require any extra architectural design, directly reusing the 2D software stack.

Highlights & Insights

  • The design philosophy of "Keep It Simple and Straightforward" is maintained throughout: reusing 2D models for 3D in the simplest possible manner.
  • The 3D Bundle Image is an elegant representation choice: it encodes an object's geometry and appearance in a single 2D image, without modifying the pre-trained model structure.
  • The advantage in data efficiency stems from knowledge transfer of the pre-trained model—the image priors learned by Flux are successfully transferred to 3D.
  • Compatibility with ControlNet unlocks rich applications such as 3D enhancement, editing, and stylization.

Limitations & Future Work

  • 4 views + normal maps may be insufficient to represent complex geometries (e.g., highly self-occluded objects).
  • Relies on ISOMER/LRM for 3D reconstruction, making the reconstruction quality a potential bottleneck.
  • Only supports single-object generation; scene-level 3D generation is not supported.
  • The optimal representation of normal maps can be further explored.
  • Generation resolution is constrained by the 512×512 rendering framework.
  • The hyperparameters \(\lambda_1, \lambda_2\) of ControlNet require manual tuning for different tasks.

Related Work

  • DreamFusion / SDS: Classics in optimization-based 3D generation, which Kiss3DGen replaces with direct generation.
  • InstantMesh / LRM: Large reconstruction model series, which can be used complementarily with Kiss3DGen.
  • Flux (DiT): The choice of foundation model, whose attention mechanism is crucial for multi-view consistency.
  • ISOMER / NeuS: Core tools for reconstructing meshes from multi-view RGB and normal maps.
  • Unique3D: A similar two-stage pipeline (generating RGB and normals separately), but Kiss3DGen's joint generation achieves better consistency.

Rating

⭐⭐⭐⭐: A simple yet practical approach that turns Flux into a 3D generator through LoRA fine-tuning alone, with high data efficiency. Compatibility with ControlNet opens up rich applications. There is still room for improvement in the 3D representation and in reconstruction precision; overall, it is an engineering-driven, practical work.