Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation

Conference: CVPR 2025
arXiv: 2503.01370
Code: Project Page
Area: 3D Vision
Keywords: 3D generation, 3D Bundle Image, Flux, LoRA, ControlNet, text-to-3D, mesh enhancement

TL;DR

Formulates 3D asset generation as a 2D image generation task: the Flux DiT model is fine-tuned to generate a "3D Bundle Image" (a collage of four-view RGB images and normal maps), the 3D mesh is then reconstructed with ISOMER, and ControlNet extends the pipeline to 3D enhancement and editing.

Background & Motivation

Background: 3D content generation methods are divided into optimization-based approaches (such as the DreamFusion family, which are slow but general) and direct generation approaches (such as InstantMesh and CraftsMan, which are fast but rely on large-scale 3D data).

Limitations of Prior Work:
- Optimization-based methods are time-consuming and prone to the Janus problem.
- Direct generation methods rely heavily on 3D training data, yet 70% of the 10M samples in Objaverse-XL are of poor quality.
- 2D data (billions of images in LAION-5B) vastly outnumbers 3D data, yet the 3D priors of 2D diffusion models remain underutilized.
- Existing 2D-to-3D methods (such as Switcher-style separate RGB/normal generation) modify the input/output structure of pre-trained models, which weakens their generalization capability.

Key Challenge: The scarcity of high-quality 3D data vs. the rich image priors possessed by 2D diffusion models. How can the knowledge of pre-trained 2D models be maximally repurposed for 3D generation?

Goal: To redirect 2D diffusion models to 3D generation in the simplest manner, while preserving their native generalization capabilities and compatibility with technologies like ControlNet.

Method

Overall Architecture

Data Preparation: Renders 3D objects into four-view RGB and normal maps, combining them into a single "3D Bundle Image".
Kiss3DGen-Base: Fine-tunes the Flux model using LoRA to generate the 3D Bundle Image.
3D Reconstruction: Reconstructs a textured mesh from the 3D Bundle Image using ISOMER.
Kiss3DGen-ControlNet: Extends ControlNet to support 3D enhancement, editing, and image-to-3D.
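
The Kiss3DGen-Base step relies on LoRA, so a brief reminder of what a LoRA update looks like may help. The following is a minimal NumPy sketch of the generic LoRA formulation (frozen weight plus a scaled low-rank product), not the authors' training code; the function names, the `alpha` scaling convention, and the layer sizes in the usage note are illustrative assumptions.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Apply a linear layer with a LoRA update: y = x W^T + (alpha / r) * x A^T B^T.

    W: frozen base weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), commonly initialised to zero
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters LoRA adds to one (d_out, d_in) layer."""
    return r * (d_in + d_out)
```

For an illustrative 1024×1024 layer at rank 128, `lora_param_count(1024, 1024, 128)` adds 262,144 trainable parameters against 1,048,576 frozen ones. With `B` initialised to zero the adapted layer starts out identical to the pre-trained one, which is one reason this kind of fine-tuning leaves the base model's input/output structure untouched.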

Key Designs

1. 3D Bundle Image Representation
- Function: Renders a 3D object into 4 orthogonal views (spaced at 90° azimuth and 5° elevation) of RGB and normal maps, collated into a single 2D image.
- Mechanism: The 3D Bundle Image is essentially a 2D image, inherently compatible with the input and output structures of pre-trained diffusion models. The attention blocks of DiT naturally excel at capturing long-range dependencies across different views and between RGB and normal maps.
- Design Motivation: Compared to the Switcher mechanism (generating RGB and normal maps separately), the 3D Bundle Image ensures RGB-normal consistency within a single-pass generation. Ablation studies confirm that the Switcher mechanism fails to maintain consistency between the two modalities.
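
To make the representation concrete, here is a minimal sketch of tiling four RGB views and four normal maps into one image and splitting it back. The 2×4 grid with RGB on the top row and normals on the bottom row is an assumption for illustration; the summary above only states that the eight maps are collated into a single 2D image. The function names are hypothetical.

```python
import numpy as np


def make_bundle_image(rgb_views, normal_views):
    """Tile four RGB views (top row) and four normal maps (bottom row).

    The 2x4 layout and row order are assumptions made for this sketch.
    Each view is an array of shape (H, W, 3); the result is (2H, 4W, 3).
    """
    if len(rgb_views) != 4 or len(normal_views) != 4:
        raise ValueError("expected exactly four RGB views and four normal maps")
    shape = rgb_views[0].shape
    for view in list(rgb_views) + list(normal_views):
        if view.shape != shape:
            raise ValueError("all views must share the same (H, W, 3) shape")
    top = np.concatenate(rgb_views, axis=1)        # (H, 4W, 3)
    bottom = np.concatenate(normal_views, axis=1)  # (H, 4W, 3)
    return np.concatenate([top, bottom], axis=0)   # (2H, 4W, 3)


def split_bundle_image(bundle):
    """Inverse of make_bundle_image: recover the four RGB and four normal tiles."""
    h, w = bundle.shape[0] // 2, bundle.shape[1] // 4
    rgb = [bundle[:h, i * w:(i + 1) * w] for i in range(4)]
    normals = [bundle[h:, i * w:(i + 1) * w] for i in range(4)]
    return rgb, normals
```

Under this assumed layout, eight 512×512 renders would give a 1024×2048 bundle, and the split function recovers the per-view maps that a reconstruction stage would consume.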

2. GPT-4V Caption Tagging
- Function: Uses GPT-4V to generate detailed text descriptions for the RGB portion of each 3D Bundle Image, including color, shape, and surface properties.
- Mechanism: Rich text descriptions provide additional semantic supervision signals, allowing the model to learn the correspondences between text and 3D geometry/appearance.
- Design Motivation: Preserves the text-conditional generation capabilities of text-to-image models, which is foundational to text-to-3D, while allowing the model to leverage text-image alignment knowledge learned during Flux pre-training.

3. ControlNet Extension
- Function: Trains ControlNet-Tile and ControlNet-Normal/Canny for 3D enhancement and editing. It introduces two hyperparameters: \(\lambda_1\) (ControlNet strength) and \(\lambda_2\) (ratio of effective steps).
- Mechanism: Low-quality mesh \(\rightarrow\) rendering the 3D Bundle Image \(\rightarrow\) ControlNet enhancement \(\rightarrow\) ISOMER reconstruction. Descriptions are automatically generated by Florence-2 during enhancement, and custom-defined by users during editing.
- Design Motivation: Because Kiss3DGen is inherently a diffusion model, it is naturally compatible with various diffusion techniques (ControlNet, SDEdit, etc.) without requiring architectural modifications.
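
One plausible reading of the two hyperparameters is a per-step schedule: the control signal is applied with strength \(\lambda_1\) for a fraction \(\lambda_2\) of the denoising steps and switched off afterwards. The sketch below encodes that reading only. Whether the active steps are the earliest ones, and how the step count is rounded, are assumptions not stated in the summary above; `controlnet_scales` is a hypothetical helper, not part of any library.

```python
def controlnet_scales(num_steps, strength, active_ratio):
    """Return one ControlNet scale per denoising step.

    Assumed interpretation: `strength` (lambda_1) is used for the first
    `active_ratio` (lambda_2) fraction of steps, and 0.0 for the rest.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= active_ratio <= 1.0:
        raise ValueError("active_ratio must lie in [0, 1]")
    active_steps = int(round(num_steps * active_ratio))
    return [strength if i < active_steps else 0.0 for i in range(num_steps)]
```

Under this reading, a larger \(\lambda_2\) keeps the output closer to the rendered low-quality input for longer, while a smaller value leaves more of the late steps to the base model's own prior, which would explain why the two values need tuning per task.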

Loss & Training

Model: Flux.1-dev + LoRA (rank=128)
Data: 147K high-quality 3D objects (curated from Objaverse, with manually corrected orientations), plus an optional set of 4K animated character models
Training: 8× A800 80GB, 3 days, 16 epochs, batch=4, LR=\(8\times10^{-4}\), bf16 precision
Rendering: Blender, camera distance 4.5, FoV 30°, resolution 512×512
Inference: Generates the 3D Bundle Image first, followed by LRM initialization + ISOMER optimization to obtain the mesh
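
The rendering settings above (camera distance 4.5, FoV 30°, 512×512, four views at 90° azimuth spacing and 5° elevation) determine the camera geometry. The following sketch derives camera centers and the pinhole focal length in pixels from those numbers. The z-up axis convention, the starting azimuth of 0°, and the function names are assumptions for illustration, and this is plain trigonometry rather than the authors' Blender script.

```python
import math


def camera_positions(distance=4.5, elevation_deg=5.0,
                     azimuths_deg=(0.0, 90.0, 180.0, 270.0)):
    """Camera centers on a sphere around the origin (z-up convention assumed)."""
    elev = math.radians(elevation_deg)
    positions = []
    for az_deg in azimuths_deg:
        az = math.radians(az_deg)
        x = distance * math.cos(elev) * math.cos(az)
        y = distance * math.cos(elev) * math.sin(az)
        z = distance * math.sin(elev)
        positions.append((x, y, z))
    return positions


def focal_length_px(fov_deg=30.0, image_size=512):
    """Pinhole focal length in pixels for a square image with the given FoV."""
    return 0.5 * image_size / math.tan(math.radians(fov_deg) / 2.0)
```

With the stated values, the focal length comes out at roughly 955 pixels, and all four cameras sit at the same height above the object's equatorial plane.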

Key Experimental Results

Main Results: Text-to-3D

Method	Data Size	CLIP↑	Quality↑	Aesthetic↑
3DTopia	320K	0.694	2.145	1.538
Direct2.5	500K	0.773	2.158	1.459
Hunyuan3D-1.0	N/A	0.792	2.517	1.504
Kiss3DGen-Base	147K	0.837	2.700	1.800
Kiss3DGen-50K	50K	0.804	2.716	1.601

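Both Kiss3DGen variants outperform all baselines on every metric while training on less data (50K-147K objects vs. 320K-500K for 3DTopia and Direct2.5; Hunyuan3D-1.0's data size is not reported).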

Main Results: Image-to-3D

Method	CD↓	F-Score↑	PSNR↑	SSIM↑	LPIPS↓
CraftsMan	0.178	0.739	N/A	N/A	N/A
Unique3D	0.217	0.654	19.24	0.898	0.127
Hunyuan3D-1.0	0.153	0.768	16.65	0.885	0.123
Kiss3DGen	0.149	0.769	20.35	0.902	0.116

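Kiss3DGen achieves the best score on every reported metric, covering both 3D geometry (CD, F-Score) and 2D visual quality (PSNR, SSIM, LPIPS). The geometry margin over Hunyuan3D-1.0 is small (CD 0.149 vs. 0.153, F-Score 0.769 vs. 0.768), while the PSNR gain is larger (20.35 vs. 16.65).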
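
For readers unfamiliar with the geometry metrics in the table, here is a minimal NumPy sketch of Chamfer distance (CD) and F-Score between two point clouds. It uses one common definition (sum of the two mean nearest-neighbor distances, F-Score at a fixed distance threshold). The exact variant, threshold, point sampling, and normalization used in the paper are not given in this summary, so treat the function and its default `threshold` as assumptions.

```python
import numpy as np


def chamfer_and_fscore(pred, gt, threshold=0.05):
    """Chamfer distance and F-Score between point clouds of shape (N, 3) and (M, 3).

    Brute-force O(N*M) nearest neighbors; fine for small clouds only.
    """
    diff = pred[:, None, :] - gt[None, :, :]   # (N, M, 3)
    dists = np.linalg.norm(diff, axis=-1)      # (N, M) pairwise distances
    pred_to_gt = dists.min(axis=1)             # nearest gt point for each pred point
    gt_to_pred = dists.min(axis=0)             # nearest pred point for each gt point
    chamfer = pred_to_gt.mean() + gt_to_pred.mean()
    precision = (pred_to_gt < threshold).mean()
    recall = (gt_to_pred < threshold).mean()
    if precision + recall == 0:
        fscore = 0.0
    else:
        fscore = 2 * precision * recall / (precision + recall)
    return float(chamfer), float(fscore)
```

Lower CD means the predicted surface lies closer to the reference on average, while F-Score rewards having most points within the threshold in both directions, so the two can disagree when errors are concentrated in a few outliers.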

Ablation Study

Settings	Multi-view Consistency	RGB-Normal Consistency
3D Bundle Image (Ours)	✓ High	✓ High
Switcher Mechanism	Medium	✗ Low (RGB and normals inconsistent)

Key Findings

3D Bundle Image Outperforms Switcher: The attention mechanism of DiT ensures multi-view and RGB-normal consistency within a single-pass generation.
High Data Efficiency: A model trained on only 50K objects already beats the baselines on CLIP, Quality, and Aesthetic scores, and the 147K model performs best overall on CLIP and Aesthetic, using far less data than the 320K-500K objects used by 3DTopia and Direct2.5.
Surpassing "Ground Truth" Quality: Metrics in Quality and Aesthetics even exceed those of real rendered images—benefiting from the high-quality image priors of pre-trained Flux.
Natural and Effective ControlNet Extension: 3D enhancement and editing do not require any extra architectural design, directly reusing the 2D software stack.

Highlights & Insights

The design philosophy of "Keep It Simple and Straightforward" is maintained throughout: reusing 2D models for 3D in the simplest possible manner.
The 3D Bundle Image is an elegant representation choice: it encodes multi-view appearance and geometry in a single 2D image without modifying the pre-trained model structure.
The advantage in data efficiency stems from knowledge transfer of the pre-trained model—the image priors learned by Flux are successfully transferred to 3D.
Compatibility with ControlNet unlocks rich applications such as 3D enhancement, editing, and stylization.

Limitations & Future Work

4 views + normal maps may be insufficient to represent complex geometries (e.g., highly self-occluded objects).
Relies on ISOMER/LRM for 3D reconstruction, making the reconstruction quality a potential bottleneck.
Only supports single-object generation; scene-level 3D generation is not supported.
The optimal representation of normal maps can be further explored.
Generation resolution is constrained by the 512×512 rendering framework.
The hyperparameters \(\lambda_1, \lambda_2\) of ControlNet require manual tuning for different tasks.

Related Work

DreamFusion / SDS: Classics in optimization-based 3D generation, which Kiss3DGen replaces with direct generation.
InstantMesh / LRM: Large reconstruction model series, which can be used complementarily with Kiss3DGen.
Flux (DiT): The choice of foundation model, whose attention mechanism is crucial for multi-view consistency.
ISOMER / NeuS: Core tools for reconstructing meshes from multi-view RGB and normal maps.
Unique3D: A similar two-stage pipeline (generating RGB and normals separately), but Kiss3DGen's joint generation achieves better consistency.

Rating

⭐⭐⭐⭐ — Simple yet practical approach, enabling Flux to act as a 3D generator via mere LoRA fine-tuning with high data efficiency. The compatibility with ControlNet opens up rich applications. However, there remains room for improvement in 3D representation and reconstruction precision; it is an engineering-driven, practical work.