StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
Conference: CVPR 2025
arXiv: 2411.05738
Code: https://stdgen.github.io
Area: 3D Vision / Image Generation
Keywords: 3D Character Generation, Semantic Decomposition, Large Reconstruction Model, Multi-view Diffusion, Anime Character
TL;DR
StdGEN proposes an efficient pipeline to generate high-quality, semantically decomposed (separate body, clothing, and hair) 3D characters from a single image. The core is a Semantic-aware Large Reconstruction Model (S-LRM) that achieves feedforward joint geometry-color-semantic reconstruction by incorporating a semantic field into NeRF/SDF, generating layered 3D characters ready for games/animations within 3 minutes.
Background & Motivation
Background: Generating 3D characters from a single image is in high demand in virtual reality, gaming, and film production. Existing methods fall into two groups: (1) SDS optimization methods (e.g., DreamFusion-style), which optimize a 3D representation against a 2D diffusion model via Score Distillation Sampling but suffer from long optimization times and coarse, high-contrast textures; (2) feedforward reconstruction methods (e.g., CharacterGen), which combine multi-view diffusion with a Large Reconstruction Model for rapid generation but can only produce a single fused mesh that cannot be decomposed.
Limitations of Prior Work: In practical applications, a single integrated character mesh is far from sufficient: games and animations require the character's body, clothes, and hair to be separate so that they can be edited independently, swapped for other outfits, and physically simulated. The watertight meshes generated by existing methods require extensive manual post-processing to separate these components. Moreover, methods like CharacterGen lack geometric and texture detail, which limits the quality of faces and clothing.
Key Challenge: How can high-quality 3D reconstruction and accurate semantic decomposition be achieved simultaneously while remaining efficient? SDS methods are too slow, whereas feedforward methods are fast but cannot decompose the character and offer limited detail.
Goal: To design an efficient pipeline that generates high-quality, semantically decomposed A-pose 3D characters from a single reference image in any pose within 3 minutes, where each component (body, clothing, hair) can be used independently.
Key Insight: Adding semantic attributes to the Large Reconstruction Model lets it output geometry, color, and semantic fields in a single feedforward pass. A differentiable multilayer semantic surface extraction step then enables end-to-end decomposed reconstruction.
Core Idea: An additional semantic decoder is attached to the tri-plane representation of the LRM, and a semantic-equivalent NeRF/SDF is designed to extract a separate implicit field for each semantic class. Combined with an outside-in multi-layer supervision strategy, this lets the model learn internal 3D structure from 2D annotations alone.
Method
Overall Architecture
StdGEN consists of three stages: (1) Multi-view Generation and Pose Standardization: Given a reference image in an arbitrary pose, it is first transformed into an A-pose image using Stable Diffusion enhanced with ReferenceNet, and then an improved Era3D generates high-resolution RGB and normal maps from 6 viewpoints. (2) S-LRM Feedforward Reconstruction: The multi-view images are fed into the Semantic-aware Large Reconstruction Model, which reconstructs three fields in a single forward pass: geometry (density/SDF), color, and semantics. The decomposed multilayer surfaces are then extracted via the semantic-equivalent SDF. (3) Multilayer Mesh Refinement: The meshes are optimized layer by layer in an inside-out order (body \(\rightarrow\) clothing \(\rightarrow\) hair), using diffusion-generated normal maps to guide geometric refinement, followed by coloring via multi-view back-projection.
Key Designs
- Semantic-aware Large Reconstruction Model (S-LRM):
  - Function: Reconstruct semantic-aware 3D implicit fields from multi-view images in a feedforward manner.
  - Mechanism: On top of the ViT encoder, image-to-triplane transformer, and feature decoders from InstantMesh, a semantic decoder with the same structure as the density/color decoders is added. For each spatial point \(\mathbf{x}\), the model outputs density \(\sigma\), color \(c\), and a semantic distribution, with \(p_{i,s}\) denoting the probability of semantic class \(s\) at the \(i\)-th sample along a ray. The key innovation is the semantic-equivalent NeRF: the pixel color for class \(s\) along ray \(\mathbf{r}\) is \(\hat{C}_s(\mathbf{r}) = \sum_i T_{i,s}\, p_{i,s}\, \alpha_i\, c_i\) with \(T_{i,s} = \prod_{j=1}^{i-1}(1-\alpha_j p_{j,s})\), so samples with zero probability for class \(s\) are fully transparent when that class is rendered. For SDF surface extraction, a semantic-equivalent SDF is defined as \(f_{i,s} = \max(f_i, (\max_{r \neq s} p_{i,r}) - p_{i,s})\), where \(f_i\) is the original SDF value at point \(i\) and \(r\) ranges over the other semantic classes. A surface for class \(s\) is therefore extracted only where \(s\) is the dominant class.
  - Design Motivation: Existing LRMs can only generate a single integrated model. With a semantic field and the equivalent implicit-field formulas above, independent meshes for each semantic component can be extracted differentiably from one unified representation, and the extraction remains compatible with the FlexiCubes surface extraction pipeline.
- Multilayer Semantic Supervision Training Strategy:
  - Function: Enable the model to learn the internal structure of 3D characters (such as the body beneath clothes) using only 2D semantic annotations.
  - Mechanism: Training is conducted in three stages. Stage 1 (NeRF + Single-layer Semantics): The tri-plane NeRF representation is trained with an added semantic rendering \(\hat{S}(\mathbf{r}) = \sum_i T_i p_i \alpha_i\), supervised by a cross-entropy loss on surface semantics. The pre-trained InstantMesh weights are frozen, and only the newly added LoRA adapters and the semantic decoder are trained. Stage 2 (NeRF + Multi-layer Semantics): Outer semantic layers are progressively peeled away during rendering and supervision. The full character is rendered first, then hair is masked so that only body+clothing is supervised, and finally clothing is masked so that only the body is supervised. This forces the model to learn the geometry and color of occluded parts. Stage 3 (Mesh + Multi-layer Semantics): Training switches to the FlexiCubes mesh representation at high resolution, layered meshes are extracted with the semantic-equivalent SDF, and normal and depth losses are added.
  - Design Motivation: Direct 3D supervision is too expensive, while 2D supervision of the visible surface alone cannot capture occluded regions. Multi-layer peeling lets the model learn the geometry and texture of the body beneath clothes indirectly, by rendering and supervising different semantic subsets.
- Iterative Multilayer Mesh Refinement:
  - Function: Optimize the coarse mesh output by S-LRM to enhance geometric detail and texture quality.
  - Mechanism: Optimization proceeds inside-out through the spatial layers. First, the innermost (body) mesh is extracted and optimized under the guidance of normal maps. Next, the clothing mesh is overlaid and optimized while the body is kept fixed, with a collision loss \(L_{col} = \frac{1}{n}\sum_i \max((v_j - v_i) \cdot n_j, 0)^3\) that keeps the outer layer from penetrating the inner layer; here \(v_i\) is a vertex of the layer being optimized, and \(v_j\), \(n_j\) are the position and normal of its corresponding vertex on the fixed inner layer. Finally, the hair is overlaid and optimized while the first two layers stay fixed. Each optimization step consists of differentiable rendering \(\rightarrow\) computing the normal/mask loss \(\rightarrow\) updating vertices by gradient descent \(\rightarrow\) re-meshing.
  - Design Motivation: LRMs are limited by tri-plane resolution, which results in insufficient geometric detail. The layered freeze-and-optimize strategy avoids cross-layer interference, the collision loss keeps the clothing on the outward-normal side of the body surface, and the generated clothing is completely hollow inside, making it directly usable for simulation.
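The semantic-equivalent NeRF and SDF formulas from the S-LRM design, together with the layer peeling used in training, can be illustrated on a single toy ray. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the class indices, and in particular the choice to render a subset of classes with the summed class probability are assumptions made for illustration.

```python
def composite_color(alphas, colors, probs):
    """Semantic-equivalent volume rendering for one ray.

    alphas: per-sample opacity alpha_i, ordered front to back
    colors: per-sample RGB triples c_i
    probs:  per-sample probability of the rendered class (or class subset)
    """
    transmittance = 1.0  # product of (1 - alpha_j * p_j) over earlier samples
    pixel = [0.0, 0.0, 0.0]
    for alpha, color, p in zip(alphas, colors, probs):
        weight = transmittance * p * alpha
        pixel = [acc + weight * channel for acc, channel in zip(pixel, color)]
        transmittance *= 1.0 - alpha * p
    return pixel


def subset_probability(class_probs, subset):
    """ASSUMPTION: a peeled layer set is rendered with the summed class probability."""
    return sum(class_probs[k] for k in subset)


def semantic_equivalent_sdf(sdf, class_probs, s):
    """max(f, max_{r != s} p_r - p_s): negative only inside the shape and where s dominates."""
    strongest_other = max(p for r, p in enumerate(class_probs) if r != s)
    return max(sdf, strongest_other - class_probs[s])


BODY, CLOTHING, HAIR = 0, 1, 2  # hypothetical class indices

# Toy ray: an opaque red hair sample in front of an opaque blue body sample.
alphas = [1.0, 1.0]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
class_probs = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

# With p = 1 everywhere this is ordinary compositing: the hair hides the body.
full = composite_color(alphas, colors, [1.0, 1.0])

# Peeling hair: only body+clothing probability counts, so the hair sample
# becomes transparent and the body behind it is revealed.
peeled = composite_color(
    alphas, colors, [subset_probability(p, (BODY, CLOTHING)) for p in class_probs]
)
```

In this toy case `full` is pure red and `peeled` is pure blue, which is the behavior the peeling strategy relies on. For the SDF variant, a point inside the shape keeps its negative value only for the dominant class; for any other class the value becomes positive, so no surface is extracted there.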
Loss & Training
- Stage 1: \(L_1 = L_{mse} + \lambda_{lpips} L_{lpips} + \lambda_{mask} L_{mask} + \lambda_{sem} L_{sem}\)
- Stage 2: For each semantic subset P, the same loss form \(L_2\) is computed, using the masked rendering results and ground truth for supervision.
- Stage 3: \(L_3 = L_2 + \lambda_{normal} L_{normal} + \lambda_{depth} L_{depth} + \lambda_{dev} L_{dev}\)
- Refinement: \(L_{r1} = \lambda'_{mask} L_{mask} + \lambda_{col} L_{col} + \lambda'_{normal} L_{normal}\)
- Training Data: Anime3D++ dataset, containing 10,811 high-quality A-pose anime character models, with multi-view multi-pose RGB, depth, normal, and semantic renderings.
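The collision term \(L_{col}\) in the refinement loss can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the authors' code: the function name is hypothetical, and pairing each outer vertex \(v_i\) with its nearest inner-layer vertex \(v_j\) by brute-force search is an assumption about how the correspondence is found.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def collision_loss(outer_vertices, inner_vertices, inner_normals):
    """L_col = (1/n) * sum_i max((v_j - v_i) . n_j, 0)^3 over outer vertices v_i."""
    total = 0.0
    for v_i in outer_vertices:
        # ASSUMPTION: v_j is the nearest inner-layer vertex (brute-force search).
        j = min(
            range(len(inner_vertices)),
            key=lambda k: _dot(_sub(inner_vertices[k], v_i), _sub(inner_vertices[k], v_i)),
        )
        # Positive when v_i lies behind the inner surface along its outward normal.
        penetration = _dot(_sub(inner_vertices[j], v_i), inner_normals[j])
        total += max(penetration, 0.0) ** 3
    return total / len(outer_vertices)


# One inner (body) vertex at the origin with its normal pointing along +z.
inner_vertices = [(0.0, 0.0, 0.0)]
inner_normals = [(0.0, 0.0, 1.0)]

outside = collision_loss([(0.0, 0.0, 0.5)], inner_vertices, inner_normals)   # clothing above the body
inside = collision_loss([(0.0, 0.0, -0.5)], inner_vertices, inner_normals)   # clothing sunk into the body
```

An outer vertex on the outward side of the body contributes nothing, while one that has sunk 0.5 units below the surface contributes \(0.5^3 = 0.125\). The cubic power keeps the penalty and its gradient small for shallow contact and makes it grow quickly with penetration depth.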
Key Experimental Results
Main Results
3D Character Generation (A-pose Input):
| Method | SSIM↑ | LPIPS↓ | FID↓ | CLIP Sim↑ |
|---|---|---|---|---|
| InstantMesh | 0.888 | 0.126 | 0.107 | 0.906 |
| Unique3D | 0.889 | 0.136 | 0.030 | 0.919 |
| CharacterGen | 0.880 | 0.124 | 0.081 | 0.905 |
| Ours (StdGEN) | 0.937 | 0.066 | 0.010 | 0.941 |
2D Multi-view Generation (A-pose Input):
| Method | SSIM↑ | LPIPS↓ | FID↓ | CLIP Sim↑ |
|---|---|---|---|---|
| Era3D | 0.876 | 0.144 | 0.095 | 0.908 |
| CharacterGen | 0.886 | 0.119 | 0.063 | 0.928 |
| Ours (StdGEN) | 0.958 | 0.038 | 0.004 | 0.941 |
Ablation Study
User study (28 volunteers, 16 sets of comparisons): StdGEN achieves the highest preference rate across four dimensions: overall quality, fidelity, geometry quality, and texture quality.
Key Findings
- StdGEN outperforms existing methods on all reported 2D and 3D evaluation metrics, with a relative decrease in 3D FID of about 88% compared to CharacterGen (0.010 vs 0.081).
- Cross-sectional views of the decomposed results show that the inside of the clothes is completely hollow, which indicates that the model learns multilayer 3D structure rather than merely painting semantic colors onto a single surface.
- Increasing the resolution from 512 to 1024 in the multi-view generation stage has a significant impact on final quality.
- In semantic decomposition training, outside-in hierarchical supervision is the key to learning occluded regions.
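The relative FID reductions implied by the 3D generation table can be recomputed directly from the reported values. The short sketch below uses only numbers from that table; the helper name is hypothetical.

```python
def relative_decrease(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# FID values from the 3D character generation table (A-pose input).
baseline_fid = {"InstantMesh": 0.107, "Unique3D": 0.030, "CharacterGen": 0.081}
stdgen_fid = 0.010

for name, fid in baseline_fid.items():
    print(f"{name}: {relative_decrease(fid, stdgen_fid):.1%}")
# InstantMesh: 90.7%
# Unique3D: 66.7%
# CharacterGen: 87.7%
```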
Highlights & Insights
- The design of semantic-equivalent NeRF/SDF is elegant—it converts a continuous semantic probability field into extractable equivalent implicit fields through a concise formula, fully compatible with existing surface extraction pipelines.
- The 'onion-peeling' concept of the multi-layer supervision strategy is clever: by programmatically masking outer semantic layers to render inner layers, it teaches the model 3D internal structures using only 2D supervision.
- The property of the clothing being hollow inside holds immense engineering value—it can be directly used for dressing, physical simulation, and rigging.
- It works even when the reference image is in an arbitrary pose: the two-stage design of first converting the 2D image to an A-pose and then performing 3D reconstruction improves robustness to input pose.
Limitations & Future Work
- Currently mainly validated on anime characters; the generalization ability to realistic humans remains to be tested.
- The granularity of semantic decomposition is fixed to three categories (body, clothing, hair); finer-grained decomposition (e.g., different clothing parts) requires more detailed annotations.
- Because the pipeline relies on an A-pose intermediate representation, reference images with extreme poses may cause the pose conversion to fail.
- Future work could combine text/sketch inputs for controllable character generation, or introduce more semantic categories to support component decomposition such as accessories.
Related Work & Insights
- Compared to CharacterGen, the key improvements of StdGEN lie in introducing the semantic field for decomposition, increasing resolution to 1024, and applying multi-layer refinement.
- The concept of semantic-equivalent SDF can be extended to other scenarios requiring discrete component extraction from continuous fields (such as indoor scene decomposition and robot part segmentation).
- The outside-in hierarchical supervision paradigm could be applied to other 3D reconstruction tasks that involve occlusion reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ — Semantic-equivalent NeRF/SDF and multi-layer supervision strategies are innovative, but the overall architecture is a clever combination of existing modules.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative/qualitative/user studies, but the ablation could be more detailed.
- Writing Quality: ⭐⭐⭐⭐ — Clear pipeline and complete formula derivations.
- Value: ⭐⭐⭐⭐⭐ — First to achieve feedforward generation of editable decomposed 3D characters from a single image, holding extremely high engineering value.