TeRA: Rethinking Text-guided Realistic 3D Avatar Generation¶
Conference: ICCV 2025 | arXiv: 2509.02466 | Code: Available | Area: Image Generation / 3D Avatar Generation | Keywords: 3D Avatar, Latent Diffusion, Text-to-3D, SMPL-X, UV Gaussian
TL;DR¶
TeRA is presented as the first text-guided realistic 3D avatar generation framework built on a latent diffusion model. By distilling a large-scale human reconstruction model into a structured latent space, TeRA generates realistic 3D human avatars in about 12 seconds, roughly two orders of magnitude faster than SDS-based methods.
Background & Motivation¶
3D avatar creation is a critical requirement for metaverse, film, gaming, and AR/VR applications. Existing approaches fall into two paradigms, each with its own dilemma:
SDS-based methods (TADA, HumanGaussian, HumanNorm, etc.):
- Advantages: leverage rich human priors from 2D diffusion models; no 3D training data required.
- Limitations: ① iterative optimization is extremely slow (hours per scene); ② the lack of explicit 3D structure in 2D models leads to multi-view inconsistency; ③ results suffer from oversaturation, cartoon-like styles, and distorted proportions.
General large-scale 3D generation models:
- Training data contains an excess of cartoon assets and scarce realistic human models, resulting in severe style bias.
- Unable to generate photorealistic 3D humans.
Core idea: train a native 3D diffusion model directly on 3D human data. The key challenge is constructing an efficient 3D human latent space suitable for diffusion model learning.
Method¶
Overall Architecture¶
TeRA adopts a two-stage training pipeline:
Stage 1: Distillation Decoder
- Distills a compact structured latent space from the large-scale human reconstruction model IDOL.
- IDOL encodes input images into UV-aligned features, but at an excessively high resolution (1536×1536).
Stage 2: Structured Latent Diffusion Model
- Trains a text-conditioned diffusion model within the distilled latent space (256×256).
- Generates UV feature maps from noise and decodes them into 3D Gaussian human avatars (a minimal inference sketch follows below).
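Putting the two stages together, inference is a single feed-forward pass. A minimal sketch of that flow, with placeholder function names rather than the released API:

```python
# Hypothetical end-to-end inference flow implied by the two training stages above.
# `text_encoder`, `diffusion_sampler`, and `distill_decoder` are placeholders.
def generate_avatar(text_prompt, text_encoder, diffusion_sampler, distill_decoder):
    cond = text_encoder(text_prompt)                      # CLIP text embedding (77 x 768)
    latent = diffusion_sampler(cond)                      # sampled 256x256 structured latent
    geometry_maps, color_maps = distill_decoder(latent)   # 1024x1024 UV attribute maps
    return geometry_maps, color_maps                      # converted to 3D Gaussians for rendering/animation
```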
Key Designs¶
1. 3D Human Representation: UV-Structured Gaussians
The 3D human is represented using SMPL-X-aligned UV-structured Gaussians:
- Each Gaussian position is initialized to an SMPL-X mesh vertex.
- A neural network predicts offset values (position, rotation, and scale offsets) as well as color and opacity.
- All attributes are stored in multi-channel attribute maps in the SMPL-X UV space.
The representation natively supports ① direct animation driving, ② texture editing, and ③ shape editing (a decoding sketch follows below).
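As an illustration of how UV attribute maps turn into per-Gaussian parameters, here is a minimal PyTorch sketch; the channel layout, activation choices, and the function name `uv_maps_to_gaussians` are assumptions, not the paper's implementation:

```python
# Minimal sketch: sample SMPL-X UV attribute maps at each vertex's UV coordinate and
# assemble Gaussian parameters. Channel split (14 channels total) is an assumption.
import torch
import torch.nn.functional as F

def uv_maps_to_gaussians(attr_maps, vertex_uv, vertex_xyz):
    """attr_maps: (1, C, H, W) UV attribute maps; vertex_uv: (N, 2) in [0, 1];
    vertex_xyz: (N, 3) SMPL-X vertex positions used to initialize Gaussian centers."""
    # grid_sample expects coordinates in [-1, 1], shaped (1, 1, N, 2)
    grid = (vertex_uv * 2.0 - 1.0).view(1, 1, -1, 2)
    feats = F.grid_sample(attr_maps, grid, align_corners=True)  # (1, C, 1, N)
    feats = feats.squeeze(0).squeeze(1).t()                     # (N, C)

    # Assumed channel split: 3 position-offset, 4 rotation, 3 scale, 3 color, 1 opacity
    d_xyz, rot, scale, rgb, alpha = feats.split([3, 4, 3, 3, 1], dim=-1)
    return {
        "xyz": vertex_xyz + d_xyz,             # center = SMPL-X vertex + predicted offset
        "rotation": F.normalize(rot, dim=-1),  # unit quaternion
        "scale": torch.exp(scale),             # positive scales
        "rgb": torch.sigmoid(rgb),
        "opacity": torch.sigmoid(alpha),
    }
```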
2. Distillation-based Latent Encoding
Directly training a VAE on complex 3D human models is unstable and computationally expensive. TeRA's alternative:
- Uses the IDOL encoder to extract UV features (which already exhibit good structural properties and generalization).
- Downsamples UV features from 1536×1536 to 256×256.
- Trains a compact convolutional distillation decoder: upsamples to 1024×1024 and decodes geometry and color attributes via two branches.
- Avoids the instability and posterior collapse associated with training a VAE.
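A minimal sketch of what such a distillation decoder could look like; the channel counts, number of upsampling stages, and layer choices are assumptions rather than the paper's exact architecture:

```python
# Sketch: upsample the 256x256 distilled latent to a 1024x1024 UV map and decode
# geometry and color attributes through two branches, as described above.
import torch.nn as nn

def up_block(c_in, c_out):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.GroupNorm(8, c_out),
        nn.SiLU(),
    )

class DistillationDecoder(nn.Module):
    def __init__(self, latent_ch=32, base_ch=64):
        super().__init__()
        # 256 -> 512 -> 1024
        self.trunk = nn.Sequential(
            nn.Conv2d(latent_ch, base_ch * 2, 3, padding=1),
            up_block(base_ch * 2, base_ch * 2),
            up_block(base_ch * 2, base_ch),
        )
        # Geometry branch: position offset (3) + rotation (4) + scale (3)
        self.geometry_head = nn.Conv2d(base_ch, 10, 3, padding=1)
        # Color branch: RGB (3) + opacity (1)
        self.color_head = nn.Conv2d(base_ch, 4, 3, padding=1)

    def forward(self, z):          # z: (B, latent_ch, 256, 256)
        h = self.trunk(z)          # (B, base_ch, 1024, 1024)
        return self.geometry_head(h), self.color_head(h)
```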
Training loss: the distillation decoder is supervised with an L2 reconstruction loss, a VGG perceptual loss, and an offset regularization term (see Loss & Training); a hedged reconstruction of the objective is given below.
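A plausible form of this objective, where \(\hat{I}\) is the image rendered from the decoded Gaussians, \(I\) the ground-truth view, and \(\Delta\mathbf{x}\) the predicted position offsets; the weights \(\lambda\) are assumptions, since the paper's exact formulation is not reproduced here:

\[
\mathcal{L}_{\text{distill}} = \lVert \hat{I} - I \rVert_2^2 + \lambda_{\mathrm{vgg}}\, \mathcal{L}_{\mathrm{VGG}}(\hat{I}, I) + \lambda_{\mathrm{off}}\, \lVert \Delta \mathbf{x} \rVert_2^2
\]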
3. Text Annotation Pipeline
Large vision-language models are used to generate precise text annotations for the HuGe100K dataset:
- Qwen2.5-VL processes front/back/left/right multi-view images to obtain descriptions of individual body parts.
- Qwen2.5 extracts key information to produce refined descriptions of ≤40 words.
- Five additional phrase-level descriptions of varying lengths (8–16 words) are generated.
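A sketch of this two-stage annotation flow, assuming hypothetical `query_vlm` / `query_llm` wrappers around Qwen2.5-VL and Qwen2.5; the prompts are illustrative, not the paper's exact wording:

```python
# Hypothetical annotation pipeline: per-view VLM descriptions, then LLM summarization.
VIEWS = ["front", "back", "left", "right"]

def annotate_subject(view_images, query_vlm, query_llm):
    # Stage 1: per-view, per-body-part descriptions from the vision-language model.
    part_descriptions = [
        query_vlm(image=img,
                  prompt=f"Describe the {view}-view appearance of this person: "
                         "hair, face, upper garment, lower garment, shoes.")
        for view, img in zip(VIEWS, view_images)
    ]
    # Stage 2: the language model distills the raw descriptions into a <=40-word caption.
    caption = query_llm(
        "Summarize this person's appearance in at most 40 words:\n"
        + "\n".join(part_descriptions)
    )
    # Additional phrase-level prompts of varying lengths (8-16 words) for augmentation.
    short_prompts = [
        query_llm(f"Rewrite as a single phrase of about {n} words:\n{caption}")
        for n in (8, 10, 12, 14, 16)
    ]
    return caption, short_prompts
```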
4. Structured Latent Diffusion Model
- Text encoding: CLIP text encoder, 77 tokens × 768 dimensions.
- Noise schedule: DDPM, 1000 training steps, 100 inference steps.
- Prediction target: \(x_0\)-prediction.
- Classifier-free guidance: text condition dropped with 20% probability.
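A minimal sketch of one training step under these settings (\(x_0\)-prediction, MSE loss, 20% text dropout for classifier-free guidance); the denoiser interface, the zero null-text embedding, and the schedule tensor are assumptions:

```python
# Sketch of a DDPM training step with x0-prediction and CFG condition dropout.
import torch

def training_step(denoiser, clip_text_encoder, alphas_cumprod, z0, token_ids,
                  cfg_drop_prob=0.2, num_train_steps=1000):
    """z0: (B, C, 256, 256) clean structured latents; token_ids: (B, 77) CLIP tokens."""
    B = z0.shape[0]
    text_emb = clip_text_encoder(token_ids)                 # (B, 77, 768)

    # Classifier-free guidance: drop the text condition with 20% probability.
    drop = torch.rand(B, device=z0.device) < cfg_drop_prob
    text_emb = torch.where(drop[:, None, None], torch.zeros_like(text_emb), text_emb)

    # Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * noise
    t = torch.randint(0, num_train_steps, (B,), device=z0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

    # x0-prediction: the network regresses the clean latent directly (MSE loss).
    z0_pred = denoiser(z_t, t, text_emb)
    return torch.nn.functional.mse_loss(z0_pred, z0)
```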
5. Structure-Aware Editing (Virtual Try-On)
Leverages the inpainting capability of the diffusion model (a minimal sketch follows below):
- Body regions to be preserved (background) are kept in the latent space.
- Garment regions (foreground) are denoised from noise under the guidance of a target text prompt.
- At each step, the clean background is re-noised to the corresponding timestep and merged with the foreground.
- Produces seamlessly transitioned outfit-transfer results.
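A minimal sketch of this editing loop, assuming hypothetical `sampler_step` (one reverse DDPM step) and `q_sample` (forward noising to timestep t) helpers:

```python
# Sketch of inpainting-style virtual try-on: denoise the garment region from noise
# while re-noising and pasting back the preserved body region at every timestep.
import torch

@torch.no_grad()
def try_on_edit(denoiser, sampler_step, q_sample, z_body, mask, text_emb, timesteps):
    """z_body: (1, C, H, W) clean latent of the source avatar; mask: 1 where the garment
    is regenerated, 0 where the original body is kept; timesteps: descending order."""
    z_t = torch.randn_like(z_body)                  # garment region starts from pure noise
    for t in timesteps:
        # Reverse step on the full latent, guided by the target garment text prompt.
        z_t = sampler_step(denoiser, z_t, t, text_emb)
        # Re-noise the clean background to timestep t and paste it over the kept regions.
        z_bg_t = q_sample(z_body, t)
        z_t = mask * z_t + (1.0 - mask) * z_bg_t
    return z_t                                      # decode with the distillation decoder
```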
Loss & Training¶
- Stage 1: 4× RTX A6000, batch=2, L2 + VGG + offset regularization.
- Stage 2: 4× RTX 3090, batch=8, DDPM 1000 steps, MSE loss.
- Total training time: approximately 90 hours.
- Inference time: 12 seconds (single RTX 3090).
Key Experimental Results¶
Main Results (Table)¶
| Method | CLIP Score↑ | VQA Score↑ | Text Consistency↑ | Visual Quality↑ | Realism↑ | Time↓ |
|---|---|---|---|---|---|---|
| TADA | 29.86 | 0.64 | 3.27 | 2.25 | 2.11 | 2.3h |
| X-Oscar | 32.46 | 0.80 | 3.56 | 2.54 | 2.26 | 2.0h |
| HumanGaussian | 29.31 | 0.82 | 3.74 | 2.49 | 2.28 | 1.0h |
| HumanNorm | 29.94 | 0.72 | 3.79 | 3.01 | 3.04 | 4.0h |
| TeRA | 30.17 | 0.82 | 4.54 | 4.33 | 4.35 | 12s |
TeRA outperforms all baselines on every user-study metric: text consistency 4.54 vs. the best baseline's 3.79, visual quality 4.33 vs. 3.01, realism 4.35 vs. 3.04, while generation is roughly two orders of magnitude faster.
Ablation Study (Table)¶
| Ablation | Result |
|---|---|
| Latent space 128×128 vs. 256×256 | 256 resolution yields richer detail and fewer artifacts |
| Direct feature replacement vs. inpainting editing | Inpainting produces more natural transitions with fewer artifacts |
Key Findings¶
- Fundamental limitations of SDS: SDS baselines are consistently oversaturated and proportionally distorted; even HumanNorm's improved geometry still exhibits artifacts on faces and hands.
- Feed-forward vs. iterative: Single-pass feed-forward generation is not only 100× faster but also avoids the multi-view inconsistency inherent to SDS.
- Necessity of distillation: Directly connecting the diffusion model to the VAE encoder causes posterior collapse; the distillation module is essential.
- Effect of latent space resolution: 256×256 significantly reduces artifacts compared to 128×128.
Highlights & Insights¶
- Paradigm shift: From "guiding 3D optimization with 2D diffusion models" to "training a diffusion model directly in 3D space."
- Distillation over VAE: Cleverly repurposes the encoding space of an existing large-scale reconstruction model (IDOL), avoiding the instability of training a VAE from scratch.
- Multiple benefits of structured representation: UV Gaussians natively support downstream applications including animation driving, editing, and virtual try-on.
- VLM annotation pipeline: The collaborative annotation scheme of Qwen2.5-VL + Qwen2.5 provides a reusable solution for large-scale 3D dataset text annotation.
- 12-second generation: Achieves production-level speed on a single RTX 3090.
Limitations & Future Work¶
- Static models: Training data consists of static 3D humans; dynamic details such as cloth wrinkles induced by motion cannot be modeled.
- Loose garments: Dependence on SMPL-X representation limits the modeling quality of loose clothing such as skirts.
- Dataset dependency: Requires large-scale 3D human datasets such as HuGe100K.
- Hand and facial detail: Although superior to SDS-based methods, fidelity in these regions still has room for improvement.
- Single-person limitation: Currently only supports single-person generation.
Related Work & Insights¶
- IDOL: A large-scale human reconstruction model; TeRA distills its encoding space as the latent space foundation.
- HuGe100K: A dataset of 100K realistic 3D humans providing the required training data.
- Stable Diffusion / LDM: The latent diffusion model framework; TeRA extends it from 2D images to 3D humans.
- TADA: An SDS-based method using SMPL-X + UV displacement maps, representing the previous SOTA paradigm.
- HumanNorm: An SDS-based method introducing normal diffusion to partially alleviate geometry issues.
- Insight: Domain-specific large-scale reconstruction models can serve as "encoder priors" for generative models; the distillation approach is more practical than training a VAE from scratch.
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4.5 |
| Technical Depth | 4 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Practical Value | 4.5 |
| Overall | 4 |