LCA: Large-scale Codec Avatars - The Unreasonable Effectiveness of Large-scale Avatar Pretraining¶
- Conference: CVPR 2026
- arXiv: 2604.02320
- Code: https://junxuan-li.github.io/lca
- Area: Human Understanding / 3D Vision
- Keywords: 3D avatars, large-scale pretraining, feed-forward generation, Gaussian splatting, expression control
TL;DR¶
LCA is the first work to apply the large-scale pretraining/post-training paradigm to 3D avatar modeling: it pretrains on 1 million in-the-wild videos to acquire broad appearance and geometry priors, then post-trains on high-quality multi-view studio data to enhance fine-grained expression fidelity, effectively breaking the inherent trade-off between generalizability and fidelity.
Background & Motivation¶
High-quality 3D avatar modeling faces a fundamental trade-off: studio data enables high-fidelity avatars but generalizes poorly (only to captured subjects); in-the-wild data generalizes to more identities but yields lower quality due to 3D ambiguity and distortion.
Core Insight: As with LLMs and visual foundation models, large-scale pretraining learns general priors, while a small amount of high-quality post-training data aligns the model to the target task. This work is the first to demonstrate that the same paradigm is effective in the 3D avatar domain.
Method¶
Overall Architecture¶
A two-branch architecture: reference image tokens + template body mesh tokens → large Transformer fusion → canonical MLP outputting Gaussian attributes + correction MLP outputting offsets under driving signals → LBS transformation to target pose → 3DGS rendering. Pretraining is conducted on 1M in-the-wild videos; post-training is conducted on multi-view studio data covering thousands of identities.
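The data flow above can be sketched end to end. This is a minimal NumPy sketch under stated assumptions: all dimensions, weight shapes, and the single-matrix stand-in for the Transformer fusion are illustrative choices of this note, not the authors' implementation, and LBS plus 3DGS rendering are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU, reused for both the canonical and correction heads.
    return np.maximum(x @ w1, 0) @ w2

# --- Illustrative dimensions (assumptions, not from the paper) ---
D = 64            # token width
N_IMG = 196       # tokens from the reference images (visual encoder output)
N_MESH = 512      # tokens from the canonicalized body template mesh
G_ATTR = 14       # per-Gaussian attributes: position(3)+scale(3)+rotation(4)+opacity(1)+color(3)
D_DRIVE = 32      # driving signal: latent expression code + SMPL-X pose params

# Stand-ins for learned weights.
W_fuse = rng.standard_normal((D, D)) * 0.1  # crude stand-in for the large Transformer
W1_canon = rng.standard_normal((D, D)) * 0.1
W2_canon = rng.standard_normal((D, G_ATTR)) * 0.1
W1_corr = rng.standard_normal((D + D_DRIVE, D)) * 0.1
W2_corr = rng.standard_normal((D, G_ATTR)) * 0.1

def forward(image_tokens, mesh_tokens, driving):
    # 1) Fuse reference-image tokens and template-mesh tokens (Transformer in the paper).
    tokens = np.concatenate([image_tokens, mesh_tokens], axis=0) @ W_fuse
    mesh_out = tokens[-N_MESH:]  # read out at the mesh-token positions
    # 2) Canonical branch: static Gaussian attributes in canonical space.
    canonical = mlp(mesh_out, W1_canon, W2_canon)
    # 3) Correction branch: attribute offsets conditioned on the driving signal.
    cond = np.concatenate(
        [mesh_out, np.broadcast_to(driving, (N_MESH, D_DRIVE))], axis=1)
    offsets = mlp(cond, W1_corr, W2_corr)
    # 4) Driven Gaussians = canonical + offsets (LBS to target pose and
    #    3DGS rendering would follow here).
    return canonical + offsets

gaussians = forward(rng.standard_normal((N_IMG, D)),
                    rng.standard_normal((N_MESH, D)),
                    rng.standard_normal(D_DRIVE))
print(gaussians.shape)  # (512, 14): one attribute vector per Gaussian
```

One Gaussian per mesh token is the simplest possible binding; the actual mapping from tokens to Gaussians is not specified at this level of the note.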
Key Designs¶
- Scalable Dual-Branch Architecture:
    - Function: Jointly supports training on both studio and in-the-wild data.
    - Mechanism: Image tokens are derived from a general-purpose visual encoder; geometry tokens are derived from a canonicalized body template mesh. The Transformer backbone employs a hybrid attention scheme (alternating global attention and per-image self-attention) to handle a variable number of input images. The canonical branch outputs canonical Gaussian attributes; the correction branch outputs attribute offsets conditioned on driving signals.
    - Design Motivation: Eliminates the need for high-quality conditioning data (e.g., geometry and texture maps), enabling seamless switching between different data sources.
- Pretraining → Post-training Paradigm:
    - Function: Achieves the optimal balance between generalizability and fidelity.
    - Mechanism: The pretraining stage learns broad priors over human appearance and geometry from 1M in-the-wild videos. The post-training stage specializes on multi-view studio data to enhance the granularity and 3D consistency of facial expressions. Post-training augments rather than overwrites the generalization capability acquired during pretraining.
    - Design Motivation: Analogous to LLM pretraining + RLHF: pretraining provides capability, post-training provides quality.
- Self-supervised Expression Encoding:
    - Function: Learns fine-grained facial expression control signals.
    - Mechanism: A FACS-inspired self-supervised approach learns latent expression codes, which serve as driving signals for the correction branch. Combined with SMPL-X body and hand poses, this enables fine-grained full-body control.
    - Design Motivation: Expression is the most critical control dimension for avatars, requiring accuracy beyond that of parametric face models.
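The hybrid attention scheme in the dual-branch design can be illustrated with attention masks. The sketch below is an assumption-laden toy (single head, no projections, NumPy only): per-image self-attention is realized by masking a token to attend only within its own image, while the global layer attends across all tokens. Token counts per image are arbitrary, which is what lets the scheme handle a variable number of input images.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Plain scaled dot-product attention; entries where mask is False are blocked.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
D = 16
# Tokens from a *variable* number of reference images, concatenated;
# image_ids[i] records which image token i came from.
tokens_per_image = [7, 7, 5]  # e.g. three input images with varying token counts
image_ids = np.concatenate(
    [np.full(n, i) for i, n in enumerate(tokens_per_image)])
x = rng.standard_normal((len(image_ids), D))

# Per-image self-attention: a token may only attend within its own image.
per_image_mask = image_ids[:, None] == image_ids[None, :]
x = attention(x, x, x, mask=per_image_mask)

# Global attention: every token attends to every token
# (in the paper, the global layers also mix in the mesh tokens).
x = attention(x, x, x)

print(x.shape)  # (19, 16)
```

In a real model these two layer types alternate through the depth of the Transformer; here one of each suffices to show the masking pattern.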
Loss & Training¶
The model is supervised with a 3DGS rendering loss (L1 + D-SSIM), a perceptual loss, and an identity preservation loss. Pretraining is performed on the 1M in-the-wild videos; post-training is performed on the multi-view studio data.
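A minimal sketch of how these terms could combine, under loud assumptions: the SSIM here is a simplified global version (no sliding window), the perceptual term uses placeholder feature vectors rather than real VGG/LPIPS features, the identity term is a cosine distance on placeholder embeddings, and the weights are invented for illustration — the paper does not report them in this note.

```python
import numpy as np

def l1_loss(pred, gt):
    return np.abs(pred - gt).mean()

def dssim_loss(pred, gt, c1=0.01**2, c2=0.03**2):
    # Simplified *global* SSIM (no sliding window); D-SSIM = (1 - SSIM) / 2.
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p**2 + mu_g**2 + c1) * (var_p + var_g + c2))
    return (1.0 - ssim) / 2.0

def total_loss(pred, gt, feat_pred, feat_gt, id_pred, id_gt,
               w_ssim=0.2, w_perc=0.1, w_id=0.1):
    # Weights are illustrative assumptions, not the paper's values.
    perceptual = np.abs(feat_pred - feat_gt).mean()  # stand-in for VGG/LPIPS features
    identity = 1.0 - (id_pred @ id_gt) / (
        np.linalg.norm(id_pred) * np.linalg.norm(id_gt))  # cosine distance
    return (l1_loss(pred, gt) + w_ssim * dssim_loss(pred, gt)
            + w_perc * perceptual + w_id * identity)

rng = np.random.default_rng(0)
img_gt = rng.random((64, 64, 3))
loss = total_loss(img_gt + 0.01, img_gt,             # near-perfect render
                  rng.random(128), rng.random(128),  # placeholder feature maps
                  np.ones(64), np.ones(64))          # identical identity embeddings
```

In practice the perceptual and identity terms would be computed by frozen pretrained networks over the rendered and ground-truth images, not on raw vectors as here.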
Key Experimental Results¶
Main Results¶
| Capability | LCA | Prev. SOTA | Notes |
|---|---|---|---|
| Identity generalization | World-scale population coverage | Thousands of identities | Hair / clothing / skin tone / accessories |
| Expression control | Fine-grained facial + finger-level | Coarse-grained | Significantly enhanced by post-training |
| 3D consistency | Strong | Weak in in-the-wild methods | Synergy of pretraining + post-training |
| Feed-forward inference | Efficient | Requires optimization | Generated from a few images |
Ablation Study¶
| Configuration | Generalization | Expression Accuracy | 3D Consistency | Notes |
|---|---|---|---|---|
| Pretrain only | Strong | Weak (blurry expressions) | Moderate (3D distortion) | Broad priors but insufficient precision |
| Post-train only | Weak | Strong | Strong | High quality but poor generalization |
| Pretrain + post-train | Strong | Strong | Strong | Best balance |
Key Findings¶
- Emergent capabilities are observed: without direct supervision, the model spontaneously acquires relighting ability, support for loose clothing, and zero-shot robustness to stylized input images.
- Priors acquired during pretraining are not overwritten during post-training—analogous to capability retention in LLMs.
- The scale of 1 million videos is critical for generalization ability.
Highlights & Insights¶
- Transfer of the LLM paradigm to 3D: This is the first work to demonstrate that the pre/post-training paradigm breaks the generalization–fidelity trade-off in the 3D avatar domain.
- Emergent capabilities: The emergence of relighting support and stylization robustness suggests that large-scale data enables the model to acquire deep physical and semantic understanding.
- Efficient feed-forward inference: High-quality avatars can be generated from only a few images, making the approach suitable for practical deployment.
Limitations & Future Work¶
- Pretraining on 1 million videos incurs extremely high computational costs (Meta-scale resources required).
- The fidelity of the body region is lower than that of the face.
- The degree of open-sourcing remains uncertain, and reproducibility has yet to be verified.
Related Work & Insights¶
- vs. Codec Avatars series: Traditional Codec Avatars require per-identity optimization; LCA is feed-forward.
- vs. TRELLIS/Rodin: These methods operate at a smaller scale; LCA is the first to achieve million-scale pretraining.
- vs. Real3D-Portrait: Single-image methods have limited fidelity; LCA benefits from multi-image input and large-scale pretraining.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First successful application of the pretraining/post-training paradigm to 3D avatars.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive demonstrations, though quantitative evaluation could be more standardized.
- Writing Quality: ⭐⭐⭐⭐⭐ Fluent narrative with deep insights.
- Value: ⭐⭐⭐⭐⭐ Transformative implications for the 3D digital human industry.