LCA: Large-scale Codec Avatars - The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Conference: CVPR 2026
arXiv: 2604.02320
Code: https://junxuan-li.github.io/lca
Area: Human Understanding / 3D Vision
Keywords: 3D avatars, large-scale pretraining, feed-forward generation, Gaussian splatting, expression control

TL;DR

LCA is the first work to apply the large-scale pretraining/post-training paradigm to 3D avatar modeling: it pretrains on 1 million in-the-wild videos to acquire broad appearance and geometry priors, then post-trains on high-quality multi-view studio data to enhance fine-grained expression fidelity, effectively breaking the inherent trade-off between generalizability and fidelity.

Background & Motivation

High-quality 3D avatar modeling faces a fundamental trade-off: studio data enables high-fidelity avatars but generalizes poorly beyond the captured subjects, whereas in-the-wild data covers far more identities but yields lower quality due to 3D ambiguity and distortion.

Core Insight: The approach is inspired by LLMs and visual foundation models, where large-scale pretraining learns general priors and a small amount of high-quality post-training data aligns the model to the target task. This work is the first to demonstrate that this paradigm is equally effective in the 3D avatar domain.

Method

Overall Architecture

A two-branch architecture: reference image tokens + template body mesh tokens → large Transformer fusion → canonical MLP outputting Gaussian attributes + correction MLP outputting offsets under driving signals → LBS transformation to target pose → 3DGS rendering. Pretraining is conducted on 1M in-the-wild videos; post-training is conducted on multi-view studio data covering thousands of identities.
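The last two stages of this pipeline (adding correction offsets to canonical Gaussians, then applying LBS) can be illustrated with a minimal sketch. It assumes each Gaussian center carries skinning weights over a set of joints and each joint has a rigid 3x4 transform. The function name `lbs_transform`, the two-joint setup, and all numeric values are hypothetical and not taken from the paper. Only Gaussian centers are transformed here; Gaussian rotations and scales would need analogous handling.

```python
def lbs_transform(points, weights, joint_transforms):
    """Linear blend skinning: move canonical points to the target pose.

    points:           list of [x, y, z] canonical positions
    weights:          per-point skinning weights (one per joint, summing to 1)
    joint_transforms: per-joint 3x4 matrices [R | t] as nested lists
    """
    posed = []
    for (x, y, z), point_weights in zip(points, weights):
        out = [0.0, 0.0, 0.0]
        for w_j, T in zip(point_weights, joint_transforms):
            for r in range(3):
                # Apply joint j's rigid transform, then blend by its weight.
                out[r] += w_j * (T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3])
        posed.append(out)
    return posed


# Joint 0 stays fixed; joint 1 translates by +1 along x (hypothetical values).
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

canonical = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
offsets = [[0.0, 0.0, 0.1], [0.0, 0.0, 0.0]]  # stand-in for correction-branch output
corrected = [[c + o for c, o in zip(p, d)] for p, d in zip(canonical, offsets)]
skin_weights = [[1.0, 0.0], [0.5, 0.5]]

posed = lbs_transform(corrected, skin_weights, [identity, shift_x])
# The second point is bound equally to both joints, so it moves half the
# translation of joint 1: [0.5, 1.0, 0.0].
```

Applying offsets in canonical space before skinning means the correction only has to model pose- and expression-dependent detail, while the articulated motion itself comes from the skeleton.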

Key Designs

Scalable Dual-Branch Architecture:
- Function: Jointly supports training on both studio and in-the-wild data.
- Mechanism: Image tokens are derived from a general-purpose visual encoder; geometry tokens are derived from a canonicalized body template mesh. The Transformer backbone employs a hybrid attention scheme (alternating global attention and per-image self-attention) to handle a variable number of input images. The canonical branch outputs canonical Gaussian attributes; the correction branch outputs attribute offsets conditioned on driving signals.
- Design Motivation: Eliminates the need for high-quality conditioning data (e.g., geometry and texture maps), enabling seamless switching between different data sources.
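The hybrid attention scheme can be pictured as two kinds of attention masks that alternate across layers. The sketch below is an illustration under assumptions: the function name `attention_mask`, the even/odd layer assignment, and the token layout are hypothetical. How the template mesh tokens take part in each layer is not described in this summary, so the sketch covers image tokens only.

```python
def attention_mask(image_ids, layer_index):
    """Return an N x N boolean mask; True means token i may attend to token j.

    image_ids:   per-token index of the source image, e.g. [0, 0, 1, 1, 1]
    layer_index: even layers use global attention, odd layers use per-image
                 self-attention (this assignment is an assumption)
    """
    n = len(image_ids)
    if layer_index % 2 == 0:
        # Global attention: every token sees every token from every image.
        return [[True] * n for _ in range(n)]
    # Per-image attention: a token only sees tokens from its own image.
    return [[image_ids[i] == image_ids[j] for j in range(n)] for i in range(n)]


# Two reference images contributing 2 and 3 tokens respectively.
token_image_ids = [0, 0, 1, 1, 1]
global_mask = attention_mask(token_image_ids, layer_index=0)
per_image_mask = attention_mask(token_image_ids, layer_index=1)
```

Because the mask is built from the per-token image index, the same code works for any number of reference images, which is the property the design relies on.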
Pretraining → Post-training Paradigm:
- Function: Achieves the optimal balance between generalizability and fidelity.
- Mechanism: The pretraining stage learns broad priors over human appearance and geometry from 1M in-the-wild videos. The post-training stage specializes on multi-view studio data to enhance the granularity and 3D consistency of facial expressions. Post-training augments rather than overwrites the generalization capability acquired during pretraining.
- Design Motivation: Analogous to LLM pretraining + RLHF: pretraining provides capability, post-training provides quality.
Self-supervised Expression Encoding:
- Function: Learns fine-grained facial expression control signals.
- Mechanism: A FACS-inspired self-supervised approach is used to learn latent expression codes, which serve as driving signals for the correction branch. Combined with SMPL-X body and hand poses, this enables fine-grained full-body control.
- Design Motivation: Expression is the most critical control dimension for avatars, requiring accuracy beyond that of parametric face models.
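To make the role of the driving signal concrete, the sketch below concatenates a latent expression code with body pose parameters and feeds them, together with a per-Gaussian feature, through a small MLP that predicts a position offset. Everything here is an assumption for illustration: the dimensions, the one-hidden-layer network, the random weights, and the names `correction_mlp` and `init_params` are hypothetical and do not reflect the paper's actual network.

```python
import random


def linear(x, W, b):
    """Dense layer: W is a list of rows, b a list of biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_params(in_dim, hidden_dim, out_dim, seed=0):
    """Small random weights; a trained model would load learned values."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": rand(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "W2": rand(out_dim, hidden_dim), "b2": [0.0] * out_dim,
    }


def correction_mlp(feature, driving, params):
    """Predict an attribute offset from a Gaussian feature and driving signal."""
    hidden = relu(linear(feature + driving, params["W1"], params["b1"]))
    return linear(hidden, params["W2"], params["b2"])


expression_code = [0.2, -0.1, 0.4, 0.0]   # hypothetical 4-D latent expression code
body_pose = [0.0, 0.3, -0.2]              # hypothetical stand-in for SMPL-X pose
driving = expression_code + body_pose     # list concatenation

gaussian_feature = [0.5, -0.5]
canonical_position = [0.0, 1.6, 0.1]

params = init_params(in_dim=len(gaussian_feature) + len(driving), hidden_dim=8, out_dim=3)
offset = correction_mlp(gaussian_feature, driving, params)
driven_position = [c + o for c, o in zip(canonical_position, offset)]
```

The canonical attribute stays fixed per identity, and only the offset changes with the driving signal, so a new expression or pose requires re-running the correction branch alone.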

Loss & Training

The training objective combines a 3DGS rendering loss (L1 + D-SSIM), a perceptual loss, and an identity preservation loss. Training follows the two-stage schedule described above: pretraining on 1M in-the-wild videos, then post-training on multi-view studio data.
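A minimal sketch of how such an objective could be assembled is shown below. The L1 + D-SSIM term follows the form used in the original 3D Gaussian Splatting work, with D-SSIM taken as 1 - SSIM and a mixing weight of 0.2; whether LCA uses the same weights is an assumption. The SSIM here is computed from global image statistics rather than the usual sliding window, purely to keep the example short. The perceptual and identity terms are passed in as precomputed scalars, and their weights `w_perceptual` and `w_identity` are hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM over whole images given as flat lists of pixel values."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def rendering_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, with D-SSIM = 1 - SSIM."""
    l1 = sum(abs(a - b) for a, b in zip(rendered, target)) / len(rendered)
    d_ssim = 1.0 - global_ssim(rendered, target)
    return (1.0 - lam) * l1 + lam * d_ssim


def total_loss(rendered, target, perceptual, identity,
               w_perceptual=1.0, w_identity=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (rendering_loss(rendered, target)
            + w_perceptual * perceptual
            + w_identity * identity)


target_pixels = [0.1, 0.5, 0.9, 0.3]
rendered_pixels = [0.2, 0.4, 0.8, 0.3]
loss = total_loss(rendered_pixels, target_pixels, perceptual=0.05, identity=0.02)
```

In a real training loop the perceptual and identity terms would come from pretrained feature extractors, and all terms would be computed on differentiable tensors rather than Python lists.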

Key Experimental Results

Main Results

| Capability | LCA | Prev. SOTA | Notes |
| --- | --- | --- | --- |
| Identity generalization | World-scale population coverage | Thousands of identities | Hair / clothing / skin tone / accessories |
| Expression control | Fine-grained facial + finger-level | Coarse-grained | Significantly enhanced by post-training |
| 3D consistency | Strong | Weak in in-the-wild methods | Synergy of pretraining + post-training |
| Feed-forward inference | Efficient | Requires optimization | Generated from a few images |

Ablation Study

| Configuration | Generalization | Expression Accuracy | 3D Consistency | Notes |
| --- | --- | --- | --- | --- |
| Pretrain only | Strong | Weak (blurry expressions) | Moderate (3D distortion) | Broad priors but insufficient precision |
| Post-train only | Weak | Strong | Strong | High quality but poor generalization |
| Pretrain + post-train | Strong | Strong | Strong | Best balance |

Key Findings

Emergent capabilities are observed: without direct supervision, the model exhibits relighting, support for loose clothing, and zero-shot robustness to stylized images.
Priors acquired during pretraining are not overwritten during post-training, analogous to capability retention in LLMs.
The scale of 1 million videos is critical for generalization ability.

Highlights & Insights

Transfer of the LLM paradigm to 3D: This is the first work to demonstrate that the pre/post-training paradigm breaks the generalization–fidelity trade-off in the 3D avatar domain.
Emergent capabilities: Relighting support and robustness to stylized inputs appear without direct supervision, which suggests that large-scale data lets the model learn physical and semantic regularities implicitly.
Efficient feed-forward inference: High-quality avatars can be generated from only a few images, making the approach suitable for practical deployment.

Limitations & Future Work

Pretraining on 1 million videos incurs extremely high computational costs (Meta-scale resources required).
The fidelity of the body region is lower than that of the face.
The degree of open-sourcing remains uncertain, and reproducibility has yet to be verified.

Comparison with Related Work

vs. Codec Avatars series: Traditional Codec Avatars require per-identity optimization; LCA is feed-forward.
vs. TRELLIS/Rodin: These methods operate at a smaller scale; LCA is the first to pretrain 3D avatars at million-video scale.
vs. Real3D-Portrait: Single-image methods have limited fidelity; LCA benefits from multi-image input and large-scale pretraining.

Rating

Novelty: ⭐⭐⭐⭐⭐ First successful application of the pretraining/post-training paradigm to 3D avatars.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive demonstrations, though quantitative evaluation could be more standardized.
Writing Quality: ⭐⭐⭐⭐⭐ Fluent narrative with deep insights.
Value: ⭐⭐⭐⭐⭐ Transformative implications for the 3D digital human industry.