AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

- Conference: CVPR 2026
- arXiv: 2604.04787
- Code: https://kumapowerliu.github.io/AvatarPointillist
- Area: 3D Vision / Digital Human Generation
- Keywords: 4D Avatar, Autoregressive, 3D Gaussian Splatting, Point Cloud Generation, One-shot

TL;DR

AvatarPointillist proposes an autoregressive (AR) generative framework for constructing 4D Gaussian avatars: a decoder-only Transformer generates the 3DGS point cloud (with binding information) token by token, and a Gaussian Decoder then predicts rendering attributes for each point. The approach is not tied to a fixed template topology, adapts point density to each identity, and outperforms baselines such as LAM and GAGAvatar on every reported NeRSemble metric.

Background & Motivation

Background: Generating drivable 3D avatars from a single portrait image is critical for applications in VR, telepresence, and film production. Existing methods fall into two paradigms: 2D animation (GAN/diffusion) and 3D reconstruction (NeRF/3DGS).

Fundamental Limitations of 2D Methods: These methods lack 3D structural awareness, suffer from geometric distortion under extreme poses, and cannot render from arbitrary viewpoints.

Root Cause in 3DGS Methods:
- GAGAvatar: Lifts 2D features to 3D, bypassing a complete point cloud representation and requiring auxiliary 2D networks for correction.
- LAM: Uses fixed FLAME vertices as a template point cloud, assigning the same number of Gaussians to every identity. This prevents the model from adaptively adjusting point density to capture identity-specific features such as beards or distinctive hairstyles.
- Key Challenge: Fixed topology discards the most fundamental advantage of 3DGS: adaptive control of point distribution according to geometric complexity.

Core Problem: Can a generative model be designed to directly learn the 3DGS point cloud distribution without relying on a fixed template, allowing the model to autonomously decide where to place points and how many?

Core Idea: Reformulate 3DGS avatar generation as an autoregressive sequence generation task — predicting 3D coordinates and binding indices point by point — to fully embrace the adaptive and dynamic nature of 3DGS.

Method

Overall Architecture

Two stages:
1. AR Model: Portrait image features → decoder-only Transformer → token-by-token generation of quantized point clouds $(T_n^x, T_n^y, T_n^z, T_n^b)$
2. Gaussian Decoder: De-quantized coordinates + AR hidden features → Transformer → predicts full Gaussian attributes per point (color, opacity, scale, rotation, displacement offset)

Key Designs

Data Construction and Quantization:
- GaussianAvatars is applied to fit 3DGS for each identity in NeRSemble; each Gaussian is bound to a specific face of the FLAME mesh.
- A canonical FLAME mesh is used to compute a globally canonical Gaussian point cloud.
- Point cloud ordering: sorted along the y-z-x axes to ensure identical point clouds produce identical sequences.
- Coordinate quantization: 1024 discrete levels (balancing precision and efficiency).
- Binding token: $T_n^b = b_n + 1024$ (offset to a distinct region of the vocabulary), $b_n \in [0, 10143]$.
- Sequence format: $(T_1^x, T_1^y, T_1^z, T_1^b, ..., T_N^x, T_N^y, T_N^z, T_N^b)$, with Start/End/Padding tokens.
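
The quantization and sequence layout described above can be sketched as follows. This is a minimal illustration, not the authors' code: the coordinate range `[lo, hi]`, the floor-based quantizer with bin-center reconstruction, sorting on quantized rather than raw coordinates, and the ids chosen for the Start/End/Padding tokens are all assumptions. Only the 1024 levels, the `+1024` binding offset, the `[0, 10143]` binding range, and the y-z-x ordering come from the text.

```python
import numpy as np

NUM_BINS = 1024           # coordinate quantization levels (from the text)
NUM_FACES = 10144         # binding indices b_n lie in [0, 10143] (from the text)
BIND_OFFSET = NUM_BINS    # T^b = b + 1024
# Hypothetical special-token ids; the text only states that such tokens exist.
START = BIND_OFFSET + NUM_FACES
END = START + 1
PAD = START + 2           # total vocabulary under this assumption: 11171


def quantize(xyz, lo, hi):
    """Map coordinates in [lo, hi] to integer bins 0..NUM_BINS-1."""
    unit = (xyz - lo) / (hi - lo)
    return np.clip(np.floor(unit * NUM_BINS), 0, NUM_BINS - 1).astype(np.int64)


def dequantize(bins, lo, hi):
    """Return bin centers (an assumed reconstruction rule)."""
    return lo + (bins + 0.5) / NUM_BINS * (hi - lo)


def build_sequence(xyz, binding, lo=-1.0, hi=1.0):
    """Turn N points and their face bindings into a flat sequence of 4N + 2 tokens."""
    q = quantize(xyz, lo, hi)
    # Sort by y, then z, then x. np.lexsort treats the LAST key as the primary key.
    order = np.lexsort((q[:, 0], q[:, 2], q[:, 1]))
    q, b = q[order], binding[order]
    body = np.concatenate([q, (b + BIND_OFFSET)[:, None]], axis=1).reshape(-1)
    return np.concatenate([[START], body, [END]])


def parse_sequence(seq, lo=-1.0, hi=1.0):
    """Invert build_sequence: recover de-quantized points and binding indices."""
    body = seq[1:-1].reshape(-1, 4)
    return dequantize(body[:, :3], lo, hi), body[:, 3] - BIND_OFFSET
```

Because the sort is applied to the points themselves, shuffling the input rows yields the same token sequence (as long as no two points share all three quantized coordinates), which is the property the ordering step is meant to provide.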
AR Model Architecture:
- Decoder-only Transformer; each layer comprises cross-attention + self-attention + FFN.
- Identity injection: DINOv2 extracts image features; Pixel3DMM provides FLAME parameters → a point cloud encoder extracts mesh features → concatenated features are injected via cross-attention.
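- Standard next-token prediction objective over the flattened sequence of $4N$ tokens: $p(T) = \prod_{i=1}^{4N} p(T_i \mid T_{<i})$, where $i$ indexes individual tokens rather than points.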
- Truncated training strategy: Given the long point cloud sequences, a sliding window (window size 12,000) is used for segmented training to improve efficiency.
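
To make the training objective and the windowing concrete, here is a small NumPy sketch. It assumes non-overlapping windows by default (the `stride` parameter is hypothetical; the text gives only the window size of 12,000) and computes the next-token cross-entropy from raw logits. The function names are illustrative, not taken from the paper.

```python
import numpy as np


def split_into_windows(tokens, window=12_000, stride=None):
    """Cut a long token sequence into training segments of at most `window` tokens."""
    stride = window if stride is None else stride
    return [tokens[s:s + window] for s in range(0, len(tokens), stride)]


def next_token_nll(logits, tokens):
    """Mean cross-entropy of predicting tokens[i + 1] from logits[i].

    logits: (L, V) array of unnormalized scores, one row per input position.
    tokens: (L,) integer array of token ids.
    """
    logits = logits[:-1]                      # the last position has no target
    targets = tokens[1:]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

For example, a sequence of 30,000 tokens is split into segments of 12,000, 12,000, and 6,000 tokens, and the loss is evaluated per segment.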
Dual-Input Design of the Gaussian Decoder (Core Innovation):
- Positional feature $P_n$: De-quantized coordinates → positional encoding → MLP.
- AR feature $F_n^p$: Final hidden states extracted from the AR Transformer; an MLP aggregates the hidden features of 4 tokens into a single per-point feature.
- Both are concatenated and fed into the Gaussian Decoder.
- Design Motivation: AR hidden states carry rich semantic information accumulated during generation; positional features alone are insufficient for high-quality rendering.
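
The dual-input assembly can be sketched as below. This is an illustration under stated assumptions: a sinusoidal positional encoding with `num_freqs` frequencies is assumed (the text says only "positional encoding"), each MLP is replaced by a single linear layer for brevity, and the parameter dictionary layout is hypothetical. What comes from the text is the grouping of four token hidden states into one per-point feature and the concatenation with the positional feature.

```python
import numpy as np


def positional_encoding(xyz, num_freqs=6):
    """Sinusoidal encoding of 3D coordinates: (N, 3) -> (N, 3 * 2 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = xyz[:, :, None] * freqs                          # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(xyz), -1)


def linear(x, w, b):
    """Single linear layer standing in for an MLP."""
    return x @ w + b


def decoder_inputs(xyz, hidden, params):
    """Build per-point inputs for the Gaussian Decoder.

    xyz:    (N, 3) de-quantized point coordinates.
    hidden: (4N, D) final AR hidden states, four consecutive rows per point.
    params: {"pos": (w, b), "agg": (w, b)} hypothetical layer weights.
    """
    n, d = len(xyz), hidden.shape[1]
    p = linear(positional_encoding(xyz), *params["pos"])      # positional feature P_n
    f = linear(hidden.reshape(n, 4 * d), *params["agg"])      # AR feature F_n^p
    return np.concatenate([p, f], axis=1)
```

The reshape relies on the sequence format, in which the x, y, z, and binding tokens of one point are adjacent, so rows `4n` to `4n + 3` of `hidden` belong to point `n`.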
Animation and Driving:
- The AR model predicts binding information per point ($T_n^b$ → corresponding FLAME face index).
- LBS weights $\hat{\mathbf{w}}_i$ and expression blendshapes $\hat{\mathbf{S}}_i$ are obtained via barycentric interpolation.
- Given pose $\boldsymbol{\theta}$ and expression $\boldsymbol{\psi}$ parameters, animation follows the standard FLAME deformation pipeline.
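
A minimal sketch of the barycentric interpolation step, assuming each Gaussian is projected onto the plane of its bound FLAME face to obtain its barycentric coordinates (the text does not say how the coordinates are chosen). The helper names and array layouts are illustrative; `attr` can hold per-vertex LBS weights or flattened blendshape offsets.

```python
import numpy as np


def barycentric_coords(p, tri):
    """Barycentric coordinates of point p projected onto the plane of tri (3, 3)."""
    a, b, c = tri
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])


def interpolate_vertex_attr(p, face_idx, vertices, faces, attr):
    """Interpolate a per-vertex attribute at point p bound to face `face_idx`.

    vertices: (V, 3) mesh vertices; faces: (F, 3) vertex indices;
    attr: (V, K) per-vertex attribute, e.g. LBS weights over K joints.
    """
    vid = faces[face_idx]
    bary = barycentric_coords(p, vertices[vid])
    return bary @ attr[vid]
```

Since barycentric coordinates sum to one, interpolated LBS weights remain normalized whenever the per-vertex weights are.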

Loss & Training

AR Model: Standard cross-entropy loss; AdamW lr=1e-4; 16×H20 GPUs; 50K steps; batch size 4.
Gaussian Decoder (trained with AR model frozen): $$\mathcal{L}_{total} = \lambda_{L1}\mathcal{L}_{L1} + \lambda_{SSIM}\mathcal{L}_{SSIM} + \lambda_{LPIPS}\mathcal{L}_{LPIPS} + \lambda_{Reg}\mathcal{L}_{Reg}$$
$\lambda_{L1}=1, \lambda_{SSIM}=0.5, \lambda_{LPIPS}=0.1, \lambda_{Reg}=0.1$
8×H20 GPUs; 12,500 steps.
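
As a worked instance of the weighted sum above, the following sketch combines already-computed loss terms using the listed weights. The individual terms (L1, SSIM, LPIPS, regularization) are assumed to be computed elsewhere; the dictionary keys and function name are hypothetical.

```python
# Weights as listed for the Gaussian Decoder objective.
LOSS_WEIGHTS = {"l1": 1.0, "ssim": 0.5, "lpips": 0.1, "reg": 0.1}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of scalar loss terms; `terms` must contain every weighted key."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)
```

With illustrative values `l1=0.2`, `ssim=0.1`, `lpips=0.3`, `reg=0.05`, the total is `0.2 + 0.05 + 0.03 + 0.005 = 0.285`.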

Key Experimental Results

Main Results (NeRSemble Dataset)

| Method | LPIPS↓ | FID↓ | AKD↓ | APD↓ | Cross-FID↓ | Cross-CLIP↑ |
|---|---|---|---|---|---|---|
| Portrait4Dv2 | 0.20 | 123.02 | 5.32 | 34.53 | 191.13 | 0.63 |
| AvatarArtist | 0.21 | 118.94 | 6.87 | 39.58 | 175.69 | 0.61 |
| LAM | 0.24 | 136.01 | 4.37 | 61.83 | 238.54 | 0.54 |
| GAGAvatar | 0.18 | 111.76 | 3.93 | 27.94 | 181.22 | 0.71 |
| Ours | 0.15 | 95.18 | 2.38 | 22.86 | 160.74 | 0.75 |

Ablation Study

| Configuration | LPIPS↓ | FID↓ | AKD↓ | APD↓ | Notes |
|---|---|---|---|---|---|
| FLAME Position | 0.23 | 120.34 | 4.82 | 41.22 | Fixed FLAME template (LAM-style) |
| AR Feature only | 0.22 | 110.93 | 5.89 | 32.96 | AR hidden features only |
| AR Position only | 0.19 | 103.80 | 5.81 | 41.49 | Positional encoding only |
| Full (Ours) | 0.15 | 95.18 | 2.38 | 22.86 | Position + AR feature dual input |

Key Findings

AR point cloud generation vs. fixed FLAME template: FID drops from 120.34 to 95.18, confirming the advantage of adaptive point distribution.
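Jointly using both positional features and AR features in the Gaussian Decoder is critical: with AR features only, FID rises from 95.18 to 110.93, and with positional features only, to 103.80; AKD more than doubles in both cases (2.38 vs. 5.89 and 5.81).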
The FLAME Position baseline fails to capture identity-specific geometry (e.g., ponytails, dense beards), with qualitative results showing a marked gap.
Visualization of autoregressive point clouds demonstrates clearly adaptive density distribution — geometrically complex regions (hair, beard) exhibit denser point placement.

Highlights & Insights

Reformulating 3DGS point cloud generation as autoregressive token prediction represents a paradigm innovation, granting the generative model genuine freedom in deciding where to place points and how many.
Passing AR hidden features to the Gaussian Decoder is an elegant design — the semantic context accumulated during generation substantially improves rendering quality.
Binding prediction makes the generated point cloud natively animatable without additional post-processing.
The name "Pointillist" is apt — each Gaussian point resembles a painter's brushstroke, adaptively composing a complete scene.

Limitations & Future Work

Autoregressive generation produces very long sequences (tens of thousands of tokens), making inference slower than feed-forward methods that produce the avatar in a single pass.
Training data is limited to NeRSemble (419 identities); generalization to larger and more diverse populations remains to be validated.
The pipeline relies on GaussianAvatars fitting to construct training data, making data quality dependent on fitting quality.
Discretization error introduced by 1024-level quantization may cause artifacts in fine-grained regions (e.g., around the eyes).
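
As a rough bound on that discretization error, assume uniform quantization of each coordinate over an extent $R$ into $K = 1024$ bins, with reconstruction at bin centers and the same extent on all three axes (the exact quantizer is not specified here, so these are assumptions). The per-axis error is then at most half a bin width:

$$\Delta = \frac{R}{K}, \qquad |x - \hat{x}| \le \frac{\Delta}{2} = \frac{R}{2048}, \qquad \|\mathbf{x} - \hat{\mathbf{x}}\|_2 \le \frac{\sqrt{3}\,R}{2048}$$

For a hypothetical extent of $R = 0.5$ m, this is about 0.24 mm per axis; the bound scales linearly with the size of the canonical bounding box.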

Related Work & Insights

LAM is the most direct point of comparison: the core distinction is a fixed template versus autoregressive generation.
MeshGPT's formulation of mesh generation as an AR task is a direct source of inspiration.
The quantization + AR paradigm is generalizable to full-body avatars and scene-level 3DGS generation.

Rating

Novelty: ⭐⭐⭐⭐⭐ First application of AR sequence generation to 3DGS avatars; paradigm innovation is clear.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons and detailed ablations, though evaluation is limited to a single dataset.
Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; method description is detailed.
Value: ⭐⭐⭐⭐⭐ Opens a new direction for 3DGS avatar generation; the advantages of adaptive point distribution have broad implications.