ExpPortrait: Expressive Portrait Generation via Personalized Representation

Conference: CVPR 2026
arXiv: 2602.19900
Code: None
Area: Portrait Generation / Face Reenactment
Keywords: Portrait Animation, Personalized Head Representation, Expression Transfer, Diffusion Transformer, SMPL-X

TL;DR

This paper proposes a high-fidelity personalized head representation (static identity offset + dynamic expression offset) to address the limited expressiveness of parametric models such as SMPL-X. Combined with an identity-adaptive expression transfer module and a DiT-based generator, the method achieves state-of-the-art performance on both self-driven portrait video animation and cross-identity reenactment tasks.

Background & Motivation

The core challenge in portrait video generation is achieving a balance between fine-grained expression control and identity consistency. Existing intermediate representations suffer from fundamental limitations:

2D Keypoints: Sparse signals lacking geometric detail, unstable under large poses.

3D Parametric Models (SMPL-X/FLAME): Low-rank linear approximations with predefined blendshapes that cannot model high-frequency nonlinear dynamics (e.g., wrinkles), leading to severe entanglement between identity and expression.

Implicit Motion Features: Weakly controllable with insufficient disentanglement, prone to identity leakage.

Root Cause: The low-dimensional template subspace of parametric models limits expressiveness, making it impossible to capture subject-specific anatomical structures and dynamic wrinkles, and hindering the simultaneous preservation of identity and richness of expression.

Method

Overall Architecture

A three-stage pipeline: (1) construct a personalized high-detail head representation based on SMPL-X → (2) apply an identity-adaptive expression transfer module for cross-identity expression transfer → (3) train a DiT video generator conditioned on personalized normal maps.

Key Designs

Personalized Head Representation: Two complementary offset fields are superimposed on the coarse SMPL-X base:
- Static Global Offset \(\Delta_g^s \in \mathbb{R}^{N_s \times 3}\): Captures expression-independent personalized geometric details (e.g., face shape, hairline, shoulder contour), constrained to non-facial regions.
- Dynamic Per-Frame Offset \(\Delta_f^s(i) \in \mathbb{R}^{N_s \times 3}\): Captures expression-related dynamics per frame (e.g., wrinkles, micro-expressions), constrained to facial regions.

The detailed mesh is formulated as: \(\widetilde{V}^s(i) = V^s + \Delta_g^s + \Delta_f^s(i)\)

where \(V \in \mathbb{R}^{N \times 3}\) is the coarse SMPL-X mesh and \(V^s = \mathcal{B}(V) \in \mathbb{R}^{N_s \times 3}\) is its high-resolution counterpart, obtained by upsampling with barycentric interpolation (\(N_s \gg N\)). Disentanglement is achieved through spatial constraints (the static offset is restricted to non-facial regions, the dynamic offset to facial regions) and temporal regularization (a penalty on offset magnitude + Laplacian smoothing).
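The composition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the boolean `face_mask`, and the choice to enforce the spatial constraint by masking are all assumptions made for the example.

```python
import numpy as np

def barycentric_upsample(verts, faces, face_ids, bary):
    """Sample points on a triangle mesh.

    verts: (N, 3) coarse vertices; faces: (F, 3) vertex indices;
    face_ids: (Ns,) triangle index per sample; bary: (Ns, 3) weights summing to 1.
    """
    tri = verts[faces[face_ids]]               # (Ns, 3 corners, 3 coords)
    return np.einsum("sc,scd->sd", bary, tri)  # weighted sum of the corners

def detailed_mesh(v_s, delta_g, delta_f, face_mask):
    """V~(i) = V^s + Delta_g + Delta_f(i), with the spatial constraint applied as masks."""
    m = face_mask[:, None].astype(v_s.dtype)   # 1 inside the facial region, 0 elsewhere
    return v_s + (1.0 - m) * delta_g + m * delta_f

# Toy example: one triangle, sampled at its three corners plus the centroid.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
face_ids = np.zeros(4, dtype=int)
bary = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1 / 3, 1 / 3, 1 / 3]], dtype=float)
v_s = barycentric_upsample(verts, faces, face_ids, bary)

face_mask = np.array([False, False, True, True])  # hypothetical facial-region labels
delta_g = np.full((4, 3), 0.1)                    # static identity offset
delta_f = np.full((4, 3), -0.2)                   # dynamic offset for one frame
v_tilde = detailed_mesh(v_s, delta_g, delta_f, face_mask)
```

With the masks in place, a non-facial vertex only ever moves by the static offset and a facial vertex only by the per-frame offset, which is one simple way to keep the two fields from explaining the same geometry.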

Optimization objectives include:
- Sparse landmark loss: \(\mathcal{L}_{\text{ldmk}} = \|\Pi(L_{\text{3D}}(i), \mathbf{c}) - L_{\text{2D}}(i)\|_2^2\)
- Dense normal/depth supervision: \(\mathcal{L}_{\text{normal}} = \|\hat{N}_i - N_i\|_1\), \(\mathcal{L}_{\text{depth}} = \|\hat{D}_i - D_i\|_1\)
- Expression coefficient regularization + displacement magnitude penalty + Laplacian smoothing
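As a rough sketch of how these terms could be evaluated, the snippet below implements each one in NumPy. The pinhole camera standing in for \(\Pi(\cdot, \mathbf{c})\), the uniform (neighbor-mean) Laplacian, and the loss weights are assumptions for illustration; the notes do not specify them.

```python
import numpy as np

def project(points_3d, focal, center):
    """Pinhole projection standing in for Pi(., c); assumes z > 0."""
    return focal * points_3d[:, :2] / points_3d[:, 2:3] + center

def landmark_loss(l3d, l2d, focal, center):
    """Squared L2 distance between projected 3D landmarks and 2D detections."""
    return float(np.sum((project(l3d, focal, center) - l2d) ** 2))

def l1_loss(pred, target):
    """L1 norm, usable for both the normal and the depth term."""
    return float(np.sum(np.abs(pred - target)))

def magnitude_penalty(offsets):
    """Keeps displacements small so the base mesh explains most of the shape."""
    return float(np.sum(offsets ** 2))

def laplacian_smoothness(offsets, neighbors):
    """Uniform Laplacian: each offset should stay close to the mean of its neighbors."""
    lap = np.stack([offsets[i] - offsets[nbrs].mean(axis=0)
                    for i, nbrs in enumerate(neighbors)])
    return float(np.sum(lap ** 2))

# Toy check with two landmarks and a three-vertex offset field.
l3d = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
l2d = np.array([[50.0, 50.0], [101.0, 50.0]])
center = np.array([50.0, 50.0])
offsets = np.full((3, 3), 0.05)
neighbors = [[1, 2], [0, 2], [0, 1]]

weights = {"ldmk": 1.0, "mag": 0.1, "lap": 0.1}  # hypothetical weights
total = (weights["ldmk"] * landmark_loss(l3d, l2d, 100.0, center)
         + weights["mag"] * magnitude_penalty(offsets)
         + weights["lap"] * laplacian_smoothness(offsets, neighbors))
```

In the toy data the offset field is constant, so the Laplacian term vanishes and only the landmark error and the magnitude penalty contribute to `total`.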

Identity-Adaptive Expression Transfer Module: Addresses the incompatibility of cross-identity expression offsets (e.g., a child should not inherit the deep wrinkle patterns of an elderly person):
- Driving Signal Encoder: Encodes expression coefficients \(\boldsymbol{\psi} \in \mathbb{R}^{F \times 100}\) and jaw pose \(\boldsymbol{\omega} \in \mathbb{R}^{F \times 3}\) into per-frame condition codes \(Q = \mathcal{E}(\boldsymbol{\psi}, \boldsymbol{\omega}) \in \mathbb{R}^{F \times D}\).
- Vertex-Level MLP: Conditioned on the target identity's neutral mesh \(V_{\text{neutral}} = V^s + \Delta_g^s\) and the driving code \(q_i\), it predicts personalized dynamic offsets:
\(\Delta_f^s(i) = \mathcal{G}(V_{\text{neutral}}, q_i) \in \mathbb{R}^{N_s \times 3}\)

The conditioning design ensures that transferred expressions are adapted to the anatomical structure of the target identity.
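The shapes involved are easier to follow with a toy version. The sketch below uses a single linear layer for the encoder \(\mathcal{E}\) and a one-hidden-layer MLP for \(\mathcal{G}\); both architectures, the random weights, and the toy sizes are assumptions for illustration and say nothing about the actual network design.

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, N_s, H = 4, 8, 6, 16  # frames, code width, vertices, hidden width (toy sizes)

# Driving-signal encoder E: a single linear layer here, purely for illustration.
W_enc = rng.normal(0.0, 0.1, (100 + 3, D))

def encode_driving(psi, omega):
    """Map expression coefficients (F, 100) and jaw pose (F, 3) to codes Q of shape (F, D)."""
    return np.concatenate([psi, omega], axis=1) @ W_enc

# Vertex-level MLP G with one hidden layer, shared across all vertices.
W1, b1 = rng.normal(0.0, 0.5, (3 + D, H)), np.zeros(H)
W2, b2 = rng.normal(0.0, 0.5, (H, 3)), np.zeros(3)

def predict_offsets(v_neutral, q_i):
    """Delta_f(i) = G(V_neutral, q_i): each vertex sees its own position plus the frame code."""
    q_tiled = np.broadcast_to(q_i, (v_neutral.shape[0], q_i.shape[0]))
    x = np.concatenate([v_neutral, q_tiled], axis=1)
    return np.tanh(x @ W1 + b1) @ W2 + b2

psi, omega = rng.normal(size=(F, 100)), rng.normal(size=(F, 3))
Q = encode_driving(psi, omega)

v_neutral_a = rng.normal(size=(N_s, 3))       # neutral mesh of identity A
v_neutral_b = rng.normal(size=(N_s, 3))       # neutral mesh of identity B
delta_a = predict_offsets(v_neutral_a, Q[0])
delta_b = predict_offsets(v_neutral_b, Q[0])  # same driving code, different identity
```

Because the neutral mesh is part of the input, the same driving code `Q[0]` yields different offsets for the two identities, which is the property the module relies on.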

DiT Video Generator: Fine-tunes a pretrained video generation model (DiT within the LDM framework):
- Control signals: Reference frame normal map \(N^R\) + driving sequence normal maps \(N_{1:F}^D\).
- A 3D convolutional pose encoder extracts spatio-temporal features; a 2D convolutional reference encoder extracts appearance cues.
- Standard noise prediction loss:
\(\mathcal{L}_{\text{ldm}} = \mathbb{E}_{z_0, \epsilon, t}[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2]\)
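A single training step of this objective can be sketched as follows. The linear beta schedule, the latent shape, and the two placeholder models are assumptions chosen to keep the example self-contained; the real \(\epsilon_\theta\) is the fine-tuned DiT, and its noise schedule is not given in these notes.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal retention for a linear beta schedule (an assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, eps, alpha_bar):
    """Forward process: z_t = sqrt(a_t) * z0 + sqrt(1 - a_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def ldm_loss(eps_model, z0, cond, alpha_bar, rng):
    """One-sample Monte Carlo estimate of E[||eps - eps_theta(z_t, t, c)||^2]."""
    t = int(rng.integers(0, len(alpha_bar)))
    eps = rng.normal(size=z0.shape)
    z_t = q_sample(z0, t, eps, alpha_bar)
    return float(np.sum((eps - eps_model(z_t, t, cond)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.normal(size=(2, 4, 8, 8))                # toy latent video: frames x channels x H x W
cond = {"ref_normal": np.zeros((3, 8, 8)),        # stands in for N^R
        "drive_normals": np.zeros((2, 3, 8, 8))}  # stands in for N^D_{1:F}

def zero_model(z_t, t, cond):
    """Placeholder for the DiT; always predicts zero noise."""
    return np.zeros_like(z_t)

def oracle_model(z_t, t, cond):
    """Recovers the true noise by inverting q_sample; only possible because z0 is known here."""
    a = alpha_bar[t]
    return (z_t - np.sqrt(a) * z0) / np.sqrt(1.0 - a)

loss_zero = ldm_loss(zero_model, z0, cond, alpha_bar, rng)
loss_oracle = ldm_loss(oracle_model, z0, cond, alpha_bar, rng)
```

The oracle drives the loss to numerical zero, while the zero predictor is left with the full noise energy; a trained model sits between the two, using the normal-map conditions to narrow the gap.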

Loss & Training

Data: VFHQ + CelebV-HQ + HDTF, totaling ~4,000 videos (~10 hours), at 512×512 resolution.
SMPL-X reconstruction and joint geometric optimization are first performed to obtain the personalized head model.
The expression transfer module is frozen before training the diffusion model.
Training: 30 epochs on 4×A800 GPUs, batch size 1 per GPU, learning rate \(10^{-4}\).
Evaluation datasets: RAVDESS (20 videos) + NeRSemble (80 videos).

Key Experimental Results

Main Results

Method	PSNR↑	SSIM↑	LPIPS↓	L1↓	AED↓	APD↓	CSIM↑
LivePortrait	23.29	0.830	0.373	0.046	0.129	0.021	0.830
Follow-Your-Emoji	25.69	0.841	0.236	0.029	0.147	0.015	0.803
X-NeMo	21.56	0.781	0.324	0.048	0.137	0.018	0.830
Ours	26.55	0.859	0.184	0.022	0.132	0.009	0.835

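On the self-driven task, the proposed method leads on six of the seven metrics: PSNR +0.86 and LPIPS −0.052 relative to Follow-Your-Emoji (the strongest baseline on both), and the lowest APD (0.009). The exception is AED, where LivePortrait is marginally better (0.129 vs. 0.132).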
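The margins can be recomputed directly from the table. The short script below is only a convenience for checking the arithmetic; the numbers are copied from the table, while the helper name and the per-metric direction flags are introduced here.

```python
results = {
    "LivePortrait":      {"PSNR": 23.29, "SSIM": 0.830, "LPIPS": 0.373, "L1": 0.046, "AED": 0.129, "APD": 0.021, "CSIM": 0.830},
    "Follow-Your-Emoji": {"PSNR": 25.69, "SSIM": 0.841, "LPIPS": 0.236, "L1": 0.029, "AED": 0.147, "APD": 0.015, "CSIM": 0.803},
    "X-NeMo":            {"PSNR": 21.56, "SSIM": 0.781, "LPIPS": 0.324, "L1": 0.048, "AED": 0.137, "APD": 0.018, "CSIM": 0.830},
    "Ours":              {"PSNR": 26.55, "SSIM": 0.859, "LPIPS": 0.184, "L1": 0.022, "AED": 0.132, "APD": 0.009, "CSIM": 0.835},
}

# Direction of each metric, following the arrows in the table header.
HIGHER_IS_BETTER = {"PSNR": True, "SSIM": True, "LPIPS": False, "L1": False,
                    "AED": False, "APD": False, "CSIM": True}

def margins_vs_best_baseline(results, ours="Ours"):
    """For each metric, return (best baseline, ours minus that baseline)."""
    out = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        baselines = {name: row[metric] for name, row in results.items() if name != ours}
        pick = max if higher else min
        best = pick(baselines, key=baselines.get)
        out[metric] = (best, round(results[ours][metric] - baselines[best], 3))
    return out

margins = margins_vs_best_baseline(results)
for metric, (name, delta) in margins.items():
    print(f"{metric}: {delta:+.3f} vs. {name}")
```

A negative delta is an improvement for the lower-is-better metrics (LPIPS, L1, AED, APD) and a positive delta is an improvement for the others.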

Cross-Identity Reenactment

Method	AED↓	APD↓	CSIM↑
LivePortrait	0.286	0.230	0.729
X-NeMo	0.171	0.021	0.722
Ours	0.211	0.013	0.729

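The proposed method attains the lowest pose error (APD 0.013) and ties LivePortrait for the highest identity similarity (CSIM 0.729). X-NeMo reaches a lower expression error (AED 0.171 vs. 0.211) but at a lower CSIM (0.722), so the proposed method offers the best overall trade-off between motion accuracy (AED/APD) and identity preservation (CSIM).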
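For readers unfamiliar with these scores, the sketch below shows the usual form of such metrics: CSIM as cosine similarity between identity embeddings, and AED/APD-style scores as a mean absolute distance between expression or pose parameters of the generated and driving frames. The exact feature extractors and normalization used in the paper are not stated in these notes, so the inputs here are hypothetical placeholders.

```python
import numpy as np

def csim(emb_a, emb_b):
    """Cosine similarity between two identity embeddings (e.g. from a face-recognition network)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

def mean_abs_distance(pred, target):
    """Mean L1 distance between per-frame parameter vectors, the typical form of AED/APD-style scores."""
    return float(np.mean(np.abs(pred - target)))

# Hypothetical inputs: identity embeddings and per-frame pose parameters.
same = csim(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = csim(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
pose_err = mean_abs_distance(np.array([[0.1, 0.2], [0.3, 0.4]]),
                             np.array([[0.1, 0.1], [0.2, 0.4]]))
```

Higher CSIM means the generated face stays closer to the reference identity, while lower distances mean the generated motion tracks the driving signal more closely.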

Ablation Study

Configuration	Qualitative Result	Interpretation
SMPL-X baseline	Stiff expressions, limited facial dynamics	Expressiveness ceiling of standard parametric models
Direct offset transfer	Weakened and unnatural expressions	Incompatibility of offsets across identities
Full model	Rich expressions + high fidelity	Personalized representation + adaptive transfer

Key Findings

The personalized head representation outperforms standard SMPL-X in the ablation, improving expression richness and identity fidelity at the same time.
Implicit motion methods (Hunyuan Portrait, X-NeMo) can produce realistic results but suffer from severe identity leakage.
Explicit control methods (AniPortrait, F-Y-E) are limited by sparse or low-rank signals, resulting in insufficient expressiveness.
The expression transfer module generates more vivid and natural expressions compared to direct offset transfer.

Highlights & Insights

The ceiling of intermediate representations determines generation quality: Rather than improving the generator, it is more effective to increase the information density and controllability of control signals.
Static + dynamic disentanglement design: Achieved through spatial constraints (facial/non-facial regions) and temporal regularization (zero-mean dynamic offsets), without requiring additional annotations.
Identity-adaptive mechanism: Conditional prediction avoids a one-size-fits-all approach to expression transfer and is consistent with anatomical differences across individuals.

Limitations & Future Work

Intra-oral regions are not modeled: Details such as the tongue cannot be precisely generated.
Eye movement is insufficiently refined: Fine-grained eye motion capture is lacking.
The limited amount of training data (~10 hours) may constrain generalization to extreme poses and expressions.
The approach has not been extended to full-body animation scenarios.

Related Work & Insights

Compared to the implicit keypoint approach of LivePortrait, the explicit 3D representation proposed here is more controllable and less prone to identity leakage.
Compared to the landmark-driven approach of Follow-Your-Emoji, the personalized offset field captures more high-frequency details.
Insight: The idea of personalized representation can be generalized to hand and full-body animation scenarios.

Rating

Novelty: ⭐⭐⭐⭐ The design of personalized offset fields combined with identity-adaptive transfer is innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation covering both self-driven and cross-driven settings, with clear ablations and convincing qualitative comparisons.
Writing Quality: ⭐⭐⭐⭐ Clear exposition with well-structured pipeline diagrams and formalized notation.
Value: ⭐⭐⭐⭐ Provides a superior intermediate representation scheme for high-fidelity portrait animation with strong practical value.