Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Conference: CVPR 2025
arXiv: 2405.00794
Code: https://research.nvidia.com/labs/amri/projects/stable3d
Area: 3D Vision
Keywords: 3D portrait reconstruction, triplane fusion, temporal coherence, telepresence, monocular video

TL;DR

The method fuses a personalized 3D prior with frame-by-frame observations in triplane space, achieving both temporal coherence and faithful reconstruction of dynamic appearance from monocular RGB video for 3D telepresence.

Background & Motivation

3D telepresence is a core technology for presenting distant people face-to-face in 3D. Existing methods face a dilemma:

Frame-by-frame 3D reconstruction (e.g., LP3D): Faithfully captures the dynamic appearance of each frame (expressions, lighting) but lacks temporal consistency, resulting in severe artifacts and identity distortion under profile views.
Self-driven reenactment methods (e.g., GPAvatar): Reenact by constructing a canonical frame from a reference image. While temporally consistent, they fail to faithfully reconstruct real-time dynamic appearances (e.g., specific expressions, lighting changes, tongues, or other details missing from the reference frame).

The paper's core insight is that temporal coherence and faithful reconstruction of frame-by-frame dynamic appearance must be achieved together. Its fusion-based approach leverages the stability of a personalized triplane prior while preserving the dynamic details from frame-by-frame observations.

Method

Overall Architecture

The input consists of a monocular RGB video and a (near-)frontal reference image. A pre-trained, frozen LP3D is used to encode both the reference image and each input frame into a triplane. Two core modules then process them: the Triplane Undistorter removes view-dependent distortions from the raw triplanes, and the Triplane Fuser fuses the undistorted triplanes with the personalized prior triplane to produce final temporally coherent triplanes that retain dynamic appearances. The entire system is trained solely on synthetic data generated by a 3D GAN (Next3D).
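
The data flow described above can be sketched as follows. This is only an illustration of the order of operations: `encode`, `undistort`, and `fuse` are hypothetical callables standing in for the frozen LP3D encoder, the Triplane Undistorter, and the Triplane Fuser, and the per-frame loop ignores the recurrent, multi-frame nature of the RVRT-based Fuser.

```python
def reconstruct_video(frames, reference, encode, undistort, fuse):
    """Run the encode -> undistort -> fuse chain on every frame.

    encode, undistort and fuse are placeholder callables (assumptions),
    not the paper's actual networks.
    """
    # The personalized prior is encoded once from the reference image.
    t_prior = encode(reference)
    outputs = []
    for frame in frames:
        t_raw = encode(frame)                   # raw per-frame triplane
        t_undist = undistort(t_raw, t_prior)    # prior-conditioned correction
        outputs.append(fuse(t_undist, t_prior)) # fused, coherent triplane
    return outputs
```

Encoding the reference once and reusing `t_prior` for every frame reflects the role of the prior as a fixed, personalized anchor for the whole video.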

Key Designs

Triplane Undistorter:
- Function: Corrects geometric distortions in raw triplanes caused by profile-view inputs.
- Mechanism: Based on the SPyNet optical flow architecture, it takes the raw triplane \(T_{raw}\) as the source and the prior triplane \(T_{prior}\) as the condition to predict an undistortion flow field \(T_{flow}\). The undistorted triplane is obtained via a warping operation: \(T_{undist} = Warp(T_{raw}, T_{flow})\). Note that this is not flow-based alignment, but rather a prior-conditioned corrective warping.
- Design Motivation: LP3D exhibits directional distortion and abnormal activations in the triplane under profile inputs (e.g., over-activation on the left side of the triplane when filming the left profile). Direct corrective warping is more efficient than generating the triplane from scratch.
Triplane Fuser:
- Function: Fuses the undistorted triplane with the personalized prior to recover occluded regions and stabilize identity.
- Mechanism: Based on the RVRT (Recurrent Video Restoration Transformer) architecture, it takes \(T_{undist}\), \(T_{prior}\), and their respective visibility triplanes \(T_{vis}\) as inputs. The Fuser explicitly predicts a 3D visibility map, which lets it preserve frame-by-frame dynamic information in visible areas and integrate personalized details (e.g., birthmarks, tattoos) from the prior triplane in occluded regions. The summation skip connection of RVRT is replaced with a convolutional skip connection because the corrections needed for triplane distortions are much larger in scale than the residuals typical of image denoising.
- Design Motivation: Different parts are occluded across different frames, and the frontal reference image typically contains complete facial information from both sides, which effectively compensates for occlusions.
Synthetic Dynamic Multi-View Data Generation:
- Function: Generates training data to bypass the scarcity of real 3D portrait data.
- Mechanism: An expression-controllable 3D GAN (Next3D) is utilized to generate synthetic 3D portrait pairs with diverse expressions. Shoulder rotation augmentation (simulating shoulder movement by warping camera rays during volume rendering) and color-space augmentation (simulating lighting variations) are designed. A frozen LP3D is used to generate pseudo-ground-truth triplanes \(T_{frontalGT}\) from frontal renderings as supervision signals.
- Design Motivation: Next3D cannot control shoulder rotation; ray warping enables shoulder pose diversity in 2D renderings without modifying the triplane.
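
The warping step \(T_{undist} = Warp(T_{raw}, T_{flow})\) can be illustrated with a minimal NumPy sketch of backward warping on a single feature plane. In the paper the flow comes from an SPyNet-based network; here it is simply given as an input. The flow convention (pixel offsets, `dx` then `dy`), bilinear sampling, and border clamping are assumptions made for this example, not details confirmed by the source.

```python
import numpy as np

def warp_plane(plane, flow):
    """Backward-warp one feature plane with a per-pixel flow field.

    plane: (C, H, W) features; flow: (2, H, W) offsets (dx, dy) in pixels.
    Output pixel (y, x) samples the input at (y + dy, x + dx) using
    bilinear interpolation. Conventions here are illustrative assumptions.
    """
    _, h, w = plane.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling coordinates, clamped to the plane borders.
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0  # horizontal interpolation weight
    wy = sy - y0  # vertical interpolation weight
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x1] * wx
    bottom = plane[:, y1, x0] * (1 - wx) + plane[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

A triplane would be handled by calling `warp_plane` once per plane (xy, xz, yz), each with its own flow field. A zero flow leaves the plane unchanged, which is the behaviour one would want for frames that are already undistorted.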

Loss & Training

The total loss is a weighted sum of four terms:

\[L = w_{undist}L_{undist} + w_{vis}L_{vis} + w_{fusion}L_{fusion} + w_{render}L_{render}\]

\(L_{undist}\): L1 loss between the undistorted triplane and the pseudo-ground-truth triplane.
\(L_{vis}\): L1 loss between the predicted visibility triplane and the ground-truth visibility.
\(L_{fusion}\): L1 loss between the fused triplane and the pseudo-ground-truth, with higher weights on occluded regions.
\(L_{render}\): LPIPS perceptual loss between the rendered novel-view image and the ground truth.
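
The loss terms above can be sketched in NumPy as follows. The source does not give the exact weighting scheme, so the occlusion weighting (`occ_weight`, the normalization by the weight sum) and the dictionary-based interface are assumptions for illustration. The LPIPS term is not implemented here and would be supplied as a precomputed value.

```python
import numpy as np

def l1(a, b, weight=None):
    """Mean absolute error, optionally weighted per element."""
    diff = np.abs(a - b)
    if weight is None:
        return float(diff.mean())
    return float((diff * weight).sum() / weight.sum())

def fusion_loss(fused, target, visibility, occ_weight=2.0):
    """L1 with extra weight where visibility is low (occluded regions).

    visibility lies in [0, 1]; occ_weight is a hypothetical value,
    not one reported in the source.
    """
    weight = 1.0 + (occ_weight - 1.0) * (1.0 - visibility)
    return l1(fused, target, weight)

def total_loss(terms, weights):
    """Weighted sum L = sum_k w_k * L_k over the named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())
```

For example, `total_loss({"undist": lu, "vis": lv, "fusion": lf, "render": lr}, w)` reproduces the four-term sum, with `w` holding the corresponding \(w_{undist}\), \(w_{vis}\), \(w_{fusion}\), and \(w_{render}\).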

The Undistorter and Fuser employ three independent but identical networks for the three planes (xy/xz/yz) to prevent collapsing into 2D.

Key Experimental Results

Main Results

Method	Type	Expr↓	ID↓	Overall PSNR↑	Overall LPIPS↓	NVS PSNR↑	NVS LPIPS↓
Li et al.	reenact	0.2657	0.2410	18.57	0.2546	18.20	0.2624
GPAvatar	reenact	0.2041	0.2074	21.95	0.2334	21.95	0.2334
VIVE3D	invert	0.2900	0.3951	18.58	0.2593	18.14	0.2710
LP3D	recon	0.1676	0.2154	22.33	0.2232	21.52	0.2374
Ours	recon	0.1584	0.1865	22.77	0.2189	22.44	0.2240

Ablation Study

Configuration	Overall PSNR↑	Overall LPIPS↓	Input View Variation↓	Novel View Variation↓
LP3D (baseline)	22.33	0.2232	High	High
Only Fuser	Slightly lower	Slightly lower	Medium	Medium
Only Undistorter	Medium	Medium	Low	Low
U + F (Full)	22.77	0.2189	Lowest	Lowest

Key Findings

LP3D overfits to the input view: its PSNR drops by 0.81 dB from Overall to NVS (22.33 vs 21.52). The proposed method narrows this gap to 0.33 dB (22.77 vs 22.44), which indicates more view-consistent reconstructions and supports the claimed gain in coherence.
Reenactment methods (e.g., GPAvatar) cannot capture dynamic appearance details (e.g., tongue protrusion, specific wrinkles) and show a higher expression error than the proposed method (0.2041 for GPAvatar vs 0.1584).
Using the Fuser alone without prior Undistortion yields poor results, demonstrating the necessity of the two-stage design that corrects geometric distortions before fusion.
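
The Overall-to-NVS PSNR gap for each method can be recomputed directly from the main results table. The short script below does only that arithmetic; the dictionary layout and function name are choices made for this example.

```python
# (Overall PSNR, NVS PSNR) copied from the main results table.
PSNR = {
    "Li et al.": (18.57, 18.20),
    "GPAvatar": (21.95, 21.95),
    "VIVE3D": (18.58, 18.14),
    "LP3D": (22.33, 21.52),
    "Ours": (22.77, 22.44),
}

def psnr_gap(overall, nvs):
    """Drop in PSNR (dB) when moving from all views to novel views only."""
    return round(overall - nvs, 2)

gaps = {method: psnr_gap(*scores) for method, scores in PSNR.items()}

for method, gap in gaps.items():
    print(f"{method}: {gap:.2f} dB")
```

A smaller gap means the reconstruction quality depends less on whether the evaluation view coincides with the input view.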

Highlights & Insights

Precise Problem Definition: It clearly articulates for the first time that temporal coherence and dynamic appearance reconstruction must be addressed simultaneously in 3D telepresence.
Rational Design of Multi-View Evaluation Protocol: The \(N \times N\) evaluation matrix covers all input-evaluation view combinations, avoiding the overfitting illusion of single-view evaluations.
Generalization to the Real World via Purely Synthetic Data Training: The model is trained only on synthetic data yet transfers to real videos, which the summary attributes to the carefully designed data augmentations (shoulder rotations, lighting variations).
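
The \(N \times N\) evaluation protocol can be sketched as a matrix of scores over every (input view, evaluation view) pair. The split used below, where "overall" averages all cells and "nvs" averages only the off-diagonal cells, is an assumption about how the two reported numbers relate; the source only states that the matrix covers all input-evaluation view combinations. The `score` callable is a hypothetical stand-in for a metric such as PSNR.

```python
def evaluation_matrix(views, score):
    """Build matrix[i][j] = score(input view i, evaluation view j)."""
    return [[score(i, j) for j in views] for i in views]

def summarize(matrix):
    """Average all cells ("overall") and off-diagonal cells ("nvs").

    Treating the diagonal as the input-view case is an assumption
    made for this sketch.
    """
    n = len(matrix)
    all_cells = [matrix[i][j] for i in range(n) for j in range(n)]
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        "overall": sum(all_cells) / len(all_cells),
        "nvs": sum(off_diag) / len(off_diag),
    }
```

A method that only looks good from its own input view scores high on the diagonal and low elsewhere, so its "overall" and "nvs" averages diverge, which is the effect a single-view evaluation would hide.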

Limitations & Future Work

Reliance on LP3D as a front-end; its performance upper bound restricts the overall quality.
Requirement of a near-frontal reference image; performance may degrade when a high-quality frontal image is unavailable.
Real-time performance is not discussed in detail; the inference speed of the Undistorter and Fuser may affect real-time telepresence.
Shoulder augmentation is achieved via ray warping, which might not cover complex body movements.

The fusion design complements LP3D's "frame-by-frame reconstruction" and GPAvatar's "reference-driven reenactment", combining the strengths of both.
The Triplane Undistorter employs an optical flow architecture for triplane undistortion, serving as an ingenious cross-domain transfer.
The explicit prediction of visibility triplanes provides spatial guidance for the fusion process, which is worth adopting in other triplane fusion tasks.

Rating

Novelty: ⭐⭐⭐⭐ The philosophy of fusing personalized priors with frame-by-frame reconstruction is novel, and the two-stage design of triplane undistortion followed by fusion is rational.
Experimental Thoroughness: ⭐⭐⭐⭐ A new multi-view evaluation protocol is proposed, with comprehensive comparisons against various methods on NeRSemble; however, quantitative evaluations on real-world scenes are lacking.
Writing Quality: ⭐⭐⭐⭐⭐ Clear problem definition, well-justified method motivations and design decisions, and rigorous terminology definition.
Value: ⭐⭐⭐⭐ Significantly drives the practical deployment of 3D telepresence, and the evaluation protocol makes an independent contribution.