
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Conference: CVPR 2026 · arXiv: 2603.05506 · Code: Available · Area: 3D Vision · Keywords: Camera control, portrait video generation, scale-awareness, facial landmarks, video diffusion models

TL;DR

This paper proposes FaceCam, a system that addresses camera control in monocular portrait videos by using facial landmarks as a scale-aware camera representation, thereby avoiding the scale ambiguity inherent in conventional extrinsic camera representations. Two data augmentation strategies—synthetic camera motion and multi-clip stitching—are further designed to support continuous camera trajectory inference.

Background & Motivation

Camera motion control in controllable video generation is a core research problem, and is particularly critical for portrait videos in applications such as social media, post-production, and AR/VR. Existing approaches face two fundamental challenges:

Challenge 1: Scale Ambiguity
  • Scene-agnostic parameter representations (e.g., Plücker rays, extrinsic matrices): the same parameter variation produces drastically different visual effects at different scene scales.
  • Monocular video cannot determine absolute depth; the scene is recoverable only up to a global similarity transformation (unknown scale and translation).
  • Mathematically: for any \(\alpha > 0\), setting \(\mathbf{x}' = \alpha\mathbf{x}\) and \(\mathbf{t}' = \alpha\mathbf{t}\) leaves the perspective projection invariant.

Challenge 2: Training Data
  • Acquiring paired videos of the same dynamic portrait scene under different camera trajectories is extremely difficult.
  • Synthetic 3D dynamic portrait data rarely achieves photorealism.

Core Motivation: A camera representation is needed that does not expose the unobservable global scale, while generalizing from limited static multi-view data to continuous camera trajectories.

Method

Overall Architecture

Training Phase:
  1. Facial landmarks are extracted from the anchor frame of the target video as the camera condition.
  2. The source video, target video, and camera conditions are each encoded into latents via a VAE.
  3. These latents are fed into a diffusion Transformer to predict the target latent, optimized with a flow-matching loss.

Inference Phase:
  1. FaceLift is used to generate a generic 3D Gaussian head model (identity-agnostic; shared across all experiments).
  2. The proxy head is rendered along the target camera trajectory.
  3. MediaPipe detects per-frame facial landmarks as the camera condition (see the sketch after this list).
  4. The diffusion Transformer generates the camera-controlled video.
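
To make the pipeline concrete, here is a minimal sketch assuming the proxy-head frames have already been rendered along the target trajectory; it uses MediaPipe FaceMesh for landmark detection, while `render_proxy_head` and `generate_video` are hypothetical stand-ins for the FaceLift renderer and the fine-tuned diffusion Transformer.

```python
# Sketch of the inference-time conditioning pipeline. Assumptions: proxy-head
# frames are already rendered along the target trajectory; `render_proxy_head`
# and `generate_video` are hypothetical stand-ins.
import cv2
import numpy as np
import mediapipe as mp

def landmark_condition_maps(proxy_frames, out_hw=(480, 480)):
    """Detect facial landmarks on each rendered proxy frame and rasterize
    them into pixel-space conditioning images."""
    h, w = out_hw
    maps = []
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True) as mesh:
        for frame in proxy_frames:  # frame: HxWx3 uint8, BGR
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            cond = np.zeros((h, w, 3), dtype=np.uint8)
            if result.multi_face_landmarks:
                for lm in result.multi_face_landmarks[0].landmark:
                    u, v = int(lm.x * w), int(lm.y * h)  # normalized -> pixels
                    if 0 <= u < w and 0 <= v < h:
                        cv2.circle(cond, (u, v), 2, (255, 255, 255), -1)
            maps.append(cond)
    return maps

# Hypothetical usage:
# proxy_frames = render_proxy_head(trajectory)      # FaceLift 3DGS proxy render
# cond_maps = landmark_condition_maps(proxy_frames)
# video = generate_video(source_video, cond_maps)   # fine-tuned diffusion Transformer
```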

Key Designs

1. Scale-Aware Camera Representation via Facial Landmarks

Theoretical Basis: Classical multi-view geometry establishes that point correspondences in image space are sufficient to characterize relative camera motion. Given 7+ 2D correspondences, the fundamental matrix \(F\) can be estimated, from which the essential matrix \(E = \mathbf{K}^\top F \mathbf{K}\) and relative pose \([\mathbf{R}|\mathbf{t}]\) are recoverable up to a global scale.
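
As a concrete illustration of this two-view relation (not part of the paper's pipeline), the sketch below uses OpenCV to estimate the essential matrix from 2D correspondences and decompose it into a relative pose; the translation direction is recovered, but its magnitude is unobservable.

```python
# Sketch: relative pose from 2D correspondences via epipolar geometry
# (illustrative only; translation is recovered only up to an unknown scale).
import numpy as np
import cv2

def relative_pose_from_correspondences(uv1, uv2, K):
    """uv1, uv2: (m, 2) matched pixel coordinates in two views; K: (3, 3) intrinsics."""
    uv1 = np.asarray(uv1, dtype=np.float64)
    uv2 = np.asarray(uv2, dtype=np.float64)
    E, inliers = cv2.findEssentialMat(uv1, uv2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, uv1, uv2, K, mask=inliers)
    return R, t / np.linalg.norm(t)  # direction only; the scale of t is unobservable
```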

Implementation:
  • \(m\) facial landmarks are detected in the anchor frame with 3D positions \(\mathbf{X} = \{\mathbf{x}_k\}_{k=1}^m\).
  • These are projected to 2D pixel coordinates \(\mathbf{U} = \{\mathbf{u}_k\}_{k=1}^m\) according to the target camera pose.
  • The landmarks are rasterized into a pixel-space image that serves as the conditioning signal (see the sketch below).
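
A minimal sketch of this conditioning construction, assuming a pinhole camera model (the function names, image size, and dot radius are illustrative, not the paper's exact settings):

```python
# Minimal sketch: project the anchor frame's 3D landmarks into the target view
# and rasterize them as the conditioning image (pinhole model; sizes illustrative).
import numpy as np
import cv2

def project_landmarks(X, K, R, t):
    """X: (m, 3) 3D landmarks; K: (3, 3) intrinsics; R, t: target camera pose.
    Returns (m, 2) pixel coordinates."""
    cam = R @ X.T + t.reshape(3, 1)   # camera-frame points, shape (3, m)
    uvw = K @ cam                     # homogeneous pixel coordinates
    return (uvw[:2] / uvw[2]).T       # perspective division

def rasterize(uv, hw=(480, 480), radius=2):
    """Draw the projected landmarks into a blank pixel-space conditioning map."""
    img = np.zeros((*hw, 3), dtype=np.uint8)
    for u, v in uv.astype(int):
        if 0 <= u < hw[1] and 0 <= v < hw[0]:
            cv2.circle(img, (int(u), int(v)), radius, (255, 255, 255), -1)
    return img

# Scale invariance in practice: scaling the landmarks and translation together
# leaves the projections unchanged, e.g.
# np.allclose(project_landmarks(X, K, R, t), project_landmarks(2 * X, K, R, 2 * t))
```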

Scale Invariance Proof: Under simultaneous scaling of 3D landmarks and translation (\(\mathbf{x}_k' = s\mathbf{x}_k\), \(\mathbf{t}' = s\mathbf{t}\)), the 2D projections remain invariant:

\[\mathbf{u}_k' = \mathcal{N}(\mathbf{K}(\mathbf{R}\mathbf{x}_k' + \mathbf{t}')) = \mathcal{N}(\mathbf{K}s(\mathbf{R}\mathbf{x}_k + \mathbf{t})) = \mathbf{u}_k\]

Because \(\mathcal{N}(\cdot)\) denotes the perspective division (normalization by the last homogeneous coordinate), the common factor \(s\) cancels and the projected landmarks are identical.

Sufficiency: Given 3D landmarks and their 2D projections, a PnP solver recovers camera rotation and translation up to a global scale.
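
As a sanity check of this sufficiency argument, a PnP solver such as OpenCV's can recover the pose from the landmark correspondences (shown here purely for illustration; the recovered translation is meaningful only relative to the scale of the 3D landmarks):

```python
# Sketch: recover the camera pose from 3D landmarks and their 2D projections
# with a PnP solver (OpenCV shown for illustration; the translation is only
# meaningful relative to the scale of the 3D landmarks).
import numpy as np
import cv2

def recover_pose(X, uv, K):
    """X: (m, 3) 3D landmarks; uv: (m, 2) pixel projections; K: (3, 3) intrinsics.
    Returns the rotation matrix R and translation vector t."""
    ok, rvec, tvec = cv2.solvePnP(
        X.astype(np.float64), uv.astype(np.float64), K.astype(np.float64),
        distCoeffs=None, flags=cv2.SOLVEPNP_ITERATIVE)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3)
```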

User-Friendliness: The rasterized landmark map provides a direct preview of the target viewpoint, making camera control intuitively interpretable.

2. Training Data Generation Strategies

The system is built on the NeRSemble dataset (425 identities, 16 synchronized views, ~9.4K videos) supplemented by ~800 in-the-wild monocular videos.

| Strategy | Method | Problem Addressed |
| --- | --- | --- |
| Scale + color augmentation | Random scaling in [0.75, 1.25]; foreground segmentation + random background color | Increases data diversity |
| Synthetic camera motion | Simulated zoom (scale interpolation) and pan (crop offset interpolation) | Introduces dynamic cameras, but limited to parallel motion |
| Multi-clip stitching | Random concatenation of 1–4 clips from different camera positions | Introduces camera rotation (discrete pose changes) |
| In-the-wild data supplement | Synthetic camera motion applied to monocular videos | Mitigates overfitting to studio lighting |
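
A minimal sketch of the synthetic camera motion strategy, simulating zoom via an interpolated crop scale and pan via an interpolated crop offset (the parameters below are illustrative, not the paper's exact settings):

```python
# Sketch of synthetic camera motion: zoom = interpolated crop scale,
# pan = interpolated crop offset (parameters are illustrative).
import numpy as np
import cv2

def synthetic_zoom_pan(frames, scale_end=0.8, offset_end=(40, 0)):
    """Apply a smooth zoom/pan to a static-camera clip by cropping and resizing.
    frames: list of HxWx3 arrays. Returns a clip with pseudo camera motion."""
    h, w = frames[0].shape[:2]
    out = []
    for i, frame in enumerate(frames):
        a = i / max(len(frames) - 1, 1)       # interpolation factor in [0, 1]
        s = (1 - a) * 1.0 + a * scale_end     # crop scale (zoom)
        dx, dy = int(a * offset_end[0]), int(a * offset_end[1])  # crop offset (pan)
        ch, cw = int(h * s), int(w * s)
        y0 = int(np.clip((h - ch) // 2 + dy, 0, h - ch))
        x0 = int(np.clip((w - cw) // 2 + dx, 0, w - cw))
        crop = frame[y0:y0 + ch, x0:x0 + cw]
        out.append(cv2.resize(crop, (w, h)))  # back to the original resolution
    return out
```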

Key Finding: Although training includes only discrete camera pose changes (via multi-clip stitching), the model generalizes to continuous camera trajectory inference.

Loss & Training

  • Built on the Wan open-source video foundation model with a flow-matching loss.
  • Source video latents are concatenated with noisy latents via frame conditioning.
  • Camera condition latents are injected via channel conditioning.
  • Only 3D attention layers and projection layers are fine-tuned (following ReCamMaster).
  • Training: 24 NVIDIA A100 GPUs, 3K steps, learning rate 5e-5, batch size 24.
  • Total training data: ~9.1K videos (8.9K NeRSemble + ~200 in-the-wild).
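
For readers unfamiliar with flow matching, the snippet below sketches a common rectified-flow form of the loss; it is a generic illustration, and the actual Wan-based training code, noise schedule, and conditioning interface may differ.

```python
# Generic rectified-flow / flow-matching loss sketch (illustrative; the actual
# Wan-based training code and conditioning interface may differ).
import torch

def flow_matching_loss(model, x0, cond):
    """x0: clean target latents (B, ...); cond: conditioning latents.
    The model is trained to predict the velocity from data toward noise."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)    # per-sample timestep in [0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * noise                 # linear interpolation path
    target_velocity = noise - x0                     # d x_t / d t along the path
    pred = model(x_t, t, cond)                       # velocity prediction
    return torch.nn.functional.mse_loss(pred, target_velocity)
```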

Key Experimental Results

Main Results

Table 1: Static Camera Evaluation on the Ava-256 Dataset

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | ArcFace↑ |
| --- | --- | --- | --- | --- |
| ReCamMaster | 9.73 | 0.557 | 0.581 | 0.701 |
| TrajectoryCrafter | 10.32 | 0.546 | 0.567 | 0.522 |
| FaceCam* (generic head) | 9.83 | 0.582 | 0.549 | 0.807 |
| FaceCam | 15.85 | 0.721 | 0.252 | 0.857 |

Table 2: Dynamic Camera Evaluation on In-the-Wild Videos (100 videos, 10 motion types)

| Method | Camera Accuracy | ArcFace | Quality | Aesthetics | Subject Consistency | Background Consistency |
| --- | --- | --- | --- | --- | --- | --- |
| ReCamMaster | 83% | 78.92 | 69.05 | 55.85 | 93.26 | 93.02 |
| TrajectoryCrafter | 99% | 49.79 | 71.37 | 55.76 | 92.23 | 92.25 |
| FaceCam (w/o wild) | 100% | 77.73 | 70.71 | 55.73 | 94.52 | 95.16 |
| FaceCam | 97% | 83.94 | 73.49 | 59.91 | 94.77 | 94.98 |

Ablation Study

Training Data Ablation (last two rows of Table 2):
  • NeRSemble only: camera control is near-perfect (100%), but identity preservation (77.73) and image quality (70.71) are lower.
  • Adding in-the-wild videos: identity preservation improves substantially (83.94) and image quality improves (73.49), with a marginal drop in camera accuracy (97%).

Key Findings

  1. FaceCam substantially outperforms baselines in PSNR (15.85 vs. 10.32), demonstrating that scale-aware representation is critical for precise camera control.
  2. ReCamMaster fails under large viewpoint changes, with scale ambiguity causing the head to move out of frame.
  3. TrajectoryCrafter suffers from geometric errors in dynamic point cloud estimation, causing facial distortion (ArcFace: 0.522).
  4. Facial landmark conditioning encodes not merely facial position, but camera pose and scale decoupled from head motion.
  5. Training on discrete camera changes generalizes to continuous trajectories—an unexpected and practically valuable finding.
  6. The generic 3D head model (FaceCam*) underperforms full FaceCam (which uses ground-truth landmarks), yet still surpasses baselines in identity preservation.

Highlights & Insights

  1. Elegant theoretical foundation: The camera representation is derived from first principles of multi-view geometry, with mathematically guaranteed scale invariance.
  2. Practical inference pipeline: Target trajectories are rendered using a generic 3D head model followed by landmark detection, requiring no input-specific 3D reconstruction.
  3. Data efficiency: State-of-the-art results are achieved with only ~9.1K videos and 3K training steps, far fewer than typically required.
  4. Unexpected generalization from multi-clip stitching: Generalization from discrete pose changes to continuous trajectories is a significant empirical finding.

Limitations & Future Work

  1. Robustness depends on facial landmark detection; extreme profile views or heavy occlusions may degrade performance.
  2. The generic proxy head model disregards variation in actual head shape across input videos.
  3. Validation is limited to single-person scenes; camera control for multi-person portraits is not addressed.
  4. The approach is restricted to portrait videos and does not generalize to camera control in arbitrary scenes.
  5. Only ~200 in-the-wild videos are used for training; incorporating more may further improve generalization.

Comparisons & Connections

  • vs. ReCamMaster: ReCamMaster uses scene-agnostic extrinsic conditioning and is susceptible to scale ambiguity; FaceCam uses a scene-aware landmark representation.
  • vs. TrajectoryCrafter: The latter relies on 3D point cloud reconstruction and inpainting, where geometric errors are amplified into facial distortions.
  • Similarity to ControlNet: Camera conditions are likewise injected via image channels, though here landmark maps encode geometric transformations rather than structural appearance.
  • Complementary to NeRF/3DGS methods: No per-instance optimization is required; generation is achieved in a single forward pass.
  • Broader inspiration: Using facial landmarks as proxies for geometric correspondences is a paradigm extensible to hands, bodies, and other anatomical structures with stable keypoints.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Scale-aware camera representation is theoretically elegant; the discrete-to-continuous generalization finding is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Dual evaluation on studio and in-the-wild data, though additional ablations (e.g., number of landmarks, base model selection) are absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical derivations are clear, problem formulation is precise, and figures are intuitive.
  • Value: ⭐⭐⭐⭐⭐ — A practical and training-efficient solution to portrait video camera control with strong empirical performance.