
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Conference: ICCV 2025 arXiv: 2411.17048 Code: https://personalvideo.github.io/ Area: Image Generation Keywords: Video customization, identity preservation, reward supervision, T2V generation, semantic consistency

TL;DR

This paper proposes PersonalVideo, a framework that applies hybrid reward supervision—comprising an Identity Consistency Reward (ICR) and a Semantic Consistency Reward (SCR)—directly to generated videos. This approach eliminates the distribution gap between T2I fine-tuning and T2V inference inherent in conventional methods, achieving high identity fidelity while preventing degradation of motion dynamics and semantic alignment.

Background & Motivation

Text-to-video (T2V) generation has seen substantial progress, yet identity-specific human video generation remains immature. The core objective is to generate diverse videos featuring a specific individual across varied actions, scenes, and styles, given only a small set of reference images, while maintaining high identity fidelity (ID fidelity).

Limitations of Prior Work

Existing video identity customization methods (e.g., MagicMe, DreamBooth for video) largely follow the image customization paradigm: fine-tuning a T2I model on reference images to inject identity, then transferring the customized prior into a T2V model for inference. This strategy introduces a fundamental tension—the tuning-inference gap:

Distribution mismatch: The prior distributions of T2I and T2V models are inherently misaligned. Learning identity via static image reconstruction on a T2I model significantly shifts the video prior of the T2V model, causing generated videos to become nearly static (dynamic degradation) and unable to follow text prompts (semantic degradation).

Insufficient identity fidelity: Because fine-tuning is performed on images while inference targets videos, the distribution gap also undermines the effectiveness of identity injection. The human visual system is highly sensitive to facial features, demanding greater consistency.

Excessive data requirements: To inject identity while preserving dynamics, conventional methods typically require multiple reference images or even additional video inputs, imposing significant inconvenience on users.

Starting Point

The authors' core insight is: rather than performing image-space reconstruction, reward supervision should be applied directly to videos generated by the T2V model. This offers two advantages: training and inference both reside in the video domain, fundamentally eliminating the tuning-inference gap; and multiple reward signals can simultaneously optimize identity fidelity alongside semantic and dynamic preservation.

Method

Overall Architecture

The training pipeline of PersonalVideo proceeds as follows: starting from pure noise, the target T2V model generates videos, upon which two rewards are simultaneously applied—the Identity Consistency Reward (ICR) and the Semantic Consistency Reward (SCR). During optimization, Simulated Prompt Augmentation is employed, randomly sampling from LLM-generated diverse prompts for training. The learnable module adopts an Isolated Identity Adapter that injects identity only during the later denoising steps.

Key Designs

  1. Identity Consistency Reward (ICR):

    • Function: Aligns the facial identity of generated video characters with the reference image.
    • Mechanism: A pretrained face recognition model \(\mathcal{R}_{id}\) extracts facial ID embeddings from both the reference image and randomly sampled frames of the generated video, and the negative cosine similarity between them is minimized (i.e., identity similarity is maximized): \(\mathcal{L}_{\text{ICR}} = -\,\mathbb{E}_{i, c \sim p(c)} \left[ \text{CosSim}\left(\mathcal{R}_{id}(I_{ref}), \mathcal{R}_{id}(G_{\mathcal{T}}(z_T, c, i))\right) \right]\) where \(G_{\mathcal{T}}\) denotes the target T2V model (with VAE decoder), \(c\) is a text prompt containing the identity keyword, and \(i\) indexes the sampled frames.
    • Design Motivation: Unlike reconstruction objectives that require paired reference images, ICR evaluates identity similarity directly on generated videos, fully aligning with the inference-time distribution. Face cropping and color jitter augmentations are applied during training to improve robustness under limited reference images (see the ICR sketch after this list).
  2. Semantic Consistency Reward (SCR):

    • Function: Preserves the semantic distribution of the original T2V model, preventing dynamic and semantic degradation.
    • Mechanism: A semantic reward model \(\mathcal{R}_{sem}\) evaluates image–text correspondence scores for frames generated by both the original and target models. The scores are normalized into probability distributions and aligned via KL divergence: \(V_c^{\mathcal{S}} = \text{Softmax}(\{\mathcal{R}_{sem}(G_{\mathcal{S}}(z_T, c, i))\}_{i=1}^{M})\) \(V_c^{\mathcal{T}} = \text{Softmax}(\{\mathcal{R}_{sem}(G_{\mathcal{T}}(z_T, c, i))\}_{i=1}^{M})\) \(\mathcal{L}_{\text{SCR}} = \mathbb{E}_{c \sim p(c)} D_{KL}(V_c^{\mathcal{T}} \,\|\, V_c^{\mathcal{S}})\) where \(G_{\mathcal{S}}\) is the frozen original model and \(M\) is the number of sampled frames.
    • Design Motivation: Identity injection inevitably introduces distributional shift (due to the use of limited static images). By aligning semantic distributions rather than directly constraining pixels, SCR preserves the original model's dynamic and semantic capabilities without interfering with identity injection (see the SCR sketch after this list).
  3. Simulated Prompt Augmentation:

    • Function: Decouples training prompts from the reference images so that training conditions match the diverse prompts encountered at test time.
    • Mechanism: An LLM generates 50 diverse prompts unrelated to the reference image (e.g., "V playing violin," "V smiling on the beach"), one of which is randomly sampled at each training step. Because the framework performs no reconstruction, it is not confined to prompts describing the reference image, unlike conventional reconstruction-based methods, and can cover arbitrary semantic scenarios.
    • Design Motivation: Being independent of reference images and unconstrained by their quantity, this strategy effectively mitigates overfitting and maintains strong robustness even with a single reference image.
  4. Isolated Identity Adapter:

    • Function: Injects identity information only during the later denoising steps.
    • Mechanism: Observations reveal that during video denoising, character motion is formed in early steps while appearance details are recovered in later steps. Accordingly, a LoRA-style low-rank adapter is activated only in the final quarter of denoising steps: \(\tilde{W} = W + \Delta W = W + A^{\text{down}} A^{\text{up}}\)
    • Design Motivation: By not intervening in motion generation during the early steps, the approach maximally preserves the dynamic properties of the original video prior (a timestep-gated sketch of the adapter follows after this list).
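
A minimal PyTorch-style sketch of the ICR term, assuming a generic frozen face-recognition encoder; the `face_encoder` wrapper and the frame-sampling details are illustrative stand-ins, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def icr_loss(face_encoder, ref_image, video_frames, num_sampled=4):
    """Identity Consistency Reward (sketch).

    face_encoder : frozen face-recognition network mapping (B, 3, H, W)
                   images to ID embeddings of shape (B, D).
    ref_image    : reference face crop, shape (3, H, W).
    video_frames : decoded frames of the generated video, shape (T, 3, H, W),
                   carrying gradients back into the T2V model's adapter.
    """
    # Score a random subset of frames rather than the whole clip.
    idx = torch.randperm(video_frames.shape[0])[:num_sampled]
    frames = video_frames[idx]

    ref_emb = F.normalize(face_encoder(ref_image.unsqueeze(0)), dim=-1)  # (1, D)
    gen_emb = F.normalize(face_encoder(frames), dim=-1)                  # (k, D)

    # Maximizing cosine similarity == minimizing its negative.
    cosine = (gen_emb * ref_emb).sum(dim=-1)                             # (k,)
    return -cosine.mean()
```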
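A corresponding sketch of the SCR term, with a generic per-frame image–text scorer standing in for HPSv2; the `semantic_scorer` interface is an assumption:

```python
import torch

def scr_loss(semantic_scorer, frames_source, frames_target, prompt, eps=1e-8):
    """Semantic Consistency Reward (sketch).

    semantic_scorer : frozen scorer returning a scalar image-text score per
                      frame for the given prompt (stand-in for HPSv2).
    frames_source   : M frames from the frozen original T2V model, (M, 3, H, W).
    frames_target   : M frames from the tuned target model, (M, 3, H, W).
    """
    scores_src = torch.stack([semantic_scorer(f, prompt) for f in frames_source])  # (M,)
    scores_tgt = torch.stack([semantic_scorer(f, prompt) for f in frames_target])  # (M,)

    # Normalize per-frame scores into distributions over the M frames.
    p_src = torch.softmax(scores_src.detach(), dim=0)   # frozen source: no gradient
    p_tgt = torch.softmax(scores_tgt, dim=0)

    # KL(V^T || V^S), written out explicitly so gradients reach the target side.
    return (p_tgt * (torch.log(p_tgt + eps) - torch.log(p_src + eps))).sum()
```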
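Finally, a sketch of the Isolated Identity Adapter as a timestep-gated LoRA layer; the rank, step count, and which layers it wraps are assumptions for illustration:

```python
import torch.nn as nn

class IsolatedLoRALinear(nn.Module):
    """LoRA-style adapter that only contributes in the late denoising steps."""

    def __init__(self, base_linear: nn.Linear, rank=16, total_steps=50,
                 active_fraction=0.25):
        super().__init__()
        self.base = base_linear
        self.base.requires_grad_(False)                                   # frozen W
        self.down = nn.Linear(base_linear.in_features, rank, bias=False)  # A^down
        self.up = nn.Linear(rank, base_linear.out_features, bias=False)   # A^up
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op
        self.first_active_step = int(total_steps * (1.0 - active_fraction))
        self.current_step = 0                      # updated externally by the sampler

    def forward(self, x):
        out = self.base(x)                                 # W x
        if self.current_step >= self.first_active_step:    # only the last 1/4 of steps
            out = out + self.up(self.down(x))              # + (A^up A^down) x
        return out
```

Because `current_step` is set by the sampling loop, the adapter naturally stays dormant while motion is being formed in the early steps.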

Loss & Training

The overall training objective is the simple sum of the two reward losses:

\[\mathcal{L}_{\text{train}} = \mathcal{L}_{\text{ICR}} + \mathcal{L}_{\text{SCR}}\]

ResNet-100 (pretrained on Glint360K) serves as the identity reward model, and HPSv2 serves as the semantic reward model. Effectiveness is validated on both the DiT-based HunyuanVideo and the UNet-based AnimateDiff.
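
Putting the pieces together, a hedged sketch of one optimization step over the adapter parameters only, reusing the `icr_loss` and `scr_loss` sketches above; the model wrappers, latent shape, and prompt bank are hypothetical:

```python
import random
import torch

def training_step(target_t2v, source_t2v, optimizer, face_encoder,
                  semantic_scorer, ref_image, prompt_bank):
    """One step of the hybrid reward objective; `optimizer` is built over the
    adapter parameters only, everything else stays frozen."""
    prompt = random.choice(prompt_bank)          # simulated prompt augmentation
    noise = torch.randn(1, 4, 16, 64, 64)        # latent shape is an assumption

    frames_tgt = target_t2v(noise, prompt)       # gradients reach the adapter
    with torch.no_grad():
        frames_src = source_t2v(noise, prompt)   # frozen original model

    loss = icr_loss(face_encoder, ref_image, frames_tgt) \
         + scr_loss(semantic_scorer, frames_src, frames_tgt, prompt)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```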

Key Experimental Results

Main Results

| Method | Face Sim.↑ | Dyna. Deg.↑ | FVD↓ | T. Cons.↑ | CLIP-T↑ | CLIP-I↑ |
|---|---|---|---|---|---|---|
| DreamBooth | 42.62 | 13.86 | 1325.89 | 0.9919 | 26.26 | 44.27 |
| MagicMe | 50.51 | 11.88 | 1336.73 | 0.9928 | 25.48 | 73.03 |
| IDAnimator | 43.88 | 14.33 | 1538.44 | 0.9912 | 24.33 | 50.23 |
| ConsisID | 53.22 | 15.22 | 1622.21 | 0.9923 | 25.39 | 74.58 |
| PersonalVideo | 62.35 | 17.80 | 1272.32 | 0.9935 | 26.30 | 76.48 |

PersonalVideo achieves a substantial lead in face similarity (62.35 vs. 53.22), the highest dynamics score (17.80 vs. 15.22), and the lowest FVD, demonstrating that generated videos are both identity-faithful and visually natural.

Ablation Study

| Configuration | Face Sim.↑ | CLIP-T↑ | Dynamic↑ | Note |
|---|---|---|---|---|
| T2I w/o Aug | 51.56 | 22.40 | 16.30 | Trained on T2I; tuning-inference gap present |
| T2V w/o Aug | 60.26 | 25.50 | 17.20 | Trained on T2V; large Face Sim. gain |
| T2V w/ Aug | 61.05 | 28.59 | 17.85 | + Prompt augmentation; CLIP-T improves by 3+ |
| w/o SCR | 61.08 | 26.38 | 13.22 | Without SCR; severe dynamic degradation |
| w/ SCR | 61.05 | 28.59 | 17.85 | + SCR; dynamics improve from 13 to 18 |
| All-step injection | 62.37 | 26.95 | 13.93 | Identity injected at all steps; dynamic degradation |
| 1/4-step injection | 63.90 | 27.47 | 18.00 | Injected only in last 1/4 steps; best dynamics and Face Sim. |

Key Findings

  • Training directly on the T2V model improves Face Similarity by nearly 9 points over T2I-based training (51.56→60.26).
  • SCR is critical for preserving dynamics: without SCR, the dynamics score drops to 13.22; with SCR, it recovers to 17.85.
  • Isolated injection (last 1/4 steps) not only improves dynamics (13.93→18.00) but also marginally enhances Face Similarity.
  • In user studies, PersonalVideo significantly outperforms competing methods across four dimensions: identity fidelity, text alignment, dynamics, and overall quality.

Highlights & Insights

  • Non-reconstruction reward training paradigm: The paper departs from the conventional "reconstruct reference image → inject identity" paradigm by applying reward feedback directly to generated videos, fundamentally eliminating the tuning-inference gap.
  • Elegant SCR design: Rather than directly constraining semantic content, SCR aligns the distribution of semantic scores, maintaining flexibility while avoiding interference with identity injection.
  • Isolated identity injection exploits the temporal structure of the denoising process (early steps = motion, late steps = appearance), yielding a simple yet effective design.
  • Compatibility with community LoRAs (e.g., cartoon or traditional art styles) provides substantial flexibility for practical applications.

Limitations & Future Work

  • The method cannot generate videos featuring multiple customized identities simultaneously, limited by the underlying T2V model's capabilities.
  • Performance is contingent on the quality and capacity of the base T2V model.
  • The choice of reward models may affect final results; the paper does not provide an in-depth comparison across different reward model options.
  • Future work could explore decoupled attention maps to support multi-identity customization.
  • The approach shares conceptual connections with encoder-based methods such as PuLID, both employing ID loss rather than reconstruction loss; PersonalVideo extends this paradigm to the video domain.
  • The distributional alignment idea underlying SCR is transferable to other fine-tuning scenarios that require preserving a model's original capabilities.
  • The simulated prompt augmentation strategy is worth adopting in other customization tasks.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐