MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization¶
- Conference: ICCV 2025
- arXiv: 2503.12689
- Code: echopluto.github.io/MagicID-project
- Area: Video Generation / Preference Alignment
- Keywords: Video Customization, Identity Consistency, Preference Optimization, Diffusion Models, Hybrid Sampling
TL;DR¶
This paper proposes MagicID, a framework that constructs hybrid video pair data capturing identity and dynamic preferences, and designs a two-stage Hybrid Preference Optimization (HPO) training strategy. MagicID is the first work to apply DPO to identity-customized video generation, simultaneously addressing identity degradation and motion weakening caused by conventional self-reconstruction training.
Background & Motivation¶
Problem Definition¶
Video identity customization aims to generate high-fidelity videos that preserve a specific identity with significant dynamics, given a small number of user-provided reference images. Compared to image customization, the core challenge in video customization lies in the fact that reference inputs are static images rather than videos.
Limitations of Prior Work¶
Existing methods (MagicMe, DreamBooth) follow the self-reconstruction training paradigm from image customization—learning to reconstruct reference images to preserve identity—which introduces two severe problems:
Identity degradation worsens with increasing frame count: Reference images are inherently single frames, and there exists an intrinsic temporal resolution gap between them and multi-frame videos. Self-reconstruction training cannot bridge this domain shift, causing identity consistency to degrade noticeably as more frames are generated.
Dynamics weaken as training progresses: Since self-reconstruction only reconstructs static images, it drives the model to generate increasingly "static" videos. Experiments show that as customization training steps increase, the dynamic degree of generated videos continuously declines.
Core Idea¶
Replace self-reconstruction training with preference optimization: construct "good/bad" video pairs and let the model directly learn to generate identity-consistent and motion-rich videos. A hybrid sampling strategy resolves the conflict between identity and dynamic objectives in a stage-wise manner.
Method¶
Overall Architecture¶
MagicID consists of three stages:
1. Initial LoRA fine-tuning (1,000 steps, conventional self-reconstruction)
2. Preference data construction (video generation + reward evaluation + hybrid pair selection)
3. Hybrid Preference Optimization training (5,000 steps, HPO loss)
The backbone model is HunyuanVideo (a T2V DiT model).
Key Designs¶
1. Preference Video Data Generation¶
- Function: Construct video pairs exhibiting differences in identity consistency and dynamic degree
- Mechanism: Build a base video pool \(\mathcal{B}\) from three sources:
- \(V_t\): Videos generated by the LoRA fine-tuned model (carry identity information but may be imperfect)
- \(V_s\): Videos generated by the original T2V model (without LoRA, preserving the original dynamic distribution)
- \(V_{id}\): Static videos expanded from reference images (perfect identity but zero dynamics)
- Design Motivation: Pairing only generated samples limits the model's ability to learn identity information unseen in reference images; static videos serve as identity "anchors."
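The pool construction above can be sketched in a few lines. This is a minimal sketch: the `VideoEntry` container and function names are hypothetical, and frames are kept as opaque objects rather than real video tensors.

```python
from dataclasses import dataclass

@dataclass
class VideoEntry:
    frames: list   # frame objects (arrays in practice); opaque placeholders here
    source: str    # "lora" (V_t), "base" (V_s), or "static" (V_id)

def expand_reference_image(image, num_frames: int = 61) -> VideoEntry:
    """Turn a single reference image into a zero-dynamics static video (V_id)."""
    return VideoEntry(frames=[image] * num_frames, source="static")

def build_base_pool(lora_videos, base_videos, reference_images, num_frames=61):
    """Assemble the base video pool B = V_t ∪ V_s ∪ V_id."""
    pool = [VideoEntry(v, "lora") for v in lora_videos]       # V_t
    pool += [VideoEntry(v, "base") for v in base_videos]      # V_s
    pool += [expand_reference_image(img, num_frames)          # V_id
             for img in reference_images]
    return pool
```

The static entries contribute perfect-identity "anchors" to later pair selection, exactly because their dynamics are zero by construction.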
2. Customized Video Reward¶
- Function: Evaluate video quality along three dimensions
- Mechanism:
- \(R_{id}\) (Identity Consistency): Computes facial similarity to reference images using a pretrained ArcFace encoder
- \(R_{dy}\) (Dynamic Degree): Analyzes inter-frame motion intensity using the RAFT optical flow model
- \(R_{sem}\) (Semantic Alignment): Evaluates semantic correspondence between video and text prompt using a VLM
- All scores are normalized to a 1–10 scale
- Design Motivation: Identity, dynamics, and semantics are mutually constrained objectives that must be explicitly quantified to enable meaningful preference pairing.
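A minimal sketch of the reward computation, assuming raw ArcFace similarity and VLM alignment scores lie in [0, 1] and mean optical-flow magnitude is clipped to an assumed [0, 20] range. The `*_fn` callables and all ranges are placeholders standing in for the real ArcFace, RAFT, and VLM components.

```python
def normalize_to_scale(x, lo, hi, out_lo=1.0, out_hi=10.0):
    """Min-max normalize a raw score into the paper's 1-10 range, with clamping."""
    x = min(max(x, lo), hi)
    return out_lo + (x - lo) * (out_hi - out_lo) / (hi - lo)

def score_video(video, id_sim_fn, flow_mag_fn, vlm_align_fn, prompt, ref_images):
    """Compute (R_id, R_dy, R_sem). The three callables stand in for ArcFace
    face similarity, RAFT mean flow magnitude, and a VLM judge, respectively."""
    r_id  = normalize_to_scale(id_sim_fn(video, ref_images), 0.0, 1.0)
    r_dy  = normalize_to_scale(flow_mag_fn(video),           0.0, 20.0)
    r_sem = normalize_to_scale(vlm_align_fn(video, prompt),  0.0, 1.0)
    return r_id, r_dy, r_sem
```

Putting all three rewards on a common 1–10 scale is what makes cross-dimension comparisons during pair selection meaningful.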
3. Hybrid Pair Selection¶
- Function: Construct preference video pairs in two stages
- Mechanism:
- Stage 1 (Identity-First): Select video pairs \(P_{id}\) from \(V_{id}\) and \(V_s\) with large identity consistency gaps, tolerating dynamic degree differences
- Stage 2 (Dynamic-First): Apply Pareto frontier sampling from \(V_s\) and \(V_t\) to select dynamic preference pairs \(P_{dy}\)
- Use non-dominated sorting to identify upper/lower Pareto frontiers
- Sort by identity consistency gap and retain the Top-100 pairs
- Final set \(P = P_{dy} \cup P_{id}\)
- Design Motivation: Simultaneously pursuing identity and dynamics creates conflicting objectives; Pareto frontier sampling ensures that selected pairs are discriminative along both dimensions.
Loss & Training¶
Hybrid Preference Optimization (HPO) loss:
- High-dimensional video sequence probabilities are converted to noise prediction error comparisons via the ELBO and Jensen's inequality.
- Backbone: HunyuanVideo + LoRA; optimizer: AdamW; learning rate: 2e-5.
- Total training: 6,000 steps (1,000 self-reconstruction + 5,000 HPO).
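The section above does not print the full HPO objective, but the described ELBO reduction yields a Diffusion-DPO-style loss over squared noise-prediction errors. The sketch below assumes that form, with scalar errors standing in for the per-sample MSE terms and `beta` a hypothetical temperature; the paper's exact loss may differ in weighting.

```python
import math

def hpo_dpo_loss(err_win_policy, err_win_ref, err_lose_policy, err_lose_ref,
                 beta=0.1):
    """DPO-style preference loss on noise-prediction errors. Each argument is
    a squared error ||eps - eps_hat||^2 for the preferred ("win") or rejected
    ("lose") video, under the trainable (policy) or frozen (reference) model."""
    win_margin  = err_win_policy  - err_win_ref
    lose_margin = err_lose_policy - err_lose_ref
    # Loss shrinks when the policy denoises the winner better (relative to the
    # reference) than it denoises the loser.
    logit = -beta * (win_margin - lose_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logit)))   # -log(sigmoid(logit))
```

At initialization the policy equals the reference, both margins vanish, and the loss starts at log 2, as in standard DPO.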
Key Experimental Results¶
Main Results¶
| Method | Face Sim.↑ | Dyna. Deg.↑ | T. Cons.↑ | CLIP-T↑ | FVD↓ |
|---|---|---|---|---|---|
| DreamBooth | 0.276 | 5.690 | 0.9922 | 25.83 | 1423.55 |
| MagicMe | 0.322 | 5.332 | 0.9924 | 25.42 | 1438.66 |
| IDAnimator | 0.433 | 10.33 | 0.9938 | 25.21 | 1558.33 |
| ConsisID | 0.482 | 9.26 | 0.9811 | 26.12 | 1633.21 |
| MagicID | 0.600 | 14.42 | 0.9933 | 26.28 | 1228.33 |
Ablation Study (Hybrid Pair Selection)¶
| Configuration | Face Sim.↑ | Dynamic↑ | CLIP-T↑ | Note |
|---|---|---|---|---|
| No preference (self-recon) | 0.276 | 5.690 | 25.83 | Baseline DreamBooth |
| + Identity preference pairs | 0.605 | 7.382 | 25.94 | Large identity gain |
| + Identity + Dynamic preference pairs | 0.600 | 14.42 | 26.28 | Dynamic degree doubled |
Ablation Study (Reward Combination)¶
| ID | Dynamic | Semantic | Face↑ | Dynamic↑ | CLIP-T↑ |
|---|---|---|---|---|---|
| ✓ | | | 0.598 | 6.332 | 24.92 |
| ✓ | ✓ | | 0.607 | 12.33 | 25.73 |
| ✓ | ✓ | ✓ | 0.600 | 14.42 | 26.28 |
Key Findings¶
- Preference optimization substantially outperforms self-reconstruction: Face Similarity improves from 0.276 to 0.600 (+117%); Dynamic Degree improves from 5.69 to 14.42 (+153%).
- Complementarity of the two-stage hybrid strategy: The identity preference stage focuses on improving Face Sim., while the dynamic preference stage doubles dynamic degree without compromising identity.
- Each reward dimension contributes independently: The dynamic reward yields the largest gain in motion quality; the semantic reward additionally improves prompt-following ability.
- MagicID does not require large-scale video training data: Unlike IDAnimator and ConsisID (which require thousands of high-quality person videos), MagicID relies only on a small number of reference images.
Highlights & Insights¶
- First application of DPO to identity-customized video generation: The paper clearly diagnoses two core deficiencies of self-reconstruction training (identity degradation and motion weakening) and provides a principled solution via preference optimization.
- Pareto frontier sampling: A novel approach for selecting preference pairs in multi-objective optimization settings, ensuring balance across both dimensions.
- Static video as an identity anchor: Expanding reference images into static videos and including them in the preference pool is a simple yet effective design choice.
- Thorough analysis: Quantitative analysis of frame count vs. identity consistency and training steps vs. dynamic degree intuitively demonstrates the failure modes of self-reconstruction training.
Limitations & Future Work¶
- Single-identity customization only: The method cannot generate videos featuring multiple custom identities.
- Dependency on pretrained identity encoder: The quality of ArcFace directly affects preference data construction.
- Non-trivial preference data construction cost: Generating a large number of candidate videos and evaluating rewards for each is computationally expensive.
- Validated only on HunyuanVideo: Generalizability to other T2V backbone models remains unknown.
- Fixed 61-frame generation length: Identity preservation in longer videos requires further investigation.
Related Work & Insights¶
- DreamBooth and MagicMe represent the inherent limitations of conventional self-reconstruction approaches.
- ConsisID uses a face adapter to encode identity information but suffers from copy-paste artifacts (unnatural motion combined with a static texture overlay effect).
- HuViDPO introduces DPO into general T2V generation; this work extends it to the customization setting with additional identity constraints.
- The Pareto frontier sampling strategy is generalizable to other multi-objective video optimization tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First application of DPO to video identity customization; the hybrid sampling strategy is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers quantitative, qualitative, user study, and ablation evaluations, though the number of baselines is somewhat limited.
- Writing Quality: ⭐⭐⭐⭐ — Problem diagnosis is clear; the two-stage framework follows a natural logical progression.
- Value: ⭐⭐⭐⭐ — Establishes a new training paradigm for video customization; the dual improvement in identity and dynamics has practical application value.