# DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
## Basic Information
- Conference: ICCV 2025
- arXiv: 2503.06505
- Code: Not released
- Area: Image Generation
- Keywords: personalized image generation, multi-ID generation, facial editing, attention mechanism, diffusion models
## TL;DR
DynamicID achieves zero-shot single/multi-identity personalized image generation through two core components — Semantic Activation Attention (SAA) and Identity-Motion Reconfigurer (IMR) — while maintaining high fidelity and flexible facial editability.
## Background & Motivation
Personalized human image generation aims to preserve consistent identity from reference images while incorporating user-specified text prompts. Existing tuning-free methods suffer from two critical limitations:
- Limited multi-ID generation: Most methods are designed for single-identity scenarios and face severe identity blending in multi-identity settings, where facial features from different reference subjects become entangled during generation.
- Insufficient facial editability: Existing methods do not explicitly disentangle identity features (facial structure, skin texture) from motion features (expression, pose), preventing flexible editing of facial attributes.
Existing mitigation strategies (e.g., FastComposer's local cross-attention, InstantFamily's masked cross-attention) are retrofitted onto single-ID frameworks, and this patchwork degrades the frameworks' core generation functionality.
## Method
### Overall Architecture
DynamicID comprises three core components: a face encoder (extracting facial features), IMR (disentangling and reconfiguring facial motion and identity in feature space), and SAA (injecting processed features into the T2I model). A task-decoupled training paradigm is adopted:
- Anchoring stage: Joint training of SAA and the face encoder (single-ID datasets only).
- Reconfiguration stage: The above components are frozen; IMR is trained using the VariFace-10k dataset.
### Semantic Activation Attention (SAA)
The softmax normalization in standard cross-attention forces each query to distribute a fixed total attention mass across all keys, even when no semantic relevance exists between a query and the keys. This leads to: (1) disruption of the original model behavior; and (2) identity blending in multi-ID scenarios.
SAA instead introduces a query-level activation gating mechanism:

\[
\text{SAA}(Q, K, V) = \underbrace{\text{Norm}\!\left(\frac{QK^\top}{\sqrt{d}}\, J\right)}_{w} \odot\ \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V
\]

where \(J \in \mathbb{R}^{k \times 1}\) is an all-ones column vector (multiplying by \(J\) sums each query's raw attention logits over all \(k\) keys) and \(\text{Norm}(\cdot)\) applies min-max normalization to \([0,1]\). The activation weight \(w\) reflects the semantic relevance of each query to the reference facial information:
- Facial-region queries → strong activation (close to 1)
- Background-region queries → suppressed (close to 0)
- Body-region queries → moderate activation
This yields three zero-shot generalization capabilities:
- Context decoupling: Non-facial regions are not disturbed by facial information.
- Layout control: Spatial layout can be controlled by modulating activation weights via masks.
- Multi-ID personalization: When injecting facial information for one subject, queries in regions corresponding to other subjects are suppressed.
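To make the mechanism concrete, here is a minimal PyTorch sketch of SAA for a single attention head with unbatched inputs; the function name and the optional mask argument (covering the layout-control and multi-ID cases above) are illustrative assumptions, not the paper's code:

```python
import torch

def semantic_activation_attention(q, k, v, mask=None):
    # q: (n, d) image-token queries; k, v: (m, d) reference-face keys/values.
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale         # (n, m) raw relevance scores
    # Query-level activation: summing over keys equals multiplying by the
    # all-ones column vector J, then min-max normalizing to [0, 1].
    w = logits.sum(dim=-1, keepdim=True)               # (n, 1), i.e. logits @ J
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)     # Norm(.): min-max to [0, 1]
    if mask is not None:
        w = w * mask                                   # (n, 1) layout / multi-ID mask
    attn = logits.softmax(dim=-1)                      # standard softmax over keys
    return w * (attn @ v)                              # gated facial-feature injection
```

The key difference from standard cross-attention is that a query with no facial relevance gets \(w \approx 0\) and therefore receives almost no injected facial feature, whereas softmax alone would still force it to absorb a full unit of attention mass.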
### Identity-Motion Reconfigurer (IMR)
IMR consists of DisentangleNet \(\phi_1\) and EntangleNet \(\phi_2\), which disentangle and reconfigure identity and motion in feature space. Schematically,

\[
f_{\text{id}} = \phi_1(f), \qquad \hat{f} = \phi_2\!\left(f_{\text{id}}, \psi\right)
\]

where \(f\) is the face-encoder feature, \(f_{\text{id}}\) is its motion-free identity component, and the motion features \(\psi\) are derived from facial prompts (text encoding) and facial keypoints (keypoint encoder), fused via an MLP.
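A hypothetical PyTorch sketch of this path, assuming plain MLPs for \(\phi_1\), \(\phi_2\), and the motion-fusion module (no official code is released, so all layer shapes and names here are assumptions):

```python
import torch
import torch.nn as nn

class IMR(nn.Module):
    """Identity-Motion Reconfigurer sketch: phi_1 (DisentangleNet) strips
    motion from the raw face feature; phi_2 (EntangleNet) re-entangles the
    identity feature with a target motion feature psi."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.phi1 = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.phi2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # MLP fusing text-encoded facial prompts with keypoint encodings.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, f, text_emb, kpt_emb):
        psi = self.fuse(torch.cat([text_emb, kpt_emb], dim=-1))  # motion feature
        f_id = self.phi1(f)                                      # disentangled identity
        return self.phi2(torch.cat([f_id, psi], dim=-1))         # reconfigured feature
```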
### Loss & Training
Anchoring stage uses the standard diffusion noise-prediction loss:

\[
\mathcal{L}_{\text{anchor}} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\!\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2\right]
\]

Reconfiguration stage uses a dual-objective loss, schematically:

\[
\mathcal{L}_{\text{recon}} = \left\lVert \hat{f} - f_{\text{tgt}} \right\rVert_2^2 + \lambda\, \mathbb{E}\!\left[\left\lVert \epsilon_\theta(z_t, t, \hat{f}) - \epsilon_\theta(z_t, t, f_{\text{tgt}}) \right\rVert_2^2\right]
\]

The first term is direct feature matching between the IMR output \(\hat{f}\) and the target face feature \(f_{\text{tgt}}\); the second is latent diffusion consistency (ensuring the predicted features share the same semantics in the generative model's latent space), with \(\lambda = 1\).
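A minimal PyTorch sketch of this objective under the schematic above; the function and argument names are assumptions, not the paper's API:

```python
import torch
import torch.nn.functional as F

def reconfiguration_loss(f_pred, f_tgt, eps_pred, eps_tgt, lam=1.0):
    """Dual-objective reconfiguration loss (sketch; names are assumptions).

    f_pred:   IMR output feature (f-hat) for the target motion.
    f_tgt:    face-encoder feature of the ground-truth target image.
    eps_pred: frozen denoiser's noise prediction conditioned on f_pred.
    eps_tgt:  frozen denoiser's noise prediction conditioned on f_tgt.
    """
    feature_match = F.mse_loss(f_pred, f_tgt)           # direct feature matching
    latent_consistency = F.mse_loss(eps_pred, eps_tgt)  # same latent semantics
    return feature_match + lam * latent_consistency     # paper sets lambda = 1
```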
### VariFace-10k Dataset
Constructed for IMR training: 10k distinct identities, each with 35 facial images spanning varied expressions, poses, and lighting conditions.
## Key Experimental Results
### Main Results: Single-ID Personalization Quantitative Comparison
| Method | Arch. | CLIP-T ↑ | FaceSim ↑ | Expr ↑ | Pose ↑ |
|---|---|---|---|---|---|
| PuLID-FLUX | FLUX | 0.237 | 0.667 | 0.181 | 0.273 |
| PhotoMaker-v2 | SDXL | 0.238 | 0.592 | 0.243 | 0.869 |
| InstantID | SDXL | 0.233 | 0.723 | 0.151 | 0.264 |
| IPA-FaceID-Plus | SD1.5 | 0.236 | 0.712 | 0.156 | 0.266 |
| MasterWeaver | SD1.5 | 0.237 | 0.651 | 0.189 | 0.278 |
| DynamicID (Ours) | SD1.5 | 0.239 | 0.671 | 0.456 | 0.878 |
DynamicID leads on Expr by a wide margin (+0.213 over the next-best PhotoMaker-v2) and also edges out PhotoMaker-v2 on Pose (0.878 vs. 0.869, with every other baseline below 0.28), demonstrating far superior facial editability. Its FaceSim is slightly lower than InstantID/IPA-FaceID-Plus, but those methods' high FaceSim scores reflect direct facial copying, as confirmed by their low Expr/Pose values.
### Ablation Study: Contributions of SAA and IMR
| Method | CLIP-T ↑ | FaceSim ↑ | Expr ↑ | Pose ↑ |
|---|---|---|---|---|
| Ours w/o SAA | 0.224 | 0.682 | 0.422 | 0.862 |
| Ours w/o IMR | 0.228 | 0.712 | 0.161 | 0.253 |
| Ours (Full) | 0.239 | 0.671 | 0.456 | 0.878 |
- Removing SAA → significant drop in CLIP-T (0.224 vs. 0.239), indicating that SAA preserves the base model's original prompt-following and editing capability.
- Removing IMR → sharp degradation in Expr/Pose (0.161/0.253 vs. 0.456/0.878), while FaceSim rises (direct copying of the reference face).
### Multi-ID Personalization Comparison
| Method | CLIP-T ↑ | FaceSim ↑ | Expr ↑ | Pose ↑ |
|---|---|---|---|---|
| FastComposer | 0.233 | 0.594 | 0.144 | 0.256 |
| UniPortrait | 0.235 | 0.718 | 0.149 | 0.268 |
| StoryMaker | 0.219 | 0.678 | 0.147 | 0.296 |
| DynamicID | 0.237 | 0.664 | 0.431 | 0.867 |
DynamicID maintains a significant advantage in multi-ID scenarios, with Expr and Pose substantially outperforming all baselines.
## Highlights & Insights
- Core insight behind SAA: The softmax normalization in standard cross-attention is the root cause of disrupted model behavior and identity blending; query-level activation gating is an elegant solution.
- Zero-shot multi-ID generalization: Multi-identity generation is achieved without multi-ID training data, relying solely on manipulation of SAA activation weights.
- Task-decoupled training: Splitting joint training into two stages reduces data requirements (Anchoring requires only single-ID data; IMR requires only multiple images of the same individual).
- Feature-space operations: IMR performs identity–motion disentanglement in latent space rather than pixel space, yielding computational efficiency and strong generalization.
## Limitations & Future Work
- Because it is built on SD1.5, generation quality is bounded by the base model's capacity; adaptation to stronger backbones such as SDXL/FLUX has not been explored.
- The diversity of the VariFace-10k dataset may be insufficient to cover all facial variations.
- Layout control for three or more identities requires manual bounding box specification.
- Inference uses 50-step DDIM, leaving room for efficiency improvement.
## Related Work & Insights
- InstantID achieves high fidelity via ControlNet but lacks expression editing capability.
- PhotoMaker fuses facial features in the text embedding space, preserving partial editability at the cost of fidelity.
- IP-Adapter's decoupled cross-attention pioneered attention-based feature injection, albeit at a coarse granularity.
- The activation gating idea in SAA is generalizable to other conditional injection settings (e.g., style transfer, pose control).
## Rating
⭐⭐⭐⭐ — The method is elegantly designed; in particular, SAA's query-level activation gating gracefully addresses both multi-ID generation and model behavior preservation. Facial editability metrics substantially surpass prior state of the art. The SD1.5 backbone, however, imposes an upper bound on generation quality.