OmniPortrait: Fine-Grained Personalized Portrait Synthesis via Pivotal Optimization
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=DVmR3Ij0ap
Code: To be confirmed
Area: Diffusion Models / Image Generation
Keywords: Portrait customization, Identity fidelity, Pivotal optimization, Test-time guidance, Diffusion feature matching
TL;DR
OmniPortrait decomposes "identity customization" into two coarse-to-fine steps. First, a frozen denoiser paired with a trained Pivot ID Encoder (the encoder is the only trained component besides a linear projection) provides a coarse-grained identity "pivot". Then, during inference, a training-free RB-Guidance matches intermediate diffusion features against the reference image and optimizes the latents by gradient. This captures fine-grained details of the reference face without compromising text editability, and the method reports the best identity similarity (SIM) and text alignment (CLIP-T) among the compared baselines.
Background & Motivation
Background: Within text-to-image diffusion models, personalized portrait generation follows two primary paths: test-time fine-tuning (e.g., DreamBooth, Textual Inversion) or adding identity encoders to the diffusion condition space (e.g., IP-Adapter, InstantID, PhotoMaker) to inject reference features. The latter has become mainstream as it requires only a single reference image without per-person training.
Limitations of Prior Work: These "single-stream" injection methods capture coarse identity concepts (gender, face shape) but fail to preserve fine-grained details like moles or specific textures. The loss of detail results in over-smoothed or artificial-looking images. Conversely, methods like FastComposer that use full fine-tuning destroy the rich priors of pre-trained models, leading to abnormal scene composition and a sharp decline in text editability.
Key Challenge: A trade-off exists between identity fidelity and text editability. Fine-tuning the denoiser improves fidelity but damages priors; using only an encoder preserves editability but loses details. The paper illustrates this trade-off curve using the delayed conditioning parameter \(\alpha\) in FastComposer: increasing \(\alpha\) improves identity but causes text control to fail.
Goal: Achieve both coarse-grained identity consistency and fine-grained facial details while preserving text alignment, all without modifying denoiser parameters.
Key Insight: The authors borrow the idea of PTI (Pivotal Tuning Inversion) from GAN inversion: find a "pivot" as a stable initialization, then perform local optimization around it. In diffusion customization, this translates to obtaining a coarse but reliable identity initialization, followed by fine-grained test-time optimization.
Core Idea: Utilize "pivotal optimization" for dual-stream, coarse-to-fine identity guidance. The first stream involves a frozen denoiser and a trained Pivot ID Encoder to provide an identity pivot; the second stream is the test-time RB-Guidance, which performs dense matching in the diffusion feature space and backpropagates gradients to refine details.
Method
Overall Architecture
OmniPortrait is built upon latent diffusion models (SD / SDXL) and extends condition injection into energy-based diffusion guidance. Given a reference face \(x_{ref}\) and a target prompt \(P_t\), the goal is to generate a portrait that matches the text scene while retaining reference details. The pipeline has two phases. In the training phase, only the Pivot ID Encoder and a linear projection layer are trained while the denoiser remains frozen; a face localization loss constrains the identity embedding to the face region, which yields a reliable "identity pivot". In the inference phase, the encoder and frozen denoiser first produce a coarse-grained portrait, and RB-Guidance is then activated in the middle-to-late stages of diffusion (\(t \le 0.6T\)). RB-Guidance matches features against the reference, formulates a similarity-based energy function, and optimizes the noise latents via its gradients to "pull" fine-grained details toward the reference identity.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Ref Image x_ref + Prompt"] --> B["Pivot ID Encoder + Localization Loss<br/>ID Features → Mixed Pivot Embedding"]
B --> C["Frozen Denoiser Coarse Generation<br/>Pivot Initialization + Face Localization"]
C -->|Activated at t ≤ 0.6T| D["RB-Guidance<br/>Diffusion Feature Matching + Background Gradient Mask"]
D --> E["On-the-fly Pivotal Optimization<br/>Energy Function Guides Latent Gradients"]
E --> F["Fine-grained Identity-Faithful Portrait"]
Key Designs
1. Pivot ID Encoder + Face Localization Loss: Establishing a Reliable Identity Pivot
Because fine-tuning the denoiser damages its pre-trained priors, the authors keep the base model frozen. A visual encoder (OpenCLIP-ViT-L/14) extracts features \(e_{ref}\), which are concatenated with the text embeddings \(e_{txt}\) and passed through a linear layer to produce the mixed pivot embedding \(e_{mix}\). This essentially "binds" the reference identity to the token that best represents the concept (e.g., woman/man).
To prevent \(e_{mix}\) from spreading across the entire body, the Face Localization Loss \(\mathcal{L}_{loc}\) is introduced. It constrains the cross-attention map \(A_{t}(m) \in [0,1]^{16\times16}\) of \(e_{mix}\) at time \(t\) to approximate a scaled face segmentation mask \(M\), with \(\eta = 2\times10^{-3}\) as its hyperparameter. This constraint improves identity similarity and allows the face mask to be extracted directly from the attention maps during inference, providing an entry point for localized optimization.
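The localization constraint can be illustrated with a small sketch. The exact loss is not reproduced in these notes, so the form below (a mean squared error between the 16×16 attention map and an average-pooled mask, weighted by \(\eta\)) is an assumption, and the helper names `downscale_mask` and `localization_loss` are hypothetical.

```python
# Hypothetical sketch: the MSE form and the role of eta as a weight are
# assumptions, not the paper's confirmed definition.
ETA = 2e-3  # value of eta reported in the text


def downscale_mask(mask, size=16):
    """Average-pool a square binary mask (list of lists) to size x size."""
    n = len(mask)
    k = n // size  # pooling window; assumes n is a multiple of size
    return [
        [
            sum(mask[i * k + di][j * k + dj] for di in range(k) for dj in range(k))
            / (k * k)
            for j in range(size)
        ]
        for i in range(size)
    ]


def localization_loss(attn, mask16, eta=ETA):
    """eta * mean squared error between attention map and scaled mask."""
    size = len(attn)
    sq = sum(
        (attn[i][j] - mask16[i][j]) ** 2 for i in range(size) for j in range(size)
    )
    return eta * sq / (size * size)


# Toy 64x64 mask with a centred square "face" region.
face_mask = [[1 if 16 <= i < 48 and 16 <= j < 48 else 0 for j in range(64)]
             for i in range(64)]
mask16 = downscale_mask(face_mask)
uniform_attn = [[0.5] * 16 for _ in range(16)]
print(localization_loss(mask16, mask16))        # attention equal to the mask
print(localization_loss(uniform_attn, mask16))  # attention spread everywhere
```

An attention map that coincides with the mask gives zero loss, while attention spread uniformly over the image is penalized, which is the behaviour the text attributes to the constraint.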
2. RB-Guidance: Training-Free Detail Recovery via Diffusion Feature Matching
This step recovers details lost by single-stream methods without any training. It performs three functions:
- Region Mask Extraction: Instead of simple thresholding, it uses structure-rich self-attention maps \(A_s(m)\) to iteratively refine the cross-attention map, \(\hat{A}^c_t(m) = A_s(m)\cdot \text{norm}(\cdots\text{norm}(A^c_t(m))^\alpha\cdots)^\alpha\), with \(\gamma=3\) iterations and \(\alpha=2\), and produces the face mask \(M_{gen}\) via the threshold \(\beta=0.5\).
- Diffusion Feature Correspondence: Utilizing the local correspondence of diffusion features (DIFT), reference features \(D^{ref}_{t_0}\) (at \(t_0=671\)) are matched with generated features \(D^{gen}_t\) inside the mask. For each reference position \(p_{ref} \in M_{ref}\), the match is \(p_{gen} = \arg\min_{p\in M_{gen}} d(D^{ref}_{t_0}[p_{ref}], D^{gen}_t[p])\). The matching score \(S_{dift}\) is calculated via cosine similarity.
- Background Gradient Masking (BGM): To prevent gradients from blurring the background, the score gradient is filtered using \(M_{gen}\): \(\hat{S}_{dift} = (\cdots)\odot M_{gen}\). This ensures only the face region is optimized.
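The three functions above can be sketched on flattened toy maps. This is a minimal illustration under stated assumptions, not the authors' implementation: `norm` is taken to be min-max normalization, the self-attention map is applied once as a matrix product, features are plain lists rather than DIFT activations, and all function names are hypothetical.

```python
import math


def minmax(v):
    """Min-max normalise a flat list to [0, 1] (choice of norm is an assumption)."""
    lo, hi = min(v), max(v)
    if hi == lo:
        return [0.0 for _ in v]
    return [(x - lo) / (hi - lo) for x in v]


def refine_mask(cross_attn, self_attn, alpha=2, gamma=3, beta=0.5):
    """Sharpen cross-attention gamma times, propagate via self-attention, threshold."""
    a = list(cross_attn)
    for _ in range(gamma):
        a = [x ** alpha for x in minmax(a)]
    a = [sum(w * x for w, x in zip(row, a)) for row in self_attn]
    return [1 if x >= beta else 0 for x in minmax(a)]


def cosine(u, v):
    """Cosine similarity with a small epsilon for numerical safety."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def match_features(ref_feats, ref_mask, gen_feats, gen_mask):
    """Map each masked reference position to its most similar masked generated position."""
    gen_idx = [i for i, m in enumerate(gen_mask) if m]
    matches, scores = {}, {}
    for p_ref, m in enumerate(ref_mask):
        if not m or not gen_idx:
            continue
        best = max(gen_idx, key=lambda p: cosine(ref_feats[p_ref], gen_feats[p]))
        matches[p_ref] = best
        scores[p_ref] = cosine(ref_feats[p_ref], gen_feats[best])
    return matches, scores


def mask_gradient(grad, gen_mask):
    """Background gradient masking: zero the gradient outside the face mask."""
    return [g * m for g, m in zip(grad, gen_mask)]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(refine_mask([0.1, 0.9, 0.8, 0.2], identity))  # [0, 1, 0, 0]
print(mask_gradient([0.5, -0.2, 0.7], [1, 1, 0]))   # [0.5, -0.2, 0.0]
```

In the toy call, repeated squaring suppresses the second-strongest response (0.8) below the threshold, which illustrates why iterative refinement yields a tighter mask than a single threshold on the raw map.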
3. On-the-fly Pivotal Optimization: Guiding Gradients via Energy Functions
Applying matching from the start would cause divergence because early-stage generations lack a stable structure. The matching score is therefore wrapped into an energy function \(\mathcal{E}\), and the energy gradient \(\nabla_{z_t}\mathcal{E}\) is added to the classifier-free guidance noise prediction only when \(t \le \hat{t} = uT\) (where \(u=0.6\)). The gradient term is scaled by \(p\), which controls the optimization strength (set to 8.5). This localized refinement around the "pivot" ensures stability and effectiveness.
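The gating can be sketched as follows. The classifier-free guidance combination is the standard one; the sign with which the energy gradient enters, the value \(T=1000\), and the evenly spaced 50-step schedule are assumptions made for illustration, and `guided_noise` is a hypothetical name.

```python
def guidance_active(t, T=1000, u=0.6):
    """RB-Guidance is applied only in the middle-to-late steps, t <= u * T."""
    return t <= u * T


def guided_noise(eps_uncond, eps_cond, energy_grad, t, w=7.5, p=8.5, T=1000, u=0.6):
    """Standard CFG combination plus a gated energy-gradient term.

    The '+' sign on the gradient term is an assumption; it depends on how
    the energy function is defined.
    """
    eps = [eu + w * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]
    if guidance_active(t, T, u):
        eps = [e + p * g for e, g in zip(eps, energy_grad)]
    return eps


# Assumed schedule: 50 evenly spaced timesteps from 981 down to 1.
timesteps = [981 - 20 * i for i in range(50)]
active = [t for t in timesteps if guidance_active(t)]
print(len(active), active[0])  # 30 581
```

Under this assumed schedule, the first 20 sampling steps run with plain classifier-free guidance and the remaining 30 add the energy term, so the pivot has time to establish a stable face structure before matching begins.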
Loss & Training
Only the Pivot ID Encoder and the linear layer are optimized, using AdamW with a learning rate of \(2\times10^{-5}\). The SD version was trained for 80k steps and the SDXL version for 60k steps (on data filtered to 1024 resolution), both on 4×A100-80GB GPUs. The text and ID conditions are dropped 10% of the time during training to support classifier-free guidance (CFG). Inference uses a guidance scale of \(w=7.5\) and 50 sampling steps.
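Condition dropout for CFG can be sketched as below. Whether the text and ID conditions are dropped jointly or independently is not stated in these notes; the sketch assumes a joint drop, and `drop_conditions` and the null embeddings are hypothetical placeholders.

```python
import random


def drop_conditions(text_emb, id_emb, null_text, null_id, rng, p_drop=0.1):
    """With probability p_drop, replace both conditions with null embeddings.

    Joint dropping is an assumption; an implementation could also drop
    each condition independently.
    """
    if rng.random() < p_drop:
        return null_text, null_id
    return text_emb, id_emb


rng = random.Random(0)  # seeded for reproducibility
n = 10_000
dropped = sum(
    drop_conditions("text", "id", None, None, rng)[0] is None for _ in range(n)
)
print(dropped / n)  # close to 0.1
```

Training on a fraction of unconditional samples is what lets the same denoiser supply both the conditional and unconditional predictions that the guidance scale \(w\) combines at inference.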
Key Experimental Results
Main Results
The evaluation uses 50 reference images (CelebA-HQ / FFHQ) with 30 prompts each (1,500 reference-prompt pairs) and covers text editability (CLIP-T), identity fidelity (CLIP-I, SIM), and image quality (IQA, FID).
| Method | Plug-and-Play | CLIP-T↑ | CLIP-I↑ | SIM↑ | IQA↑ | FID↓ |
|---|---|---|---|---|---|---|
| DreamBooth | ✗ | 23.32 | 66.45 | 60.07 | 85.88 | 219.40 |
| InstantID | ✓ | 22.26 | 73.03 | 68.35 | 84.17 | 221.62 |
| IP-Adapter | ✓ | 23.93 | 68.23 | 64.19 | 88.15 | 211.15 |
| Ours | ✓ | 24.25 | 73.08 | 69.13 | 86.80 | 213.48 |
Among the compared methods, OmniPortrait achieves the best identity fidelity (CLIP-I, SIM) and text alignment (CLIP-T), while IP-Adapter retains the best IQA and FID. Unlike InstantID, which behaves like "copy-paste" and loses text control (CLIP-T 22.26), OmniPortrait does not sacrifice one for the other.
Ablation Study
Ablation on SD version:
| Configuration | CLIP-T↑ | CLIP-I↑ | SIM↑ | FID↓ | Description |
|---|---|---|---|---|---|
| w/o \(\mathcal{L}_{loc}\) | 22.11 | 46.54 | 35.01 | 476.22 | ID injection misaligned; regions outside face destroyed |
| w/o PIE | 23.19 | 21.83 | 14.15 | 383.17 | No pivot; gradients disperse; no ID enhancement |
| w/o BGM | 19.34 | 37.30 | 33.42 | 483.93 | Gradient leakage blurs background |
| w/o RB-Guidance | 24.88 | 66.10 | 63.87 | 210.45 | Only coarse consistency; lacks facial details |
| Full | 24.25 | 73.08 | 69.13 | 213.48 | Complete Model |
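To read the table as per-component contributions, the rows can be compared against the full model. The snippet below only re-uses the numbers from the ablation table; the helper name `delta_vs_full` is hypothetical.

```python
# Numbers copied from the ablation table above (SD version).
ABLATION = {
    "w/o L_loc":       {"CLIP-T": 22.11, "CLIP-I": 46.54, "SIM": 35.01, "FID": 476.22},
    "w/o PIE":         {"CLIP-T": 23.19, "CLIP-I": 21.83, "SIM": 14.15, "FID": 383.17},
    "w/o BGM":         {"CLIP-T": 19.34, "CLIP-I": 37.30, "SIM": 33.42, "FID": 483.93},
    "w/o RB-Guidance": {"CLIP-T": 24.88, "CLIP-I": 66.10, "SIM": 63.87, "FID": 210.45},
    "Full":            {"CLIP-T": 24.25, "CLIP-I": 73.08, "SIM": 69.13, "FID": 213.48},
}


def delta_vs_full(config, table=ABLATION):
    """Full-model metric minus the ablated metric, rounded to two decimals."""
    full = table["Full"]
    return {k: round(full[k] - v, 2) for k, v in table[config].items()}


for name in ABLATION:
    if name != "Full":
        print(name, delta_vs_full(name))
```

For example, `delta_vs_full("w/o RB-Guidance")` gives SIM +5.26 and CLIP-I +6.98 at a cost of 0.63 CLIP-T and 3.03 FID, while removing PIE costs 54.98 SIM points.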
Key Findings
- PIE (Identity Pivot) is the Foundation: Without it, SIM drops from 69.13 to 14.15 because RB-Guidance cannot locate or match features without a reliable initialization.
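- BGM Determines Quality: Masking gradients is essential to prevent blurring; without BGM, FID rises from 213.48 to 483.93 and CLIP-T falls from 24.25 to 19.34.
- RB-Guidance Handles Details: It is the key to recovering fine-grained details, raising SIM from 63.87 to 69.13 and CLIP-I from 66.10 to 73.08 at a small cost to text alignment (CLIP-T 24.88 to 24.25) and FID (210.45 to 213.48).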
- Timing \(u\) is Sensitive: \(u=0.6\) represents the optimal trade-off between stability and guidance effectiveness.
- Dataset Contribution: The authors released OmniPortrait-1M, a million-scale dataset with fine-grained facial and body annotations, filling the gap for high-quality multimodal face data.
Highlights & Insights
- Migration of PTI to Diffusion: The "pivotal optimization" paradigm avoids the dilemma of "modifying denoisers vs. ignoring details."
- Training-Free RB-Guidance: Leveraging DIFT for dense matching and energy-based guidance makes it plug-and-play for various base models.
- Dual-Purpose Localization Loss: Handles both training constraints and inference-time mask generation in one design.
- Effective Gradient Masking: A simple solution to the common problem of background blurring in test-time optimization.
Limitations & Future Work
- The method assumes a single face; robustness in multi-person or occlusion scenarios is not fully quantified.
- RB-Guidance adds gradient computation at inference, so sampling is likely slower than vanilla diffusion; the latency cost is not quantified.
- Many hyperparameters (\(t_0\), \(u\), \(p\), \(\gamma\)) are set empirically, and their generalizability across different base models requires further verification.
Related Work & Insights
- vs. InstantID: OmniPortrait uses a dual-stream coarse-to-fine scheme to exceed InstantID's identity fidelity without the significant drop in text alignment.
- vs. FastComposer: By keeping the denoiser frozen, OmniPortrait avoids destroying pre-trained priors.
- vs. IP-Adapter: Adds a test-time detail-recovery mechanism that breaks the detail ceiling of single-stream encoders.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Clear originality in applying PTI to the diffusion dual-stream paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive baselines and ablations, though it lacks quantitative multi-person and latency analysis.
- Writing Quality: ⭐⭐⭐⭐ Logical flow and clear explanations, though hyperparameter sensitivity could be more systematic.
- Value: ⭐⭐⭐⭐⭐ High practical value for plug-and-play high-fidelity generation and its large-scale dataset.