OmniPortrait: Fine-Grained Personalized Portrait Synthesis via Pivotal Optimization¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=DVmR3Ij0ap
Code: To be confirmed
Area: Diffusion Models / Image Generation
Keywords: Portrait customization, Identity fidelity, Pivotal optimization, Test-time guidance, Diffusion feature matching

TL;DR¶

OmniPortrait decomposes "identity customization" into two coarse-to-fine steps. First, a Pivot ID Encoder, trained while the denoiser stays frozen, provides a coarse-grained identity "pivot". Then, during inference, a training-free RB-Guidance matches intermediate diffusion features against the reference image and optimizes the latents by gradient. This captures fine-grained details of the reference face without compromising text editability, achieving new SOTA in both identity similarity (SIM) and text alignment (CLIP-T).

Background & Motivation¶

Background: Within text-to-image diffusion models, personalized portrait generation follows two primary paths: test-time fine-tuning (e.g., DreamBooth, Textual Inversion) or adding identity encoders to the diffusion condition space (e.g., IP-Adapter, InstantID, PhotoMaker) to inject reference features. The latter has become mainstream as it requires only a single reference image without per-person training.

Limitations of Prior Work: These "single-stream" injection methods capture coarse identity concepts (gender, face shape) but fail to preserve fine-grained details like moles or specific textures. The loss of detail results in over-smoothed or artificial-looking images. Conversely, methods like FastComposer that use full fine-tuning destroy the rich priors of pre-trained models, leading to abnormal scene composition and a sharp decline in text editability.

Key Challenge: A trade-off exists between identity fidelity and text editability. Fine-tuning the denoiser improves fidelity but damages priors; using only an encoder preserves editability but loses details. The paper illustrates this trade-off curve using the delayed conditioning parameter \(\alpha\) in FastComposer: increasing \(\alpha\) improves identity but causes text control to fail.

Goal: Achieve both coarse-grained identity consistency and fine-grained facial details while preserving text alignment, all without modifying denoiser parameters.

Key Insight: The authors borrow the idea of PTI (Pivotal Tuning Inversion) from GAN inversion—finding a "pivot" as a stable initialization and then performing local optimization around it. In diffusion customization, this translates to obtaining a coarse but reliable identity initialization, followed by fine-grained test-time optimization.

Core Idea: Utilize "pivotal optimization" for dual-stream, coarse-to-fine identity guidance. The first stream involves a frozen denoiser and a trained Pivot ID Encoder to provide an identity pivot; the second stream is the test-time RB-Guidance, which performs dense matching in the diffusion feature space and backpropagates gradients to refine details.

Method¶

Overall Architecture¶

OmniPortrait is built on latent diffusion models (SD / SDXL) and extends condition injection with energy-based diffusion guidance. Given a reference face \(x_{ref}\) and a target prompt \(P_t\), the goal is to generate a portrait that matches the scene described by the text while retaining the details of the reference face. The pipeline has two phases. In the training phase, only a Pivot ID Encoder and a linear projection layer are trained while the denoiser stays frozen; a face localization loss constrains the identity embedding to the face region, which yields a reliable "identity pivot". In the inference phase, the trained encoder and the frozen denoiser first produce a coarse-grained portrait, and RB-Guidance is then activated in the middle-to-late stages of diffusion (\(t \le 0.6T\)). RB-Guidance matches features against the reference, turns the matching similarity into an energy function, and optimizes the noise latents via its gradient to "pull" fine-grained details toward the reference identity.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Ref Image x_ref + Prompt"] --> B["Pivot ID Encoder + Localization Loss<br/>ID Features → Mixed Pivot Embedding"]
    B --> C["Frozen Denoiser Coarse Generation<br/>Pivot Initialization + Face Localization"]
    C -->|Activated at t ≤ 0.6T| D["RB-Guidance<br/>Diffusion Feature Matching + Background Gradient Mask"]
    D --> E["On-the-fly Pivotal Optimization<br/>Energy Function Guides Latent Gradients"]
    E --> F["Fine-grained Identity-Faithful Portrait"]
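
To make the activation window \(t \le 0.6T\) concrete, the sketch below counts how many sampling steps would fall inside it. It assumes \(T = 1000\) training timesteps and an evenly spaced 50-step schedule; neither the value of \(T\) nor the scheduler is stated in this summary, and `guided_step_mask` is a hypothetical helper name.

```python
import numpy as np

T = 1000          # assumed number of training timesteps (not stated in the summary)
NUM_STEPS = 50    # sampling steps, as reported for inference
U = 0.6           # guidance activates once t <= U * T

def guided_step_mask(num_steps: int = NUM_STEPS, total: int = T, u: float = U) -> np.ndarray:
    """Return a boolean array marking the sampling steps where guidance is active."""
    # Assumed schedule: evenly spaced timesteps from high noise down to zero.
    timesteps = np.linspace(total - 1, 0, num_steps).round().astype(int)
    return timesteps <= u * total

mask = guided_step_mask()
print(f"{int(mask.sum())} of {mask.size} steps use RB-Guidance")
```

Under these assumptions the last 30 of the 50 steps are guided; a different scheduler or value of \(T\) would shift that count.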

Key Designs¶

1. Pivot ID Encoder + Face Localization Loss: Establishing a Reliable Identity Pivot

To address the issue where fine-tuning denoisers damages priors, the authors keep the base model frozen. A visual encoder (OpenCLIP-ViT-L/14) extracts features \(e_{ref}\), which are concatenated with text embeddings \(e_{txt}\) and passed through a linear layer to produce the mixed pivot embedding \(e_{mix}\). This essentially "binds" the reference identity to the token that best represents the concept (e.g., woman/man).
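
A minimal sketch of the mixing step follows, using toy dimensions. It assumes the projected vector replaces the embedding of the chosen concept token in place; the summary only says the identity is "bound" to that token, so the replacement scheme, the function name `mix_pivot_embedding`, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_pivot_embedding(e_txt, e_ref, weight, bias, token_index):
    """Project [text token ; reference feature] and write it back at token_index."""
    fused = np.concatenate([e_txt[token_index], e_ref])   # shape (d_txt + d_ref,)
    e_mix = weight @ fused + bias                          # shape (d_txt,)
    out = e_txt.copy()
    out[token_index] = e_mix                               # assumed in-place replacement
    return out

# Toy sizes, purely hypothetical.
d_txt, d_ref, n_tokens = 8, 6, 5
e_txt = rng.normal(size=(n_tokens, d_txt))     # text embeddings, one row per token
e_ref = rng.normal(size=d_ref)                 # reference identity feature
W = rng.normal(size=(d_txt, d_txt + d_ref)) * 0.1
b = np.zeros(d_txt)

conditioned = mix_pivot_embedding(e_txt, e_ref, W, b, token_index=2)
print(conditioned.shape)
```

Only the row at `token_index` changes, so the rest of the prompt conditioning is left as the text encoder produced it.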

To prevent \(e_{mix}\) from spreading across the entire body, the Face Localization Loss is introduced. It constrains the cross-attention map \(A_{t}(m) \in [0,1]^{16\times16}\) of \(e_{mix}\) at time \(t\) to approximate a scaled face segmentation mask \(M\):

\[\mathcal{L} = \mathcal{L}_{diff} + \eta \mathcal{L}_{loc},\quad \mathcal{L}_{loc} = \frac{1}{hw}\sum_{i,j}\big(A_t(i,j)-M(i,j)\big)^2,\]

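where \(\eta = 2\times10^{-3}\), \((i,j)\) indexes spatial positions, and \(h\times w\) is the resolution of the attention map (\(16\times16\) here). This constraint improves identity similarity and allows the face mask to be extracted directly from the attention maps during inference, providing an entry point for localized optimization.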
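
The loss above can be sketched in a few lines of NumPy. The average-pooling used to scale the mask down to the attention resolution is an assumption (the summary only says the mask is "scaled"), and the helper names are hypothetical.

```python
import numpy as np

ETA = 2e-3  # weight on the localization term, as reported

def downscale_mask(mask, size=16):
    """Average-pool a square mask down to size x size (assumed resizing scheme)."""
    factor = mask.shape[0] // size
    return mask.reshape(size, factor, size, factor).mean(axis=(1, 3))

def localization_loss(attn, mask):
    """Mean squared error between an attention map and a face mask of equal shape."""
    assert attn.shape == mask.shape
    return float(np.mean((attn - mask) ** 2))

def total_loss(l_diff, attn, mask, eta=ETA):
    """L = L_diff + eta * L_loc."""
    return l_diff + eta * localization_loss(attn, mask)

# Toy example: a 64x64 mask with a centered square "face".
face = np.zeros((64, 64))
face[16:48, 16:48] = 1.0
m = downscale_mask(face)                       # 16x16 target for the attention map

perfect = localization_loss(m, m)              # attention equal to the mask
empty = localization_loss(np.zeros_like(m), m) # attention that ignores the face
print(perfect, empty, total_loss(0.1, np.zeros_like(m), m))
```

With \(\eta\) this small, the localization term acts as a gentle regularizer next to the diffusion loss rather than dominating it.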

2. RB-Guidance: Training-Free Detail Recovery via Diffusion Feature Matching

This step recovers the details lost by single-stream methods without any additional training. It performs three functions:

- Region Mask Extraction: Instead of simply thresholding the cross-attention map, it uses the structure-rich self-attention maps \(A_s(m)\) for iterative refinement, \(\hat{A}^c_t(m) = A_s(m)\cdot \text{norm}(\cdots\text{norm}(A^c_t(m))^\alpha\cdots)^\alpha\), with \(\gamma=3\) iterations and exponent \(\alpha=2\). Thresholding the result at \(\beta=0.5\) produces the face mask \(M_{gen}\).
- Diffusion Feature Correspondence: Exploiting the local correspondence of diffusion features (DIFT), reference features \(D^{ref}_{t_0}\) (extracted at \(t_0=671\)) are matched with generated features \(D^{gen}_t\) inside the mask. For each reference location \(p_{ref}\in M_{ref}\), the match is \(p_{gen} = \arg\min_{p\in M_{gen}} d(D^{ref}_{t_0}[p_{ref}], D^{gen}_t[p])\). The matching score \(S_{dift}\) is computed via cosine similarity.
- Background Gradient Masking (BGM): To prevent gradients from blurring the background, the score gradient is filtered using \(M_{gen}\): \(\hat{S}_{dift} = (\cdots)\odot M_{gen}\). This ensures only the face region is optimized.
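
The three functions above can be sketched as follows. Several details are assumptions, not taken from the paper: `norm` is read as min-max normalization, the product with \(A_s\) is read as a matrix product over flattened positions, features are laid out as `(h, w, c)` arrays, and the score is the mean of the best cosine similarities. All function names are hypothetical.

```python
import numpy as np

def minmax(x):
    """Rescale to [0, 1]; assumed meaning of norm(.)."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def refine_mask(cross_attn, self_attn, gamma=3, alpha=2.0, beta=0.5):
    """Sharpen a cross-attention map, propagate it through self-attention, threshold."""
    h, w = cross_attn.shape
    a = cross_attn.reshape(-1)
    for _ in range(gamma):
        a = minmax(a) ** alpha
    a = minmax(self_attn @ a)          # self_attn: (h*w, h*w), assumed matrix product
    return (a > beta).reshape(h, w)

def match_score(feat_ref, mask_ref, feat_gen, mask_gen):
    """Mean cosine similarity between each reference feature and its best match."""
    ref = feat_ref[mask_ref]           # (n_ref, c) features inside the reference mask
    gen = feat_gen[mask_gen]           # (n_gen, c) features inside the generated mask
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    sim = ref @ gen.T                  # pairwise cosine similarity
    # Minimizing cosine distance is the same as maximizing cosine similarity.
    return float(sim.max(axis=1).mean())

def mask_gradient(grad, mask_gen):
    """Background gradient masking: zero the gradient outside the face region."""
    return grad * mask_gen[None, :, :]   # grad: (c, h, w), mask_gen: (h, w)

# Toy usage on a 4x4 grid.
cross = np.zeros((4, 4))
cross[1:3, 1:3] = 1.0
cross[0, 0] = 0.6                        # a weaker, spurious activation
m_gen = refine_mask(cross, np.eye(16))   # identity self-attention keeps the toy simple
print(m_gen.astype(int))
```

In the toy run the repeated normalize-and-power step suppresses the weaker activation at the corner, so only the central block survives the threshold.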

3. On-the-fly Pivotal Optimization: Guiding Gradients via Energy Functions

Applying matching from the start would cause divergence because the early-stage generation lacks a stable structure. The matching score is wrapped into an energy function:

\[\mathcal{E} = \frac{p}{1 + \hat{S}_{dift}(z_t, x_{ref}, M_{gen}, M_{ref})},\]

The energy gradient \(\nabla_{z_t}\mathcal{E}\) is added to the classifier-free guidance noise prediction only when \(t \le \hat{t} = uT\) (where \(u=0.6\)):

\[\hat{\epsilon}_y(z_t) = \begin{cases} \epsilon_\theta(z_t,t,\varnothing)+w(\epsilon_\theta(z_t,t,y)-\epsilon_\theta(z_t,t,\varnothing)), & t > \hat{t}\\ \cdots + \nabla_{z_t}\mathcal{E}, & t \le \hat{t}\end{cases}\]

Here, \(p\) controls optimization strength (set to 8.5). This localized refinement around the "pivot" ensures stability and effectiveness.
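
The gated update can be sketched as below. By the chain rule, \(\nabla_{z_t}\mathcal{E} = -\,p\,(1+\hat{S}_{dift})^{-2}\,\nabla_{z_t}\hat{S}_{dift}\), which follows directly from the energy definition above. In practice the score gradient would come from automatic differentiation through the denoiser features; here it is simply passed in as an array, and the function names are hypothetical.

```python
import numpy as np

P = 8.5       # optimization strength, as reported
U = 0.6       # guidance is active when t <= U * T
W_CFG = 7.5   # classifier-free guidance scale, as reported

def energy(score, p=P):
    """E = p / (1 + S)."""
    return p / (1.0 + score)

def energy_grad(score, score_grad, p=P):
    """Chain rule: dE/dz = -p / (1 + S)^2 * dS/dz."""
    return -p / (1.0 + score) ** 2 * score_grad

def guided_eps(eps_uncond, eps_cond, t, total_t,
               score=None, score_grad=None, w=W_CFG, u=U):
    """Classifier-free guidance, plus the energy gradient once t <= u * total_t."""
    eps = eps_uncond + w * (eps_cond - eps_uncond)
    if t <= u * total_t and score_grad is not None:
        eps = eps + energy_grad(score, score_grad)
    return eps

# Toy usage with 2-dimensional "latents".
eps_u, eps_c = np.zeros(2), np.ones(2)
early = guided_eps(eps_u, eps_c, t=900, total_t=1000,
                   score=0.0, score_grad=np.ones(2))   # t > 0.6T: plain CFG
late = guided_eps(eps_u, eps_c, t=500, total_t=1000,
                  score=0.0, score_grad=np.ones(2))    # t <= 0.6T: CFG + energy gradient
print(early, late)
```

Because the energy decreases as the score increases (for \(\hat{S}_{dift} > -1\)), lowering \(\mathcal{E}\) corresponds to raising the matching score.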

Loss & Training¶

Only the Pivot ID Encoder and linear layer are optimized using AdamW with a learning rate of \(2\times10^{-5}\). The SD version was trained for 80k steps and the SDXL version for 60k steps (filtered to 1024 resolution) on 4×A100-80GB GPUs. 10% dropout is applied to text and ID conditions for CFG support. Inference uses \(w=7.5\) and 50 steps of sampling.
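
The condition dropout used for CFG support might look like the sketch below. It assumes text and ID conditions are dropped jointly per sample and replaced by null embeddings; the summary does not say whether the two are dropped together or independently, and `drop_conditions` is a hypothetical name.

```python
import numpy as np

DROP_PROB = 0.1  # dropout rate, as reported

def drop_conditions(text_emb, id_emb, null_text, null_id, rng, p=DROP_PROB):
    """Replace the conditions of randomly chosen samples with null embeddings."""
    batch = text_emb.shape[0]
    drop = rng.random(batch) < p                      # one decision per sample (assumed joint drop)
    text_out = np.where(drop[:, None, None], null_text, text_emb)
    id_out = np.where(drop[:, None], null_id, id_emb)
    return text_out, id_out, drop

# Toy batch: 8 samples, 3 tokens of width 4, ID features of width 5.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 3, 4))
ident = rng.normal(size=(8, 5))
t_out, i_out, dropped = drop_conditions(text, ident, np.zeros((3, 4)), np.zeros(5), rng)
print(int(dropped.sum()), "of", dropped.size, "samples dropped")
```

Training on a fraction of unconditioned samples is what lets the same network supply the unconditional prediction needed by classifier-free guidance at inference.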

Key Experimental Results¶

Main Results¶

The evaluation uses 50 reference images (from CelebA-HQ / FFHQ) with 30 prompts each and covers text editability (CLIP-T), identity fidelity (CLIP-I, SIM), and image quality (IQA, FID).

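| Method | Plug-and-Play | CLIP-T↑ | CLIP-I↑ | SIM↑ | IQA↑ | FID↓ |
|---|---|---|---|---|---|---|
| DreamBooth | ✗ | 23.32 | 66.45 | 60.07 | 85.88 | 219.40 |
| InstantID | ✓ | 22.26 | 73.03 | 68.35 | 84.17 | 221.62 |
| IP-Adapter | ✓ | 23.93 | 68.23 | 64.19 | 88.15 | 211.15 |
| Ours | ✓ | 24.25 | 73.08 | 69.13 | 86.80 | 213.48 |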

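OmniPortrait achieves the best CLIP-T, CLIP-I, and SIM among the compared methods, while IP-Adapter remains ahead on IQA (88.15 vs. 86.80) and FID (211.15 vs. 213.48). Unlike InstantID, whose near "copy-paste" behavior comes with the lowest CLIP-T (22.26), OmniPortrait does not trade text control for identity fidelity.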

Ablation Study¶

Ablation on SD version:

| Configuration | CLIP-T↑ | CLIP-I↑ | SIM↑ | FID↓ | Description |
|---|---|---|---|---|---|
| w/o \(\mathcal{L}_{loc}\) | 22.11 | 46.54 | 35.01 | 476.22 | ID injection misaligned; regions outside face destroyed |
| w/o PIE | 23.19 | 21.83 | 14.15 | 383.17 | No pivot; gradients disperse; no ID enhancement |
| w/o BGM | 19.34 | 37.30 | 33.42 | 483.93 | Gradient leakage blurs background |
| w/o RB-Guidance | 24.88 | 66.10 | 63.87 | 210.45 | Only coarse consistency; lacks facial details |
| Full | 24.25 | 73.08 | 69.13 | 213.48 | Complete Model |
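
To read the ablation at a glance, the short script below computes each configuration's change relative to the full model. The numbers are copied from the table above; the dictionary layout and the helper `delta_vs_full` are illustrative.

```python
ABLATION = {
    "w/o L_loc":       {"clip_t": 22.11, "clip_i": 46.54, "sim": 35.01, "fid": 476.22},
    "w/o PIE":         {"clip_t": 23.19, "clip_i": 21.83, "sim": 14.15, "fid": 383.17},
    "w/o BGM":         {"clip_t": 19.34, "clip_i": 37.30, "sim": 33.42, "fid": 483.93},
    "w/o RB-Guidance": {"clip_t": 24.88, "clip_i": 66.10, "sim": 63.87, "fid": 210.45},
    "Full":            {"clip_t": 24.25, "clip_i": 73.08, "sim": 69.13, "fid": 213.48},
}

def delta_vs_full(rows, metric):
    """Difference of each ablated configuration from the full model on one metric."""
    full = rows["Full"][metric]
    return {name: round(vals[metric] - full, 2)
            for name, vals in rows.items() if name != "Full"}

for metric in ("clip_t", "sim", "fid"):
    print(metric, delta_vs_full(ABLATION, metric))
```

Removing RB-Guidance is the only ablation that slightly improves CLIP-T and FID, at the price of lower SIM; every other removal degrades all listed metrics.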

Key Findings¶

PIE (Identity Pivot) is the Foundation: Without it, SIM drops from 69.13 to 14.15 because RB-Guidance cannot locate or match features without a reliable initialization.
BGM Determines Quality: Masking gradients is essential to prevent blurring; without BGM, FID rises from 213.48 to 483.93 and CLIP-T drops from 24.25 to 19.34.
RB-Guidance Handles Details: It is the key to recovering fine-grained details, raising SIM from 63.87 to 69.13 at a small cost to text alignment (CLIP-T 24.88 vs. 24.25).
Timing \(u\) is Sensitive: \(u=0.6\) represents the optimal trade-off between stability and guidance effectiveness.
Dataset Contribution: The authors released OmniPortrait-1M, a million-scale dataset with fine-grained facial and body annotations, filling the gap for high-quality multimodal face data.

Highlights & Insights¶

Migration of PTI to Diffusion: The "pivotal optimization" paradigm avoids the dilemma of "modifying denoisers vs. ignoring details."
Training-Free RB-Guidance: Leveraging DIFT for dense matching and energy-based guidance makes it plug-and-play for various base models.
Dual-Purpose Localization Loss: Handles both training constraints and inference-time mask generation in one design.
Effective Gradient Masking: A simple solution to the common problem of background blurring in test-time optimization.

Limitations & Future Work¶

The method assumes a single face; robustness in multi-person or occlusion scenarios is not fully quantified.
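RB-Guidance increases inference overhead: each guided step adds feature matching and a gradient computation with respect to the latents, so sampling is likely slower than vanilla diffusion, and the cost is not quantified.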
Many hyperparameters (\(t_0\), \(u\), \(p\), \(\gamma\)) are set empirically, and their generalizability across different base models requires further verification.

Comparison with Prior Work¶

vs. InstantID: OmniPortrait uses a dual-stream coarse-to-fine scheme to exceed InstantID's identity fidelity without the significant drop in text alignment.
vs. FastComposer: By keeping the denoiser frozen, OmniPortrait avoids destroying pre-trained priors.
vs. IP-Adapter: Adds a test-time detail-recovery mechanism that breaks the detail ceiling of single-stream encoders.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Clear originality in applying PTI to the diffusion dual-stream paradigm.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive baselines and ablations, though it lacks quantitative multi-person and latency analysis.
Writing Quality: ⭐⭐⭐⭐ Logical flow and clear explanations, though hyperparameter sensitivity could be more systematic.
Value: ⭐⭐⭐⭐⭐ High practical value, both as a plug-and-play method for high-fidelity portrait generation and for its large-scale dataset.