Zero-Shot Head Swapping in Real-World Scenarios¶

Conference: CVPR 2025
arXiv: 2503.00861
Code: None
Area: Others
Keywords: Head Swapping, Zero-shot, Diffusion Models, Automatic Mask Generation, Hairstyle Injection

TL;DR¶

The paper proposes HID (Head Injection Diffusion), a zero-shot head swapping method. Its IOMask component automatically generates context-aware editing masks for seamless head-body fusion, and a hair injection module transfers hairstyle details precisely. The paper reports state-of-the-art results in real-world scenarios that include upper bodies and faces at multiple angles.

Background & Motivation¶

Unlike face swapping, which replaces only the facial identity (ID), head swapping must seamlessly fuse the entire head (facial ID, head shape, and hairstyle) from the head image onto the body image, which is significantly more complex.

Existing head swapping methods face three key limitations:
(1) Dependence on cropped face data: most methods (e.g., FaceX, REFace) operate only on centrally cropped face images, so pasting the swapped head back onto the full body easily produces mismatches such as residual long hair from the original image or inconsistent color.
(2) Inflexible masking: existing masks are optimized for cropped data and fail in complex scenarios, e.g., long hair extending beyond the cropped region.
(3) Poor viewpoint robustness: most methods handle only frontal views and do not support other angles such as profile views.

The core contribution of this work is a zero-shot method that performs head swapping directly on full images containing the upper body and automatically generates context-adaptive masks.

Method¶

Overall Architecture¶

HID is built upon PhotoMaker V2 and operates in two stages: (1) in the left stage, the ID Fusion model and the Hair Fusion model extract identity and hairstyle embeddings respectively, which then replace the corresponding parts of the text embedding; (2) in the right stage, DDIM inversion is applied to the body image to obtain its latent representation, IOMask determines the head region to edit, and the final result is generated by conditional denoising under ControlNet (OpenPose) constraints.

Key Designs¶

Key Design 1: IOMask (Inverted Orthogonal Mask)¶

Function: Automatically generates context-aware head editing masks without human annotation
Mechanism: Performs DDIM inversion on the body image up to a specific timestep \(t\), then computes the body-conditional noise \(\epsilon_\theta(C_b)\) and the head-conditional noise \(\epsilon_\theta(C_h, \varnothing)\) separately. The component of the latter that is orthogonal to the former is extracted: \(\epsilon_\theta^{orth} = \epsilon_\theta(C_h,\varnothing) - \frac{\langle\epsilon_\theta(C_b), \epsilon_\theta(C_h,\varnothing)\rangle}{\|\epsilon_\theta(C_b)\|^2}\epsilon_\theta(C_b)\). Gaussian filtering followed by thresholding at \(\tau\) then yields a binary mask.
Design Motivation: Direct subtraction \(\epsilon_\theta(C_h,\varnothing)-\epsilon_\theta(C_b)\) yields a map dominated by random noise. Removing the projection onto \(\epsilon_\theta(C_b)\) keeps values small where the two predictions point in the same direction (the body parts) and large where they differ (the head region to be replaced), which makes the result a more accurate indicator of the editing region.
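
The projection, smoothing, and thresholding steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two noise predictions are assumed to be already available as arrays of shape (C, H, W), the inner product is assumed to be taken over the whole tensor, and the channel reduction (mean absolute value), the min-max normalization, and the default values of `sigma` and `tau` are illustrative choices that the summary does not specify.

```python
import numpy as np

def orthogonal_component(eps_head, eps_body):
    """Subtract from eps_head its projection onto eps_body (global inner product)."""
    scale = np.vdot(eps_body, eps_head) / (np.vdot(eps_body, eps_body) + 1e-12)
    return eps_head - scale * eps_body

def gaussian_blur(img, sigma=2.0):
    """Separable Gaussian blur of a 2D array with edge padding (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows, then along columns; "valid" restores the original size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def io_mask(eps_head, eps_body, sigma=2.0, tau=0.5):
    """Binary (H, W) mask from two (C, H, W) noise predictions.

    sigma and tau are illustrative defaults, not values from the paper.
    """
    orth = orthogonal_component(eps_head, eps_body)
    magnitude = np.abs(orth).mean(axis=0)            # assumed channel reduction
    smooth = gaussian_blur(magnitude, sigma)
    smooth = (smooth - smooth.min()) / (smooth.max() - smooth.min() + 1e-12)
    return (smooth > tau).astype(np.float32)
```

In this sketch, regions where the head-conditioned prediction merely rescales the body-conditioned one contribute almost nothing after the projection is removed, so only regions that genuinely differ survive the threshold.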

Key Design 2: Hair Injection Module¶

Function: Precisely transfers hairstyle features from the head image to the generated result
Mechanism: Following the design of PhotoMaker V2, a Hair Fusion model is trained. It uses a Q-Former and an MLP to fuse hairstyle features from the CLIP image encoder with the hairstyle embedding extracted by a pre-trained Hair Encoder. The fused embedding then replaces the embedding at the position of the "hairstyle" token in the text embedding. During training, body parsing masks generated by SCHP are used so that only the person region is reconstructed.
Design Motivation: PhotoMaker V2 focuses on preserving facial ID and cannot guarantee accurate hairstyle transfer. With a dedicated hair injection module, the hairstyle information is expressed independently and precisely.
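
The replacement step (swapping a token's text embedding for the fused hairstyle embedding) can be sketched as follows. This covers only the substitution, not the Q-Former/MLP fusion itself. The function name `inject_embedding`, the token id used as the trigger, and the array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def inject_embedding(text_emb, token_ids, trigger_id, fused_emb):
    """Return a copy of text_emb with rows at the trigger token replaced.

    text_emb:   (L, D) per-token text embeddings
    token_ids:  length-L sequence of token ids
    trigger_id: id of the token to overwrite (hypothetical id for "hairstyle")
    fused_emb:  (D,) fused embedding to write at each matching position
    """
    out = text_emb.copy()
    positions = np.flatnonzero(np.asarray(token_ids) == trigger_id)
    out[positions] = fused_emb            # broadcasts over all matching rows
    return out, positions
```

The same pattern would apply to the identity embedding from the ID Fusion model, with a different trigger token.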

Key Design 3: Head Injection Diffusion Process¶

Function: Seamlessly fuses the new head with the original body during the denoising process
Mechanism: Denoising starts from the inverted latent \(\hat{z}_T\) obtained through DDIM inversion, and blending is performed at each step using the IOMask: \(z_{t-1} = \tilde{z}_{t-1} \odot \mathcal{M} + \hat{z}_{t-1} \odot (1-\mathcal{M})\). The region outside the mask always retains the inverted latent of the original body, while the region inside the mask is generated by denoising conditioned on the head.
Design Motivation: Starting from the inverted latent rather than pure noise keeps details such as skin tone and clothing consistent; blending at every step, rather than pasting once at the end, gives a natural transition at the boundary.
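
The per-step blending rule can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the inverted latents are assumed to have been stored for every step during DDIM inversion, and `denoise_fn` is a hypothetical stand-in for one head-conditioned denoising step (the real step would call the diffusion model with ControlNet guidance).

```python
import numpy as np

def blend_step(z_denoised, z_inverted, mask):
    """z_{t-1} = z~_{t-1} * M + z^_{t-1} * (1 - M); mask (H, W) broadcasts over channels."""
    return z_denoised * mask + z_inverted * (1.0 - mask)

def head_injection_loop(inverted_latents, mask, denoise_fn):
    """Denoise from the inverted latent while pinning the unmasked region.

    inverted_latents: list [z^_T, z^_{T-1}, ..., z^_0] saved during inversion
                      (assumed storage layout)
    denoise_fn(z, step): hypothetical callable returning the denoised latent
    """
    z = inverted_latents[0]                       # start from z^_T, not pure noise
    for step in range(1, len(inverted_latents)):
        z_tilde = denoise_fn(z, step)             # head-conditioned prediction
        z = blend_step(z_tilde, inverted_latents[step], mask)
    return z
```

Because the unmasked region is overwritten with the inverted latent at every step, the final latent outside the mask equals the last inverted latent in this sketch, while the masked region is fully determined by the conditioned denoiser.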

Loss & Training¶

The standard diffusion model loss is used during the training phase. When training the Hair Fusion module, a masked reconstruction loss based on the SCHP mask is added to ensure that only the person region is reconstructed.
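
One plausible form of a masked reconstruction loss is a mean squared error on the noise prediction restricted to the person region. The sketch below assumes that form. The normalization by the number of active elements and the function name `masked_diffusion_loss` are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def masked_diffusion_loss(noise_pred, noise_target, person_mask):
    """MSE between predicted and target noise, counted only inside the mask.

    noise_pred, noise_target: (C, H, W) arrays
    person_mask: (H, W) binary mask (e.g. from a human parser such as SCHP,
                 resized to latent resolution; the resizing is assumed)
    """
    sq_err = (noise_pred - noise_target) ** 2
    weighted = sq_err * person_mask               # background errors are zeroed
    n_active = person_mask.sum() * noise_pred.shape[0]
    return weighted.sum() / max(n_active, 1.0)    # guard against an empty mask
```

With this weighting, errors in the background contribute nothing to the loss, so the model is not penalized for failing to reconstruct regions outside the person.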

Key Experimental Results¶

Main Results: Quantitative Comparison (SHHQ Dataset)¶

Method	FID↓	Head LPIPS↓	Head CLIP-I↑	Hair LPIPS↓	Hair CLIP-I↑
REFace	40.72	0.0770	0.7867	0.0658	0.8563
HID (Ours)	37.19	0.0721	0.8512	0.0596	0.8707
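
To make the size of the gaps easier to read, the relative change per metric can be computed directly from the two rows above. The snippet only re-expresses the reported numbers; the helper name `relative_gain` is illustrative.

```python
# Values copied from the table above: (REFace, HID, which direction is better).
rows = {
    "FID": (40.72, 37.19, "lower"),
    "Head LPIPS": (0.0770, 0.0721, "lower"),
    "Head CLIP-I": (0.7867, 0.8512, "higher"),
    "Hair LPIPS": (0.0658, 0.0596, "lower"),
    "Hair CLIP-I": (0.8563, 0.8707, "higher"),
}

def relative_gain(reface, hid, better):
    """Fractional improvement of HID over REFace; positive means HID is better."""
    delta = reface - hid if better == "lower" else hid - reface
    return delta / reface

gains = {name: relative_gain(*vals) for name, vals in rows.items()}
for name, gain in gains.items():
    print(f"{name}: {100 * gain:.1f}%")
```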

Ablation Study¶

Configuration	Result
Full HID	Optimal head swapping, precise hairstyle transfer, and fully preserved body
w/o Hair Injection	Loss of hairstyle details (e.g., long hair becoming short)
w/o IOMask	Almost complete loss of body image information

IOMask Variant Comparison¶

IO Map Variant	Result
Naive IO map	Large amount of random noise covering irrelevant regions
w/o orthogonal	Fairly good but the region is not precise enough
Full IO map	Accurately focused on the head region requiring replacement

Key Findings¶

The orthogonal component calculation of IOMask significantly reduces noise artifacts compared to simple subtraction.
Generating from the inverted latent is key to maintaining body consistency.
Operating on full images including the upper body avoids the incongruency issues of cropping and pasting back.
HID surpasses REFace, the only zero-shot baseline compared, on all five reported metrics (FID, head and hair LPIPS, head and hair CLIP-I).

Highlights & Insights¶

Orthogonal component concept of IOMask: Utilizes the geometric relationship of noise predictions to automatically infer the editing region, which is more efficient than attention-map-based methods.
Problem Redefinition: Conducting head swapping on full images that include the upper body fits practical requirements much better than face-cropping methods.
Modular Design: ID injection and hairstyle injection are processed independently, with each focusing on its respective task.

Limitations & Future Work¶

Quantitative comparison is conducted only against REFace (the only open-source zero-shot method), so the baseline comparison is thin.
The IOMask threshold \(\tau\) must be tuned and may be sensitive to the scenario.
Hard cases involving occlusion, such as hats or headscarves, have not been explored.

The orthogonal component masking concept of IOMask can be generalized to other diffusion-based editing tasks (such as clothing replacement, accessory editing).
The design of the Hair Injection module can be applied to other generative tasks requiring precise hairstyle control.

Rating¶

⭐⭐⭐⭐ — Clear motivation with a cleverly designed IOMask that leverages the geometric properties of noise prediction in diffusion models. Advancing head swapping to real-world scenarios (non-cropped images) is a contribution of practical value. However, the comparative experiments are somewhat limited.