Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

  • Conference: ICCV 2025
  • arXiv: 2507.04692
  • Code: https://github.com/wanchang-yu/Structure-Guided-Diffusion-for-Portrait-Shadow-Removal
  • Area: Image Generation
  • Keywords: Portrait Shadow Removal, Diffusion Inpainting, Structure Map Guidance, Detail Recovery, Relighting Data Synthesis

TL;DR

This paper formulates portrait shadow removal as a diffusion inpainting problem. It trains an illumination-invariant structure extraction network to obtain structure maps free of shadow boundaries, uses these maps to guide an inpainting diffusion model for shadow region restoration, and applies a gradient-guided detail recovery diffusion model to reconstruct fine facial details. The proposed method substantially outperforms existing approaches on benchmark datasets.

Background & Motivation

Practical Need: Selfie portraits are frequently degraded by shadows cast by external objects, which not only impair visual aesthetics but also hinder downstream tasks such as face detection and recognition. Professional-quality shadow removal remains inaccessible to general users.

Limitations of Prior Work:

  • General image shadow removal methods (ShadowDiffusion, HomoFormer): Ill-suited for portraits, which have zero tolerance for distortion — even minor facial color or detail artifacts are unacceptable.
  • Synthetic data training methods (PSM, BSR): Simple brightness/saturation adjustments ignore facial geometry and albedo, resulting in poor real-world generalization.
  • GAN inversion methods (UPSR): Leverage StyleGAN2 priors but tend to alter facial identity.
  • Relighting methods (IC-light): Estimating environmental lighting and facial geometry from a single image is inherently difficult and prone to unnatural results.

Core Idea: Shadow removal is reformulated as inpainting — the key challenge lies in obtaining a facial structure map free of shadow boundaries to guide the restoration.

Method

Overall Architecture (Three-Stage Pipeline)

  1. SE-Net: Extracts an illumination-invariant facial structure map from the shadowed image.
  2. Structure-Guided Inpainting Diffusion Model: Restores the shadow regions.
  3. Gradient-Guided Detail Recovery Diffusion Model: Recovers fine facial details.
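The three stages above compose naturally into a single pipeline. A minimal sketch of that orchestration follows; all model interfaces (`se_net`, `inpaint_model`, `detail_model`) and the function name are hypothetical stand-ins, not the paper's actual APIs:

```python
import numpy as np

def remove_portrait_shadow(image, shadow_mask, se_net, inpaint_model, detail_model):
    """Three-stage cascade sketch mirroring the paper's pipeline
    (hypothetical model interfaces)."""
    # Stage 1: illumination-invariant structure map from the shadowed image.
    structure = se_net(image)
    # Stage 2: structure-guided inpainting of the masked shadow region.
    coarse = inpaint_model(image, shadow_mask, structure)
    # Stage 3: gradient-guided refinement; gradients come from the original image.
    gradients = np.gradient(image.mean(axis=-1))
    return detail_model(coarse, shadow_mask, gradients)
```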

Key Design 1: Illumination-Invariant Structure Extraction (SE-Net)

Training Data Synthesis:

  • Shadow-free portraits \(I\) are selected from CelebA.
  • Physically-based relighting generates \(I_{relit}\).
  • A random facial mask \(M\) blends the two: \(I_{syn} = M \odot I_{relit} + (1-M) \odot I\).
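The blending step is a straightforward element-wise composite; the function name below is illustrative:

```python
import numpy as np

def synthesize_shadowed(portrait, relit, mask):
    """I_syn = M * I_relit + (1 - M) * I, element-wise.
    `mask` may be 2-D (H, W); it is broadcast over the channel axis."""
    if mask.ndim == portrait.ndim - 1:
        mask = mask[..., None]
    return mask * relit + (1.0 - mask) * portrait
```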

Training Objective: An existing edge extraction model PDG generates pseudo ground-truth from the original image. SE-Net is trained to produce the same structure map from the synthesized image containing illumination discontinuities:

\[\mathcal{L}_{total} = \mathcal{L}_{rec} + \lambda_1\mathcal{L}_{perceptual} + \lambda_2\mathcal{L}_{GAN}\]
\[\mathcal{L}_{rec} = \|G_s(I_{syn}) - G_p(I)\|_1, \quad \mathcal{L}_{perceptual} = \mathcal{L}_{LPIPS}(G_s(I_{syn}), G_p(I))\]
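A minimal NumPy sketch of the combined objective. The lambda weights are illustrative (the paper's values are not stated here), the LPIPS term is passed in as a callable, and a non-saturating generator loss is assumed for the GAN term:

```python
import numpy as np

def se_net_loss(pred, pseudo_gt, lpips_fn, disc_logits,
                lambda_perc=1.0, lambda_gan=0.1):
    """L_total = L_rec + lambda_1 * L_perceptual + lambda_2 * L_GAN.
    pred: SE-Net output G_s(I_syn); pseudo_gt: PDG output G_p(I).
    lpips_fn: callable returning an LPIPS distance (assumed interface).
    disc_logits: discriminator logits on pred (non-saturating GAN loss assumed)."""
    l_rec = np.abs(pred - pseudo_gt).mean()                  # ||G_s - G_p||_1
    l_perc = float(lpips_fn(pred, pseudo_gt))                # perceptual term
    l_gan = np.logaddexp(0.0, -disc_logits).mean()           # softplus(-logits)
    return l_rec + lambda_perc * l_perc + lambda_gan * l_gan
```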

Key Design 2: Structure-Guided Inpainting Diffusion Model

The denoising process is conditioned on the structure map \(S\) and the masked input \(I_M\):

\[x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\cdot\mathbf{e}_t}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\cdot\mathbf{e}_t\]
\[\mathbf{e}_t = \epsilon_\theta(x_t, I_M, S, M, t)\]
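The update above is the standard deterministic DDIM step (η = 0): recover the predicted clean image from the noise estimate, then re-noise it to the previous timestep. A minimal sketch, with the conditioning on \(I_M\), \(S\), and \(M\) folded into the precomputed noise prediction `eps`:

```python
import numpy as np

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0).
    eps is the network's noise prediction epsilon_theta(x_t, I_M, S, M, t)."""
    # Predicted clean image x0 from the noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise to timestep t-1 using the same noise direction.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps
```

With an exact noise prediction, the step maps \(x_t\) onto the same \((x_0, \epsilon)\) trajectory at \(t-1\), which is what the test below checks.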
  • During training, random masks are applied to shadow-free portraits to learn conditional reconstruction.
  • DDIM sampling is employed for acceleration; the structure map ensures structurally consistent restoration.
  • Mask Refinement: The difference between the inpainted result and the original image is thresholded via Otsu's method to produce a refined mask, excluding non-shadow regions from modification.
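The mask-refinement step can be sketched as follows. This is a from-scratch Otsu threshold on the per-pixel inpainting difference, not the paper's code:

```python
import numpy as np

def refine_mask(inpainted, original, bins=256):
    """Threshold |inpainted - original| with Otsu's method so that only
    regions the model actually changed (the shadow) remain in the mask."""
    diff = np.abs(inpainted - original).mean(axis=-1)  # per-pixel change
    hist, edges = np.histogram(diff, bins=bins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)              # class-0 probability up to each bin
    mu = np.cumsum(p * centers)    # cumulative mean up to each bin
    mu_total = mu[-1]
    # Between-class variance; the 0/0 at the extremes becomes NaN and is ignored.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * w0 - mu) ** 2 / (w0 * (1.0 - w0))
    threshold = centers[np.nanargmax(sigma_b)]
    return diff > threshold
```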

Key Design 3: Gradient-Guided Detail Recovery

The structure extraction model focuses on large-scale structure and may miss fine details such as eyelashes, moles, and freckles. The proposed solution:

  • Extracts image gradients from the original shadow region as a guidance condition.
  • Trains a second diffusion model conditioned on gradient maps to refine the inpainting result.
  • The forward/reverse processes share the same formulation as the structure-guided diffusion model, differing only in conditioning.
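The gradient-guidance condition can be illustrated with a central-difference gradient magnitude restricted to the shadow mask. This is a simplification for intuition; the paper's exact gradient operator is not specified here:

```python
import numpy as np

def gradient_guidance(image, mask):
    """Gradient magnitude of the (grayscale) image inside the shadow region,
    used as the conditioning signal for the detail-recovery model."""
    gray = image.mean(axis=-1)          # collapse channels to luminance
    gy, gx = np.gradient(gray)          # central-difference derivatives
    magnitude = np.hypot(gx, gy)
    return magnitude * mask             # keep gradients only inside the shadow
```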

Experiments

Quantitative Comparison (Real Portrait Shadow Dataset)

| Method | SSIM↑ | LPIPS↓ | RMSE↓ | Shadow SSIM↑ | Shadow RMSE↓ |
|---|---|---|---|---|---|
| ShadowDiffusion | 0.650 | 0.177 | 38.25 | 0.901 | 15.56 |
| Inpaint4Shadow | 0.766 | 0.094 | 22.71 | 0.910 | 17.25 |
| HomoFormer | 0.786 | 0.100 | 21.72 | 0.913 | 16.25 |
| IC-light | 0.514 | 0.278 | 62.54 | 0.916 | 33.06 |
| UPSR (GAN) | 0.731 | 0.109 | 26.92 | 0.900 | 20.66 |
| Ours (full) | 0.830 | 0.056 | 17.16 | 0.973 | 10.20 |

Ablation Study

| Configuration | SSIM↑ | LPIPS↓ | RMSE↓ |
|---|---|---|---|
| w/o structure guidance | 0.757 | 0.101 | 23.60 |
| PDG structure (with shadow boundaries) | 0.779 | 0.092 | 19.96 |
| PSM data synthesis strategy | 0.785 | 0.086 | 19.80 |
| w/o detail recovery | 0.814 | 0.062 | 18.54 |
| Full method | 0.830 | 0.056 | 17.16 |

Key Findings

  • The full method achieves an LPIPS of 0.056, substantially surpassing the second-best HomoFormer at 0.100.
  • Shadow-region SSIM reaches 0.973 (near-perfect), demonstrating that structure guidance ensures precise restoration.
  • Illumination invariance of SE-Net is critical: using PDG structure maps containing shadow boundaries yields only 0.779 SSIM.
  • The detail recovery diffusion model reduces LPIPS from 0.062 to 0.056, confirming the contribution of fine details such as eyelashes.
  • The method is robust to imprecise shadow masks: rough coverage of the shadow region is sufficient, since mask refinement excludes non-shadow areas from modification.

Highlights & Insights

  1. Elegant problem formulation: Recasting shadow removal as inpainting fully exploits the generative capacity of diffusion models.
  2. Unsupervised training: Relighting-based data synthesis eliminates the need for real paired shadowed/shadow-free portrait data.
  3. Three-stage cascade: Each stage addresses a distinct core problem — structure extraction → region restoration → detail recovery.
  4. Mask refinement strategy enhances robustness to imprecise input masks.

Limitations & Future Work

  • Inference requires three cascaded models, resulting in slower processing.
  • The method depends on shadow mask input, necessitating an additional shadow detection step.
  • A domain gap remains between synthesized training data and real-world shadows.

Related Work
  • Shadow Removal: ShadowFormer, HomoFormer, Inpaint4Shadow
  • Portrait Shadow: PSM, BSR, UPSR
  • Diffusion Inpainting: RePaint, MAT

Rating

  • Novelty: ⭐⭐⭐⭐ — Reformulating shadow removal as inpainting offers a fresh perspective.
  • Technical Depth: ⭐⭐⭐⭐ — The three-stage design is tightly integrated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablation on real-world data.
  • Practical Value: ⭐⭐⭐⭐ — Applicable to selfie enhancement and face recognition preprocessing.