DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration¶
Conference: ICCV 2025 arXiv: 2507.13797 Code: Project Page Area: Diffusion Models / Face Restoration Keywords: Blind face restoration, diffusion model guidance, dynamic blur mapping, fidelity-quality balance, region-adaptive guidance
TL;DR¶
This paper proposes DynFaceRestore, which reformulates blind degradation as a Gaussian deblurring problem via Dynamic Blur Level Mapping (DBLM), and achieves an optimal fidelity-quality trade-off during diffusion model sampling through a Dynamic Starting Step lookup table (DSST) and a Dynamic Guidance Scaling Adjuster (DGSA).
Background & Motivation¶
Blind Face Restoration (BFR) aims to recover high-fidelity, detail-rich facial images from low-quality inputs with unknown degradation sources. The core challenge lies in simultaneously enhancing facial details and preserving identity consistency. While pre-trained diffusion models have been widely adopted as image priors for generating fine-grained details, existing methods exhibit three critical limitations:
Fixed diffusion starting step: Methods such as DifFace assume uniform degradation severity across all low-quality inputs and apply a fixed diffusion sampling starting timestep. This leads to "under-diffusion" (insufficient detail) for severely degraded images and "over-diffusion" (artifact introduction) for mildly degraded ones, as clearly demonstrated by the t-SNE visualization in Fig. 2 of the paper.
Kernel mismatch: Under blind settings, degradation kernel estimation is inherently imprecise, and real-world degradation kernels are highly complex (combining blur, downsampling, noise, JPEG compression, etc.). Modeling them as a single kernel introduces guidance bias during diffusion sampling and degrades restoration fidelity.
Global guidance scaling: Existing guidance-based methods (e.g., DPS, PGDiff) apply a uniform guidance scaling factor to all pixels. However, high-frequency regions (hair, wrinkles) benefit from stronger diffusion model influence for perceptual quality enhancement, whereas low-frequency regions (facial contours) require stronger observation guidance to preserve structural fidelity. A global scaling factor cannot reconcile this spatial conflict.
Method¶
Overall Architecture¶
The DynFaceRestore framework consists of three core components that collectively reformulate BFR as a Gaussian deblurring problem:

1. DBLM (Dynamic Blur Level Mapping): transforms unknown-degradation inputs into Gaussian-blurred images
2. DSST (Dynamic Starting Step lookup table): determines the optimal diffusion starting step based on the blur level
3. DGSA (Dynamic Guidance Scaling Adjuster): adaptively adjusts guidance intensity in a region-aware manner
The guided diffusion sampling process is formulated as:

\(x_{t-1} = x'_{t-1} - A_t \times \nabla_{x_t} \| \acute{y} - k_t \otimes x^0_t \|^2\)
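As a concrete illustration, here is a minimal PyTorch sketch of this guided update, assuming single-channel tensors, a differentiable blur via `F.conv2d`, and a precomputed scaling map `A_t` from DGSA; the function and argument names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def guided_step(x_t, x_prev_uncond, x0_pred, y_blur, k_t, A_t):
    """One region-adaptively guided reverse-diffusion step (sketch).

    x_t            : current noisy sample (requires_grad=True, in the graph of x0_pred)
    x_prev_uncond  : x'_{t-1}, the unguided reverse-step prediction
    x0_pred        : x^0_t, the denoiser's clean-image estimate at step t
    y_blur         : the Gaussian-blurred observation produced by DBLM
    k_t            : current Gaussian kernel estimate, shape (1, 1, kh, kw)
    A_t            : region-wise guidance scaling map in [0, 1] from DGSA
    """
    # Data-consistency term: squared L2 distance between the blurred observation
    # and the blurred clean-image estimate
    residual = y_blur - F.conv2d(x0_pred, k_t, padding="same")
    loss = residual.pow(2).sum()

    # Gradient of the data term with respect to the noisy sample x_t
    grad = torch.autograd.grad(loss, x_t)[0]

    # Region-adaptive correction: strong guidance where A_t is near 1 (structure),
    # weak guidance where A_t is near 0 (texture left to the diffusion prior)
    return x_prev_uncond - A_t * grad
```

Where DPS would scale `grad` by a single scalar, the DGSA map makes the correction spatially varying, which is exactly the fidelity-quality knob described under Key Designs below.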
Key Designs¶
- Dynamic Blur Level Mapping (DBLM): transforms the blind-degradation input \(y\) into the Gaussian-blurred form \(\acute{y} = k^{\hat{std^*}}_y \otimes RM(y)\), where \(RM\) can be any pre-trained restoration model (SwinIR is used in this work). The core idea is not to perfectly estimate the complex degradation kernel, but to convert it into a known Gaussian kernel form, thereby providing reliable guidance during diffusion sampling. The optimal standard deviation \(std^*\) is estimated by an SE network comprising two sub-modules: a Transfer Model (TM) and a Standard Deviation Estimator (SDE). A key advantage of DBLM is that even when \(RM\) produces imperfect restorations, re-applying Gaussian blur confines the residual error within a controllable range, effectively mitigating the kernel mismatch problem.
- Dynamic Starting Step Lookup Table (DSST): based on the key observation that a high-quality image \(x_0\) and its blurred counterpart \(\tilde{y}^{std}_0\) statistically converge at some timestep \(t\) of the forward diffusion. This convergence point serves as the optimal step at which to insert guidance from the blurred observation \(\acute{y}\):

  \(t_{std} = \underset{t}{\arg\min} \left\{ t \;\middle|\; \log(\mathbf{X}_t) - \log(\tilde{\mathbf{Y}}^{std}_t) \leq tol \right\}\)

  The optimal starting step for each \(std\) value is pre-computed and stored as a lookup table. At inference, \(t_{start}\) is retrieved directly using the SE-estimated \(\hat{std^*}\), avoiding both under- and over-diffusion; experiments show this reduces the sampling range from a fixed 1000 steps to [690, 925] (see the sketch after this list).
- Dynamic Guidance Scaling Adjuster (DGSA): a lightweight CNN (3 convolutional layers) that outputs a region-wise guidance scaling map \(A_t \in [0,1]\). Inputs include the current measurement \(\acute{y}\), the high-quality prediction \(x^0_t\), and the timestep \(t\). In high-frequency texture regions (hair, wrinkles), smaller \(A_t\) values are produced (weakening guidance so the diffusion model can freely generate details), whereas in low-frequency structural regions (contours, skin), larger \(A_t\) values are produced (strengthening guidance to preserve fidelity). Training employs stationary wavelet transform (SWT) sub-band supervision and a DISTS perceptual loss:

  \(L_{DGSA} = \sum_i \gamma_i \mathbb{D}(SWT(x^0_{t-1})_i, SWT(x_0)_i) + DISTS(x^0_{t-1}, x_0)\)
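To make the inference-time flow concrete, below is a rough sketch of how DBLM and the DSST lookup could be chained (this is the sketch referenced in the DSST item above). Here `restoration_model` stands in for the pre-trained SwinIR, `se_net` for the SE network that estimates the optimal Gaussian std, and `dsst` for the precomputed std-to-starting-step table; these interfaces are assumptions for illustration, not the released API.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def prepare_guidance(y, restoration_model, se_net, dsst, std_grid):
    """DBLM + DSST at inference time (sketch).

    y         : low-quality input image, shape (1, 3, H, W)
    dsst      : dict mapping each discretized std value to its precomputed starting step
    std_grid  : 1-D tensor of the std values the lookup table was built over
    """
    with torch.no_grad():
        x_restored = restoration_model(y)   # RM(y): coarse restoration (e.g. SwinIR)
        std_hat = se_net(y)                 # SE network: estimated optimal Gaussian std

    # DBLM: re-blur the restored image with a *known* Gaussian kernel, replacing the
    # unknown blind degradation with a well-defined Gaussian deblurring problem
    sigma = float(std_hat)
    ksize = int(2 * round(3 * sigma) + 1)   # odd kernel size covering roughly 3 sigma
    y_blur = gaussian_blur(x_restored, kernel_size=ksize, sigma=sigma)

    # DSST: snap the estimated std onto the table's grid and read off the starting step
    idx = int(torch.argmin((std_grid - std_hat).abs()))
    t_start = dsst[float(std_grid[idx])]
    return y_blur, sigma, t_start
```

Guided sampling then begins at `t_start` rather than at step 1000, with \(\acute{y}\) (here `y_blur`) supplying the data-consistency guidance from that point on.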
Loss & Training¶
- SE network: SDE is first pre-trained to estimate Gaussian blur levels, followed by end-to-end training of SE = TM + SDE
- DGSA: timestep \(t\) is randomly sampled; SWT sub-band L1 loss + DISTS perceptual loss is applied
- Kernel adaptive update: the estimated kernel is updated during sampling as \(std_{t-1} = std_t - s \nabla_{std_t} \| \acute{y} - k_t \otimes x^0_t \|^2\) (a minimal sketch follows this list)
- Multi-guidance extension: optionally generates 3 guided outputs at different blur levels (\(\hat{std^*}\), \(\hat{std^*}-1\), \(\hat{std^*}-2\)) and combines them with weighted aggregation to balance DM quality and RM fidelity
- The pre-trained diffusion model is shared with DifFace/PGDiff; training is conducted on the FFHQ dataset
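The kernel adaptive update from the list above lends itself to a short sketch. This assumes the Gaussian kernel is built differentiably from its std so autograd can provide \(\nabla_{std_t}\); `make_gaussian_kernel` is a hypothetical helper, and single-channel tensors are assumed for brevity.

```python
import torch
import torch.nn.functional as F

def make_gaussian_kernel(std, ksize=21):
    """Differentiable isotropic Gaussian kernel parameterized by its std (hypothetical helper)."""
    coords = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * std ** 2))
    g = g / g.sum()
    return torch.outer(g, g).view(1, 1, ksize, ksize)

def update_std(std_t, y_blur, x0_pred, step_size=1e-3):
    """One step of std_{t-1} = std_t - s * grad wrt std_t of the data term (sketch).

    std_t   : 0-dim tensor holding the current Gaussian std estimate
    y_blur  : the DBLM observation, shape (1, 1, H, W)
    x0_pred : the clean-image estimate at the current step, same shape
    """
    std_t = std_t.clone().requires_grad_(True)
    k_t = make_gaussian_kernel(std_t)                          # rebuild k_t from std_t
    residual = y_blur - F.conv2d(x0_pred, k_t, padding="same")
    loss = residual.pow(2).sum()                               # data-consistency term
    grad_std, = torch.autograd.grad(loss, std_t)
    return (std_t - step_size * grad_std).detach()
```

Computing the gradient through the kernel construction keeps the std update consistent with whatever blur operator is used in the guidance term.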
Key Experimental Results¶
Main Results¶
Quantitative comparison on CelebA-Test:
| Type | Method | PSNR↑ | SSIM↑ | FID↓ | IDA↓ | LMD↓ |
|---|---|---|---|---|---|---|
| GAN | GPEN | 23.77 | 0.659 | 30.25 | 0.837 | 6.377 |
| GAN | GFP-GAN | 22.84 | 0.620 | 23.86 | 0.822 | 4.793 |
| Codebook | CodeFormer | 23.83 | 0.637 | 18.08 | 0.775 | 3.509 |
| DM | DifFace | 23.95 | 0.659 | 15.03 | 0.867 | 3.781 |
| DM | DiffBIR | 24.13 | 0.647 | 19.19 | 0.767 | 3.535 |
| DM | 3Diffusion | 23.39 | 0.651 | 15.45 | 0.943 | 3.781 |
| DM | DynFaceRestore | 24.35 | 0.664 | 14.78 | 0.748 | 3.419 |
FID comparison on real-world datasets:
| Method | LFW↓ | WebPhoto↓ | Wider↓ |
|---|---|---|---|
| CodeFormer | 52.35 | 83.19 | 38.80 |
| DAEFR | 47.53 | 75.45 | 36.72 |
| DiffBIR | 43.45 | 91.20 | 36.72 |
| DynFaceRestore | 42.52 | 95.32 | 36.05 |
Ablation Study¶
Component-wise ablation on CelebA-Test:
| Setting | DBLM | Multi-guide | DSST | DGSA | PSNR↑ | FID↓ | IDA↓ | Sampling Range |
|---|---|---|---|---|---|---|---|---|
| Baseline | | | | | 11.13 | 55.78 | 1.461 | 1000 |
| A | ✓ | | | | 24.99 | 18.30 | 0.725 | 1000 |
| C | ✓ | ✓ | ✓ | | 25.11 | 19.79 | 0.724 | [690, 925] |
| E | ✓ | | ✓ | ✓ | 24.33 | 14.69 | 0.755 | [690, 925] |
| F | ✓ | ✓ | ✓ | ✓ | 24.35 | 14.78 | 0.748 | [690, 925] |
Key Findings¶
- DBLM is the most critical component, lifting PSNR from 11.13 (baseline) to 24.99
- DGSA substantially improves perceptual quality (FID reduced from 19.79 to 14.78) at a marginal cost to PSNR
- DSST improves metrics while shortening the sampling range (1000 → [690, 925]), simultaneously boosting both quality and efficiency
- The method achieves state-of-the-art performance on multiple fidelity metrics (PSNR/SSIM/IDA/LMD) and perceptual quality (FID) simultaneously, successfully balancing the fidelity-quality trade-off
Highlights & Insights¶
- Elegance of problem reformulation: The core insight of the approach is to transform the complex blind restoration problem into a tractable Gaussian deblurring problem. By introducing an intermediate Gaussian-blurred representation, reliable low-frequency information is preserved while a well-defined guidance form is provided to the diffusion model.
- Dynamic rather than static design throughout: The three components (DBLM, DSST, DGSA) adapt dynamically along three dimensions (degradation mapping, timestep selection, and guidance intensity), systematically addressing the one-size-fits-all limitations of prior methods.
- Region-adaptive guidance: The DGSA design captures the spatially non-uniform distribution of fidelity and quality requirements, representing an effective improvement over the DPS guidance formulation.
Limitations & Future Work¶
- Inference is substantially slower (91.82s, versus 0.06s for CodeFormer), limiting real-time applicability
- FID performance on the WebPhoto dataset is inferior to some competing methods
- Hyperparameters of multi-guidance (number of guides, \(std\) spacing) require manual specification and lack adaptive mechanisms
- The generalizability of DGSA as a separately trained network remains to be further validated
- Jointly optimizing the kernel update process (Eq. 7) and DGSA is a promising direction for future work
Related Work & Insights¶
- DPS (Diffusion Posterior Sampling) provides the theoretical foundation, upon which this work implements three key extensions
- The limitations of DifFace's fixed-step strategy directly motivated the design of DSST
- The choice of SwinIR as the restoration model reflects a pragmatic engineering philosophy of leveraging existing components rather than reinventing them
Rating¶
- Novelty: ⭐⭐⭐⭐ The problem reformulation is elegant, and the three dynamic components are well-motivated and mutually complementary
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluations cover both synthetic and real-world datasets with comprehensive ablation studies
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear and mathematical derivations are rigorous
- Value: ⭐⭐⭐⭐ Makes a significant contribution to diffusion-guided restoration, though inference speed limits practical deployment