OSDFace: One-Step Diffusion Model for Face Restoration
Conference: CVPR 2025
arXiv: 2411.17163
Code: https://github.com/jkwang28/OSDFace
Area: Image Generation
Keywords: Face Restoration, One-Step Diffusion, Vector Quantization, Visual Prior, GAN Guidance
TL;DR
OSDFace proposes the first one-step diffusion model specifically designed for face restoration. It extracts rich prior information from low-quality faces with a Visual Representation Embedder (VRE) and combines this with a facial identity loss and GAN guidance, generating high-fidelity, natural, and identity-consistent face images in a single inference step (about 0.1 seconds). It outperforms existing SOTA methods on most reported metrics.
Background & Motivation
Background: Face restoration aims to restore high-quality (HQ) faces from low-quality (LQ) images corrupted by complex degradations such as blur, noise, downsampling, and JPEG compression. Current methods mainly fall into three categories: CNN/Transformer-based methods (e.g., RestoreFormer++), GAN-based methods (which suffer from unstable training and mode collapse), and diffusion-based methods (e.g., DifFace, DiffBIR, which offer good quality but slow inference).
Limitations of Prior Work: First, multi-step diffusion models incur high inference costs—PGDiff requires 1000 steps/85.8 seconds, and DiffBIR requires 50 steps/9.03 seconds. Second, existing methods perform poorly in terms of "harmony"; even when basic facial features (eyes, mouth) are restored, details such as hair and complex backgrounds remain unnatural. The root cause is the insufficient incorporation of facial priors: some methods (DifFace, DiffBIR) completely ignore face priors, while others (PGDiff) use priors but limit the generation capability. Third, general one-step diffusion restoration models (e.g., OSEDiff) are not specifically designed for faces; humans are extremely sensitive to facial features and can perceive even minor inconsistencies.
Key Challenge: The contradiction between fast inference (one-step diffusion) and high-quality face restoration (which requires rich priors); general image restoration models cannot meet the high standards of the specialized face domain.
Goal: To design a one-step diffusion model specifically for faces, achieving fast inference, high-fidelity restoration, and identity consistency simultaneously.
Key Insight: The authors observe that VQ-based prior methods (such as CodeFormer) can effectively utilize codebooks to capture facial features, but directly generating images using a codebook lacks details. Combining VQ priors with diffusion models—by using VQ to extract rich priors as conditioning for the diffusion model—can leverage the strengths of both.
Core Idea: Design a Visual Representation Embedder (VRE) to directly extract visual prior prompts from low-quality images (bypassing the information loss in image \(\rightarrow\) tag \(\rightarrow\) embedding), and combine it with facial identity loss and GAN distribution alignment to achieve high-quality, one-step face restoration.
Method
Overall Architecture
The training of OSDFace consists of two stages. First stage: Train the Visual Representation Embedder (VRE). Through self-reconstruction training of both HQ and LQ VQVAEs, a visual token dictionary is established, and the feature categories of the two domains are aligned. Second stage: Integrate the pre-trained VRE into Stable Diffusion. The LQ image is mapped to the latent space via the VAE encoder, the VRE extracts visual prompt embeddings, the UNet (fine-tuned only via LoRA) predicts the one-step noise, and the VAE decoder reconstructs the HQ face. The entire inference process takes only 0.1 seconds.
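The one-step inference path described above can be sketched in a few lines. This is a minimal illustration, assuming the standard DDPM noise-prediction relation between a noisy latent and its clean counterpart, which this summary does not spell out. The names `one_step_restore`, `predict_noise`, and `alpha_bar_t` are hypothetical, plain Python lists stand in for latent tensors, and an oracle predictor replaces the LoRA-tuned UNet and its VRE prompt.

```python
import math

def one_step_restore(z_lq, predict_noise, alpha_bar_t):
    """Estimate a clean latent from an LQ latent with a single noise prediction.

    Assumes the standard DDPM relation z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps,
    where a is the cumulative noise-schedule product at the fixed timestep.
    """
    eps = predict_noise(z_lq)              # one UNet call in the real model
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(z - sigma * e) / scale for z, e in zip(z_lq, eps)]

# Toy setup: treat the degradation as "noise" at a hypothetical alpha_bar value.
alpha_bar = 0.64
z_hq = [1.0, -2.0, 0.5]
residual = [0.3, -0.1, 0.7]
z_lq = [math.sqrt(alpha_bar) * h + math.sqrt(1.0 - alpha_bar) * r
        for h, r in zip(z_hq, residual)]

# Oracle predictor standing in for the UNet conditioned on VRE prompt embeddings.
restored = one_step_restore(z_lq, lambda z: residual, alpha_bar)
print([round(v, 6) for v in restored])
```

With a perfect noise prediction the clean latent is recovered exactly; in the real pipeline the result would then pass through the frozen VAE decoder.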
Key Designs
- Visual Representation Embedder (VRE):
  - Function: Extracts rich visual prior prompts from low-quality faces.
  - Mechanism: VRE consists of two parts. Visual Tokenizer: LQ VAE encoder \(E_L\) + VQ matching function \(\mathcal{M}\), which maps LQ faces to category tokens \(\mathbf{Q}_L = \mathcal{M}(E_L(I_L))\) in a learnable LQ dictionary \(\mathbb{C}_L = \{c_q \in \mathbb{R}^d\}_{q=1}^N\). VQ Embedder: Uses tokens as indices to look up the corresponding embedding vectors in the dictionary, \(z_k = \text{dict}(q)\), with a time complexity of \(\mathcal{O}(1)\). Unlike image-to-tag methods, VRE directly tokenizes faces into visual embeddings, avoiding the information loss from image \(\rightarrow\) tag \(\rightarrow\) embedding.
  - Design Motivation: Attention map visualizations show that VRE focuses on both facial and non-facial features (hair, background, etc.), capturing richer information than text tags.
- Feature Association Training Strategy:
  - Function: Aligns the VQ dictionary categories of the HQ and LQ domains.
  - Mechanism: Construct HQ and LQ VQ dictionaries, and train VQVAEs separately for self-reconstruction. Then, inspired by CLIP, construct a similarity matrix \(M_{\text{assoc}}\) of HQ and LQ encoded features, and use a cross-entropy loss to enhance diagonal correlation, guiding the attention of the LQ encoder to align with the HQ encoder. The association loss is \(\mathcal{L}_{\text{assoc}} = (\mathcal{L}_{\text{CE}}^H + \mathcal{L}_{\text{CE}}^L) / 2\).
  - Design Motivation: The LQ encoder may focus on meaningless categories due to severe degradation; cross-domain alignment ensures that LQ tokens correspond to meaningful HQ semantics.
- Facial Identity Loss and GAN Guidance:
  - Function: Ensures identity consistency and overall realism of the restored face.
  - Mechanism: Facial Identity Loss: Uses a pre-trained ArcFace model \(\mathcal{F}\) to extract identity embeddings from the ground truth (GT) \(I_H\) and the generated face \(\hat{I}_H\), then computes the cosine similarity loss \(\mathcal{L}_{\text{ID}} = 1 - \cos(\mathcal{F}(I_H), \mathcal{F}(\hat{I}_H))\). Perceptual Loss: Uses edge-aware DISTS (EA-DISTS), which adds DISTS computed on Sobel edge maps to standard DISTS to enhance texture and edge detail restoration. GAN Loss: Uses a discriminator to align the generated distribution with the real distribution in the diffusion latent space, providing more flexible training signals than distillation. The total loss is \(\mathcal{L}_{\text{gen}} = \lambda_{\text{dis}} \mathcal{L}_{\mathcal{G}} + \lambda_{\text{ID}} \mathcal{L}_{\text{ID}} + \lambda_{\text{per}} \mathcal{L}_{\text{EA-DISTS}} + \mathcal{L}_{\text{MSE}}\).
  - Design Motivation: Humans are extremely sensitive to faces; even minor inconsistencies are easily detected. Multi-aspect losses are required to ensure harmony: identity loss governs identity consistency, GAN loss governs distribution alignment, and perceptual loss governs texture details.
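The VRE's tokenize-then-look-up step and the CLIP-inspired association loss can be sketched as follows. This is a minimal sketch under stated assumptions: nearest-neighbor matching for \(\mathcal{M}\), L2-normalized features, and a temperature of 0.07 are common choices in VQ and contrastive training, not details confirmed by this summary. All function names are hypothetical, and plain lists stand in for encoder features.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def vq_tokenize(features, codebook):
    """Matching function M (assumed nearest-neighbor): feature -> token index."""
    return [min(range(len(codebook)), key=lambda q: squared_distance(f, codebook[q]))
            for f in features]

def vq_embed(tokens, codebook):
    """VQ Embedder: constant-time dictionary lookup per token."""
    return [codebook[q] for q in tokens]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)

def association_loss(hq_feats, lq_feats, temperature=0.07):
    """Symmetric loss (CE over HQ rows + CE over LQ rows) / 2 on the similarity matrix."""
    hq = [l2_normalize(h) for h in hq_feats]
    lq = [l2_normalize(l) for l in lq_feats]
    sim = [[sum(a * b for a, b in zip(h, l)) / temperature for l in lq] for h in hq]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim_t))

# Toy LQ dictionary with N = 3 entries of dimension d = 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
lq_features = [[0.9, 0.1], [0.1, 0.8], [-0.1, 0.05]]
tokens = vq_tokenize(lq_features, codebook)
prompts = vq_embed(tokens, codebook)
print(tokens)  # [1, 2, 0]

# Paired HQ/LQ features give a lower loss than mismatched pairs.
hq_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.2, 0.8]]
swapped = aligned[::-1]
print(association_loss(hq_batch, aligned) < association_loss(hq_batch, swapped))  # True
```

Minimizing this loss pushes mass onto the diagonal of the similarity matrix, which is the alignment effect the feature association strategy aims for.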
Loss & Training
First-stage VRE training loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_1 + \lambda_{\text{per}} \mathcal{L}_{\text{per}} + \lambda_{\text{dis}} \mathcal{L}_{\text{dis}} + \mathcal{L}_{\text{VQ}} + \lambda_{\text{assoc}} \mathcal{L}_{\text{assoc}}\), where \(\lambda_{\text{assoc}}\) is initially 0 and later set to 1. In the second stage, the VAE encoder/decoder and VRE are frozen, UNet is fine-tuned only via LoRA, and the generator and discriminator are trained alternately. During inference, a predefined fixed timestep \(T_L\) is used, and the LQ latent vector is directly used as the UNet input (not pure Gaussian noise).
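The identity term and the weighted generator objective from Key Designs reduce to simple arithmetic once the individual loss values are available. The sketch below assumes identity embeddings have already been extracted (short lists stand in for ArcFace outputs). The weight values are placeholders chosen for illustration, since this summary does not report them, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identity_loss(emb_gt, emb_restored):
    """L_ID = 1 - cos(F(I_H), F(I_H_hat)); 0 when the embeddings point the same way."""
    return 1.0 - cosine_similarity(emb_gt, emb_restored)

def generator_loss(l_gan, l_id, l_ea_dists, l_mse,
                   lambda_dis, lambda_id, lambda_per):
    """Weighted sum of GAN, identity, and perceptual terms plus an unweighted MSE term."""
    return lambda_dis * l_gan + lambda_id * l_id + lambda_per * l_ea_dists + l_mse

# Stand-in embeddings; a real ArcFace embedding has far more dimensions.
emb_gt = [0.6, 0.8]
emb_restored = [0.8, 0.6]
l_id = identity_loss(emb_gt, emb_restored)

# Placeholder loss values and weights, for illustration only.
total = generator_loss(l_gan=0.5, l_id=l_id, l_ea_dists=0.2, l_mse=0.01,
                       lambda_dis=1.0, lambda_id=1.0, lambda_per=1.0)
print(round(l_id, 4), round(total, 4))
```

In training, this generator objective would alternate with a discriminator update, as described above.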
Key Experimental Results
Main Results
| Method | Steps | LPIPS↓ | DISTS↓ | MUSIQ↑ | NIQE↓ | FID(HQ)↓ |
|---|---|---|---|---|---|---|
| CodeFormer | - | 0.3412 | 0.2151 | 75.94 | 4.52 | 26.86 |
| DifFace | 250 | 0.3469 | 0.2126 | 66.75 | 4.64 | 22.24 |
| DiffBIR | 50 | 0.3740 | 0.2340 | 75.64 | 6.28 | 32.51 |
| OSEDiff* | 1 | 0.3496 | 0.2200 | 69.98 | 5.33 | 37.13 |
| OSDFace | 1 | 0.3365 | 0.1773 | 75.64 | 3.88 | 17.06 |
Ablation Study
Figure 2 in the paper shows a comprehensive performance radar chart. OSDFace achieves the best performance across most of the 8 metrics: LPIPS, DISTS, MUSIQ, NIQE, Deg, LMD, FID(FFHQ), and FID(HQ), notably leading by a wide margin in DISTS (texture quality) and NIQE (naturalness).
Key Findings
- With roughly 0.1 seconds of inference and 2.132T MACs of computation, OSDFace outperforms the compared multi-step diffusion and non-diffusion methods on most metrics (CodeFormer retains a slight edge on MUSIQ, 75.94 vs. 75.64).
- Compared to the general one-step diffusion model OSEDiff, DISTS decreases from 0.2200 to 0.1773 (-19.4%), and FID(HQ) drops from 37.13 to 17.06 (-54.1%), which supports the value of face-specific design.
- Excellent zero-shot generalization performance on real-world datasets (WIDER, WebPhoto, LFW).
- Attention map visualizations confirm that VRE can simultaneously focus on facial features and non-facial regions like backgrounds and hair.
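The relative improvements over OSEDiff quoted in the findings follow directly from the table values, and the same arithmetic gives rough speedups over the multi-step baselines. A quick check, where `relative_reduction` is a helper name introduced here and the 0.1 s runtime is the approximate figure quoted in this summary:

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0

# OSEDiff vs. OSDFace, values taken from the main results table.
dists_drop = relative_reduction(0.2200, 0.1773)
fid_drop = relative_reduction(37.13, 17.06)
print(f"DISTS reduction:   {dists_drop:.1f}%")   # 19.4%
print(f"FID(HQ) reduction: {fid_drop:.1f}%")     # 54.1%

# Approximate speedups from the runtimes quoted under Limitations of Prior Work.
osdface_seconds = 0.1
for name, seconds in [("PGDiff", 85.8), ("DiffBIR", 9.03)]:
    print(f"{name}: about {seconds / osdface_seconds:.0f}x slower than OSDFace")
```

Since the 0.1 s figure is rounded, the speedup ratios are order-of-magnitude estimates rather than precise measurements.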
Highlights & Insights
- The design of VRE is highly elegant: it leverages a VQ dictionary to capture category-level priors and directly generates prompt embeddings from visual tokens, avoiding the information bottleneck of image-to-tag.
- The cross-domain feature association strategy enables the LQ dictionary to find meaningful semantic mappings even when the input is heavily degraded.
- It does not rely on distillation (no multi-step teacher model required), but instead uses GAN guidance to achieve distribution alignment, making training more flexible.
- Achieving 512×512 face restoration in 0.1 seconds holds great practical deployment value.
Limitations & Future Work
- The current model is designed specifically for faces; generalizing to other image types would require additional adaptation.
- The VQ dictionary size is fixed, which might lack coverage when facing extremely diverse facial styles.
- One-step diffusion has lower generation diversity than multi-step diffusion. This has little impact on deterministic tasks like face restoration, but could be restrictive in scenarios requiring high diversity.
- Performance under extreme degradation (e.g., extremely low resolution or severe occlusions) remains to be further verified.
Related Work & Insights
- Unlike CodeFormer (where VQ priors directly generate images) and PGDiff (where VQ priors act as diffusion guidance targets), OSDFace uses VQ priors as conditional prompts for the diffusion model, striking a better balance between flexibility and quality.
- The introduction of facial identity loss underscores the importance of identity preservation in face restoration tasks.
- The LQ dictionary representation in VRE may provide insights for other degraded-image understanding tasks (e.g., medical image enhancement).
Rating
- Novelty: ⭐⭐⭐⭐ — Clever VRE design; the combination of VQ prior and one-step diffusion is highly novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation with 8 metrics, comparisons against multiple baselines, and real-world zero-shot testing.
- Writing Quality: ⭐⭐⭐⭐ — Abundant diagrams, clear description of methods, and convincing visualization analyses.
- Value: ⭐⭐⭐⭐⭐ — High-quality face restoration in 0.1 seconds, offering direct industrial application value.