
AdcSR: Adversarial Diffusion Compression for Real-World Image Super-Resolution¶

Conference: CVPR 2025
arXiv: 2411.13383
Code: Guaishou74851/AdcSR
Institution: Peking University / The Hong Kong Polytechnic University / OPPO Research Institute
Area: Image Restoration / Super-Resolution
Keywords: Real-world ISR, diffusion compression, adversarial distillation, one-step diffusion, GAN, model pruning

TL;DR¶

An Adversarial Diffusion Compression (ADC) framework is proposed to distill the one-step diffusion model OSEDiff into a streamlined diffusion-GAN hybrid model. This achieves a 73% reduction in inference time, a 78% reduction in computational cost, and a 74% reduction in parameters while maintaining generative quality, reaching real-time super-resolution at 34.79 FPS.

Background & Motivation¶

Background: Real-world image super-resolution (ISR) methods built on Stable Diffusion (SD), such as StableSR, DiffBIR, and SeeSR, have achieved significant success, but their multi-step inference hinders practical deployment. Recent one-step networks (OSEDiff, S3Diff) reduce latency but still carry heavy computational overhead because they rely on large pre-trained SD models.

Limitations of Prior Work:
- Multi-step diffusion methods (StableSR, etc.) are extremely slow at inference, needing dozens to hundreds of iterations.
- One-step diffusion methods (OSEDiff) still require the complete SD model: VAE encoder, prompt extractor, text encoder, and full UNet, totaling 1311M parameters and taking 0.11 s per image.
- No existing method both maintains quality and achieves a substantial speedup.

Key Challenge: Directly pruning or removing SD modules severely degrades generative capability, while retaining the entire model fails to meet real-time requirements.

Key Insight: A systematic analysis of OSEDiff sorts its modules into removable ones (VAE encoder, prompt extractor, text encoder, etc.) and prunable ones (UNet, VAE decoder). A two-stage scheme then compresses the model progressively.

Core Idea: Module removal + channel pruning + stage-wise adversarial distillation = real-time diffusion super-resolution.

Method¶

Adversarial Diffusion Compression (ADC) Framework¶

Module Classification¶

The modules of OSEDiff are classified into two categories:
- Removable: VAE encoder, prompt extractor (RAM/BLIP2), text encoder, cross-attention modules, and timestep embedding.
- Prunable: Denoising UNet (retaining the first 75% of channels) and VAE decoder (retaining the first 50% of channels).
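To make "retaining the first 75% (or 50%) of channels" concrete, here is a minimal pure-Python sketch. It assumes that pruning simply slices off the trailing output and input channels of each layer; the function name `prune_first_channels` and the toy 8x8 layer are hypothetical and are not taken from the AdcSR code.

```python
def prune_first_channels(weight, keep_ratio):
    """Keep the leading fraction of output and input channels.

    `weight` is a list of rows, one row per output channel, each row holding
    one value per input channel (a 1x1 convolution viewed as a matrix).
    Assumption: pruning is plain slicing, with no importance ranking.
    """
    out_keep = max(1, int(len(weight) * keep_ratio))
    in_keep = max(1, int(len(weight[0]) * keep_ratio))
    return [row[:in_keep] for row in weight[:out_keep]]


# Toy 8x8 layer: each entry encodes (out_channel, in_channel) for inspection.
layer = [[10 * o + i for i in range(8)] for o in range(8)]

unet_like = prune_first_channels(layer, 0.75)     # keeps 6 of 8 channels
decoder_like = prune_first_channels(layer, 0.50)  # keeps 4 of 8 channels

print(len(unet_like), len(unet_like[0]))
print(len(decoder_like), len(decoder_like[0]))
```

Because both dimensions shrink, the weight count of an interior layer falls roughly with the square of the keep ratio (36 of 64 entries at 75%, 16 of 64 at 50%), which is why channel pruning cuts parameters faster than the ratio alone suggests.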

Stage 1: Pre-training Pruned VAE Decoder¶

A VAE decoder with 50% channel pruning is trained from scratch on OpenImage + LAION-Face + LAION-Aesthetic.
Trained for 250K steps on OpenImage, then 250K steps on LAION-Face/Aesthetic.
Loss: L1 + LPIPS + patch-based adversarial loss.
Restores the image decoding capability of the pruned VAE decoder.

Stage 2: Adversarial Distillation¶

Jointly fine-tunes the UNet with 25% channel pruning and the first-layer block of the VAE decoder pre-trained in Stage 1.
Conducts adversarial training using a LoRA-adapted discriminator.
Loss: \(\mathcal{L} = \mathcal{L}_{distill} + \lambda_{adv} \mathcal{L}_{adv}\)

Key Technical Details¶

Temporal Direction Alignment (TDA)
- After removing the text encoder and timestep embedding, the timestep is fixed to \(t=1\) and the noise is set to \(\epsilon = 0\).
- A Straight-Through Estimator is used to handle the non-differentiable clipping operation.
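The straight-through idea can be illustrated without any autograd library. The sketch below assumes a clipping range of [-1, 1], which the notes do not state, and writes the forward and backward rules by hand; all function names are hypothetical.

```python
def clip(x, lo=-1.0, hi=1.0):
    """Forward pass: ordinary clipping."""
    return max(lo, min(hi, x))


def clip_true_grad(x, upstream, lo=-1.0, hi=1.0):
    """Exact derivative of clip: zero wherever the input is out of range."""
    return upstream if lo <= x <= hi else 0.0


def clip_ste_grad(x, upstream, lo=-1.0, hi=1.0):
    """Straight-through estimator: backward treats clip as the identity."""
    return upstream


xs = [-1.7, -0.4, 0.2, 1.3]
forward = [clip(x) for x in xs]
true_grads = [clip_true_grad(x, 1.0) for x in xs]
ste_grads = [clip_ste_grad(x, 1.0) for x in xs]

print(forward)
print(true_grads)
print(ste_grads)
```

With the exact derivative, out-of-range values receive no learning signal; the straight-through variant keeps gradients flowing for them. In PyTorch the same effect is commonly written as `x + (x.clamp(lo, hi) - x).detach()`.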
UNet-VAE Connection Optimization
- Connects the UNet output directly to the lowest-resolution first-layer block of the VAE decoder.
- Uses a PixelUnshuffle layer (scale factor of 2) to align the resolution of the latent space and LR image space.
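For readers unfamiliar with PixelUnshuffle, the following sketch shows the rearrangement on nested lists. It follows the channel ordering of `torch.nn.PixelUnshuffle`; the helper name `pixel_unshuffle` and the 1x4x4 toy input are illustrative assumptions, and how the layer is wired into AdcSR is not shown here.

```python
def pixel_unshuffle(image, r):
    """Rearrange a C x H x W nested list into (C*r*r) x (H/r) x (W/r).

    Ordering matches torch.nn.PixelUnshuffle:
    out[c*r*r + i*r + j][h][w] = image[c][h*r + i][w*r + j]
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for c in range(channels):
        for i in range(r):
            for j in range(r):
                out.append([
                    [image[c][h * r + i][w * r + j] for w in range(width // r)]
                    for h in range(height // r)
                ])
    return out


# One channel, 4x4 pixels, values 0..15 in row-major order.
lr = [[[4 * h + w for w in range(4)] for h in range(4)]]
packed = pixel_unshuffle(lr, 2)

print(len(packed), len(packed[0]), len(packed[0][0]))
print(packed[0])
```

The operation is lossless: spatial resolution drops by the factor `r` per side while the channel count grows by `r*r`, so no pixel information is discarded when matching a lower-resolution feature grid.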
Discriminator Design
- A lightweight discriminator based on LoRA with a LoRA rank of 4.
- Learning rate of 1e-6.
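As a rough illustration of why a rank-4 LoRA keeps the discriminator lightweight, the sketch below applies a low-rank update to a single frozen linear layer. The notes only give the rank; the layer sizes, the zero initialization of `lora_b`, the absence of a scaling factor, and all names are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_linear(x, base_w, lora_a, lora_b):
    """y = W x + B (A x): frozen base weight plus a trainable low-rank update."""
    base_out = matvec(base_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + u for b, u in zip(base_out, update)]


def lora_param_count(in_features, out_features, rank):
    """Trainable parameters in A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


rank, in_f, out_f = 4, 8, 8
base_w = [[1.0 if i == o else 0.0 for i in range(in_f)] for o in range(out_f)]  # frozen
lora_a = [[0.01 * (r + i) for i in range(in_f)] for r in range(rank)]           # trainable
lora_b = [[0.0] * rank for _ in range(out_f)]                                   # trainable, zero init

x = [float(i) for i in range(in_f)]
y = lora_linear(x, base_w, lora_a, lora_b)

print(y == x)                         # zero-initialized B leaves the base output unchanged
print(lora_param_count(320, 320, 4))  # trainable weights for a hypothetical 320x320 layer
print(320 * 320)                      # weights in the full layer, for comparison
```

For the hypothetical 320x320 layer, the adapter trains 2,560 weights instead of 102,400, so only a small fraction of the discriminator needs gradients and optimizer state.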

Key Experimental Results¶

Efficiency Comparison¶

Method	Steps	Inference Time (s)	FLOPs (G)	Params (M)	FPS
StableSR	200	11.50	—	—	—
DiffBIR	50	2.72	—	—	—
OSEDiff	1	0.11	513	1311	—
S3Diff	1	0.28	—	—	—
AdcSR	1	0.03	111	340	34.79

Speedup ratio: vs. StableSR 383×, vs. OSEDiff 3.7×, vs. S3Diff 9.3×.
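The ratios and percentages quoted in these notes can be re-derived from the efficiency table. The short script below does only that arithmetic; the dictionary and function names are hypothetical, and the inputs are the rounded values from the table, so results match the quoted figures only after rounding.

```python
# Inference times in seconds, copied from the efficiency table.
latency_s = {
    "StableSR": 11.50,
    "DiffBIR": 2.72,
    "OSEDiff": 0.11,
    "S3Diff": 0.28,
    "AdcSR": 0.03,
}


def speedup(baseline, ours="AdcSR"):
    """How many times faster `ours` is than `baseline`."""
    return latency_s[baseline] / latency_s[ours]


def reduction_pct(teacher, student):
    """Relative reduction of a cost going from teacher to student, in percent."""
    return 100.0 * (1.0 - student / teacher)


for name in ("StableSR", "OSEDiff", "S3Diff"):
    print(f"{name}: {speedup(name):.1f}x")

print(f"time:   {reduction_pct(0.11, 0.03):.0f}% lower")
print(f"FLOPs:  {reduction_pct(513, 111):.0f}% lower")
print(f"params: {reduction_pct(1311, 340):.0f}% lower")
print(f"latency at 34.79 FPS: {1000 / 34.79:.1f} ms")
```

The last line shows that 34.79 FPS corresponds to roughly 28.7 ms per image, which the table rounds to 0.03 s.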

Restoration Quality Comparison (DRealSR)¶

Method	PSNR↑	SSIM↑	LPIPS↓	DISTS↓
StableSR	27.63	—	0.3317	—
OSEDiff	28.02	—	0.3087	0.2239
AdcSR	28.10	—	0.3046	0.2200

Ablation Study¶

Configuration	PSNR↑	LPIPS↓	Params (M)	Time (s)
Keep VAE Encoder	27.97	0.3077	490	0.05
Remove VAE Encoder	28.10	0.3046	456	0.03

Configuration	FID↓	MUSIQ↑	MANIQA↑	CLIPIQA↑
Without connection optimization	140.09	65.18	0.5807	0.6756
With connection optimization	134.05	66.26	0.5927	0.7049

Loss & Training¶

Hardware & Scale¶

8× NVIDIA A100 (80GB)
Batch size of 96
Stage 1: 250K steps on OpenImage + 250K steps on LAION-Face/Aesthetic
Stage 2: 200K steps on LSDIR
Real-ESRGAN degradation pipeline to synthesize LR-HR pairs

Degradation Pipeline¶

Uses the high-order degradation model from Real-ESRGAN, including multiple stages of degradation such as blur, downsampling, noise, and JPEG compression.
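The shape of such a pipeline can be sketched with a toy grayscale example. This is not the Real-ESRGAN implementation: the real pipeline randomizes kernels, resize modes, and noise types and applies actual JPEG compression, whereas the sketch below assumes a fixed 3x3 box blur, stride-2 subsampling, Gaussian noise, and uniform quantization as a crude stand-in for JPEG. All function names and parameter values are hypothetical.

```python
import random


def box_blur(img):
    """3x3 mean filter with edge replication."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row.append(sum(vals) / 9.0)
        out.append(row)
    return out


def downsample(img, factor):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


def add_noise(img, sigma, rng):
    """Additive Gaussian noise, clamped back to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row] for row in img]


def quantize(img, levels):
    """Crude stand-in for JPEG compression: uniform quantization."""
    return [[round(v * (levels - 1)) / (levels - 1) for v in row] for row in img]


def degrade(hr, order=2, seed=0):
    """Apply blur -> downsample -> noise -> compression `order` times."""
    rng = random.Random(seed)
    img = hr
    for _ in range(order):
        img = box_blur(img)
        img = downsample(img, 2)
        img = add_noise(img, 0.02, rng)
        img = quantize(img, 32)
    return img


# 16x16 synthetic "HR" image with values in [0, 1].
hr = [[((x + y) % 16) / 15.0 for x in range(16)] for y in range(16)]
lr = degrade(hr)

print(len(hr), len(hr[0]), "->", len(lr), len(lr[0]))
```

Repeating the degradation block is what makes the model "high-order": two passes with a stride of 2 each give the 4x resolution gap between the synthetic LR input and the HR target in this toy setup.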

Highlights & Insights¶

Systematic module classification scheme: Rather than pruning indiscriminately, the method first analyzes the role of each module and then decides whether to remove or prune it.
Removing the VAE encoder actually improves quality (PSNR +0.13), indicating that redundant components might introduce noise.
34.79 FPS is reported as the first real-time speed achieved by a diffusion-based super-resolution model.
Key to the two-stage design: first restore the VAE decoding capability, then conduct end-to-end distillation, which avoids the performance collapse caused by compressing everything in a single step.
The Straight-Through Estimator addresses the non-differentiability of clipping, which is a noteworthy engineering detail.
AdcSR scores better than its teacher OSEDiff on LPIPS and DISTS (0.3046 vs. 0.3087 and 0.2200 vs. 0.2239 on DRealSR), a "student surpasses the master" effect after distillation.