GenDR: Lighten Generative Detail Restoration
Conference: ICLR 2026
arXiv: 2503.06790
Code: None
Area: Image Generation
Keywords: One-step super-resolution, latent space expansion, score distillation, 16-channel VAE, consistency distillation
TL;DR
GenDR is a lightweight one-step diffusion super-resolution (SR) model for generative detail restoration. It starts from a divergence between text-to-image (T2I) and SR objectives: T2I favors multi-step sampling with 4-channel latents, whereas SR needs fewer steps and 16-channel latents. The authors build a custom 0.9B base model, SD2.1-VAE16, using REPA representation alignment to expand the latent space without increasing model size. They also propose CiD/CiDA (consistency score identity distillation), which integrates SR-specific priors into score distillation and, in CiDA, adds adversarial learning and representation alignment. The resulting pipeline consists only of a UNet and a VAE, runs in 77ms, and leads prior methods on most perceptual quality metrics while being faster.
Background & Motivation
Background: Diffusion-based real-world SR has made significant progress and surpasses GAN-based methods in quality. However, these methods suffer from slow inference and limited detail fidelity.
Key Challenge: There is a fundamental divergence between text-to-image (T2I) and SR tasks. T2I generates complete images from noise, requiring multi-step inference and low-dimensional latent spaces (4-channel VAE to reduce generation difficulty). Conversely, SR only needs to supplement high-frequency details, requiring fewer steps but larger latent spaces (16-channel VAE) to preserve input information.
Limitations of Prior Work: Accelerating inference (e.g., OSEDiff one-step distillation) leads to significant quality degradation. Improving quality (e.g., DreamClear using PixArt-α + ControlNet) introduces massive computational overhead, creating a dilemma between quality and efficiency.
Model Scale: Existing diffusion models with 16-channel VAEs (e.g., FLUX at 12B, SD3.5) are oversized for SR. One-step 4× SR with FLUX requires >40GB VRAM and 1.4s, which is 5.3× and 11.4× the cost of SD2.1, respectively.
Distillation Flaws: Current score distillation methods (VSD/SiD) are designed for T2I. Direct application to SR causes quality or content inconsistency due to training distribution mismatches and over-reliance on imperfect score functions.
Key Insight: SR requires a customized base model (16-channel + appropriate 0.9B scale) + a tailored distillation method (CiD incorporating SR priors) + a minimalist inference pipeline.
Method
Overall Architecture
GenDR decomposes "customized diffusion for SR" into three stages. First, the SD2.1 UNet is paired with a 16-channel VAE and trained into a 0.9B base model (SD2.1-VAE16) so that the latent space can hold the detail SR requires. Second, CiD/CiDA consistency distillation compresses multi-step sampling into a single step while injecting SR image priors into the distillation process. Finally, the scheduler and text encoder are removed, leaving a minimalist pipeline (VAE + UNet) with 77ms inference. The four key designs below follow the same order, "Base Model → Distillation → Inference Pipeline", with the distillation stage covering two of them (CiD and CiDA).
graph TD
HRX["HR Training Image"] --> BASE["SD2.1-VAE16 Base Model<br/>16-channel VAE + 0.9B UNet + REPA Alignment"]
BASE --> DISTILL
subgraph DISTILL["CiD / CiDA Consistency Distillation (Step Compression + SR Priors)"]
direction TB
CID["CiD: GT-calibrated Score Network<br/>+ Identity Transform"] --> CIDA["CiDA: + Adversarial Head<br/>+ REPA Semantic Alignment"]
end
DISTILL -->|One-step Weights| INFER
subgraph INFER["Minimalist Inference Pipeline (VAE16 + UNet, 77ms)"]
direction TB
LR["LR Input"] --> ENC["VAE16 Encoder"]
ENC --> UNET["One-step UNet<br/>Fixed Prompt Embedding"]
UNET --> DEC["VAE16 Decoder"]
end
INFER --> OUT["HR Output Image"]
Key Designs
1. SD2.1-VAE16: Restoring SR Details via 16-Channel Latent Space
The essence of SR is supplementing high-frequency details rather than generating from noise, so preserving the input information in the latent space is critical. 4-channel VAEs designed for T2I use irreversible compression that discards fine textures. 16-channel DiTs such as FLUX and SD3.5 have the capacity, but their scale (12B for FLUX) is excessive for 4× SR. The authors replace the VAE of SD2.1 with an open-source 16-channel version and perform full-parameter training to obtain a 0.9B base model. Representation Alignment (REPA) provides semantic supervision: an MLP head \(h\) after the first UNet downsampling block aligns intermediate features \(\mathbf{h}_t = f_\theta(\mathbf{z}_t)\) to pretrained DINOv2 HR representations \(\mathbf{h}_\mathcal{E} = \mathcal{E}(\mathbf{x}_h)\).
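For reference, the original REPA formulation maximizes a similarity between the projected intermediate features and the frozen encoder's features. Written in this section's symbols, a plausible form is the following. This is an assumption based on standard REPA; GenDR's exact patch averaging, normalization, and loss weight are not given in this summary.

\[
\mathcal{L}_{\text{REPA}} = -\,\mathbb{E}_{\mathbf{x}_h,\,t}\left[\operatorname{sim}\!\left(\mathbf{h}_\mathcal{E},\; h(\mathbf{h}_t)\right)\right]
\]

Here \(\operatorname{sim}(\cdot,\cdot)\) denotes cosine similarity, so minimizing the loss pulls \(h(\mathbf{h}_t)\) toward the DINOv2 representation of the HR image.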
This expands the channels without increasing model parameters, yielding significant detail gains in SR despite a minor T2I regression.
2. CiD: Injecting SR Priors into Score Distillation for Stable One-step Output
Score distillation methods like VSD/SiD, designed for T2I, align with text embedding distributions. Direct application to SR results in content drift and inconsistent quality. CiD modifies SiD in two ways: first, it trains the "real" score network \(\phi\) using HR targets \(\mathbf{z}_h\) to anchor the output distribution on the high-fidelity image manifold; second, it replaces fluctuating generation results \(\mathbf{z}_g\) in the distillation target with stable HR targets \(\mathbf{z}_h\) for identity transformation. The final loss adds a CFG-guided term \(\mathcal{L}_\theta^{(3)}\) targeting \(\mathbf{z}_h\) to the original SiD term \(\mathcal{L}_\theta^{(1)}\).
CiD effectively integrates SR image priors, improving Q-Align from 4.391 to 4.428 over SiD.
3. CiDA: Adversarial Learning for Realism and REPA for Stability
Pure score distillation often produces over-smoothed "AI-looking" details. CiDA adds two components to CiD. An adversarial term uses the pretrained UNet \(\phi\) as a feature extractor with a discriminator head \(h\) to push results toward the real texture distribution. REPA provides high-level semantic regularization to prevent structural distortion during adversarial training and to speed up convergence.
VRAM cost is controlled by using LoRA (rank=64) for the discriminator and sharing the base model for feature extraction. Q-Align further improves to 4.453.
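To give a sense of why a rank-64 LoRA keeps the discriminator cheap, the sketch below counts trainable parameters for one linear layer under LoRA versus full fine-tuning. The layer size (1280×1280) is a hypothetical value chosen for illustration, not a figure from the paper, and the function names are illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned (bias ignored)."""
    return d_in * d_out


# Hypothetical projection size, for illustration only.
D_IN = D_OUT = 1280
RANK = 64  # rank quoted in the text

lora_count = lora_trainable_params(D_IN, D_OUT, RANK)
full_count = full_trainable_params(D_IN, D_OUT)
lora_fraction = lora_count / full_count

print(f"LoRA: {lora_count} params, full: {full_count} params, "
      f"fraction: {lora_fraction:.1%}")
```

For this assumed layer, LoRA trains one tenth of the weights, and the frozen base weights can be shared with the feature extractor instead of being duplicated.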
4. Minimalist Inference Pipeline: Eliminating Redundant Components
For one-step inference, the multi-step scheduler is redundant and replaced by fixed \(\bar{\alpha}_t = \bar{\beta}_t = 0.5\). The text encoder is replaced by pre-computed fixed prompt embeddings, as a general quality description suffices for SR. This reduces parameters from 1775M to 933M and inference time from over 113ms to 77ms (A100, 512²), with negligible quality loss compared to dynamic prompt generation.
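A minimal sketch of what a scheduler-free single step could look like with the fixed coefficients above. It assumes an epsilon-prediction UNet and the standard DDPM relation between a noisy latent and its clean estimate, and it reads \(\bar{\beta}_t\) as \(1 - \bar{\alpha}_t\). Neither assumption is confirmed by this summary, and `one_step_restore` is a hypothetical helper name.

```python
import math

ALPHA_BAR = 0.5             # fixed value quoted in the text
BETA_BAR = 1.0 - ALPHA_BAR  # assumption: beta_bar denotes 1 - alpha_bar


def one_step_restore(z_lr, eps_pred):
    """Estimate clean latent values from LR latent values and predicted noise.

    Uses z0 = (z_t - sqrt(beta_bar) * eps) / sqrt(alpha_bar), the standard
    DDPM clean-sample estimate, applied element-wise to flat lists of floats.
    """
    sqrt_alpha = math.sqrt(ALPHA_BAR)
    sqrt_beta = math.sqrt(BETA_BAR)
    return [(z - sqrt_beta * e) / sqrt_alpha for z, e in zip(z_lr, eps_pred)]


# Toy latent values and a toy noise prediction (stand-ins for UNet output).
restored = one_step_restore([1.0, 2.0], [0.5, 0.0])
print(restored)
```

Because both coefficients are constants, the update reduces to `sqrt(2) * z - eps` per element, which is why no scheduler object is needed at inference time under these assumptions.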
Key Experimental Results
Table 1: Quantitative Comparison on Synthetic ImageNet-Test (×4 SR)
| Method | Steps | PSNR↑ | NIQE↓ | LIQE↑ | ClipIQA↑ | MUSIQ↑ | Q-Align↑ |
|---|---|---|---|---|---|---|---|
| Real-ESRGAN | GAN | 26.62 | 4.49 | 3.84 | 0.509 | 64.81 | 3.423 |
| DiffBIR-50 | 50 | 25.45 | 4.93 | 4.64 | 0.749 | 73.04 | 4.323 |
| DreamClear-50 | 50 | 24.76 | 5.38 | 4.43 | 0.765 | 70.08 | 4.092 |
| OSEDiff-1 | 1 | 24.82 | 4.28 | 4.56 | 0.678 | 71.74 | 4.067 |
| InvSR-1 | 1 | 23.81 | 4.39 | 4.56 | 0.711 | 72.38 | 3.987 |
| Ours (GenDR-1) | 1 | 24.14 | 4.13 | 4.81 | 0.740 | 74.68 | 4.361 |
Table 2: Quantitative Comparison on RealSet80
| Method | Inference Time | NIQE↓ | LIQE↑ | ClipIQA↑ | MUSIQ↑ | Q-Align↑ |
|---|---|---|---|---|---|---|
| StableSR-50 | 3731ms | 3.40 | 3.85 | 0.740 | 67.58 | 4.087 |
| SeeSR-50 | 6359ms | 4.37 | 4.28 | 0.712 | 69.74 | 4.306 |
| DreamClear-50 | 6892ms | 3.73 | 3.96 | 0.724 | 67.22 | 4.121 |
| OSEDiff-1 | 103ms | 3.98 | 4.13 | 0.704 | 69.19 | 4.306 |
| InvSR-1 | 115ms | 4.03 | 4.29 | 0.727 | 69.79 | 4.301 |
| Ours (GenDR-1) | 77ms | 3.98 | 4.52 | 0.742 | 71.57 | 4.453 |
Ablation Study: Distillation Strategy (RealSet80)
| Base Model | Distillation Strategy | LIQE↑ | ClipIQA↑ | MUSIQ↑ | Q-Align↑ |
|---|---|---|---|---|---|
| SD2.1-VAE4 | VSD | 4.13 | 0.704 | 69.19 | 4.306 |
| SD2.1-VAE4 | CiDA | 4.32 | 0.723 | 70.13 | 4.386 |
| SD2.1-VAE16 | VSD | 4.12 | 0.691 | 68.82 | 4.373 |
| SD2.1-VAE16 | SiD | 4.25 | 0.702 | 69.33 | 4.391 |
| SD2.1-VAE16 | CiD | 4.44 | 0.715 | 70.61 | 4.428 |
| SD2.1-VAE16 | CiDA | 4.52 | 0.742 | 71.57 | 4.453 |
Key Findings
- Target divergence is the root cause: T2I needs multi-step + 4-channel latents to generate content from noise; SR only needs to supplement high-frequency details. Reusing T2I models for SR is sub-optimal.
- 16-channel VAE is vital for SR: Even on 0.9B models, VAE16 preserves significantly more detail and structure than VAE4.
- CiDA facilitates progressive gains: On SD2.1-VAE16, Q-Align rises monotonically from VSD (4.373) through SiD (4.391) and CiD (4.428) to CiDA (4.453). CiD adds 0.037 over SiD (0.055 over VSD), and the adversarial and REPA terms of CiDA add a further 0.025.
- Fixed prompt embeddings maintain quality: Compared to dynamic LLM-generated prompts, fixed embeddings only drop MUSIQ by 0.17 while reducing inference from 113ms to 77ms.
- Pareto optimal efficiency: GenDR leads on most NR-IQA metrics, with exceptions such as NIQE on RealSet80, where StableSR scores best. It does so with the fastest inference (77ms) and a near-minimal parameter count (933M), an 89.5× speedup over DreamClear (6892ms).
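The ratios quoted in these findings can be re-derived from the tables above. The sketch below copies the RealSet80 latencies from Table 2 and the SD2.1-VAE16 Q-Align scores from the ablation table; the helper names are illustrative.

```python
# Inference times (ms) from Table 2.
latency_ms = {
    "DreamClear-50": 6892,
    "OSEDiff-1": 103,
    "InvSR-1": 115,
    "GenDR-1": 77,
}

# Q-Align on RealSet80 for the SD2.1-VAE16 rows of the ablation table.
q_align = {"VSD": 4.373, "SiD": 4.391, "CiD": 4.428, "CiDA": 4.453}


def speedup(baseline: str, ours: str = "GenDR-1") -> float:
    """How many times faster `ours` is than `baseline`."""
    return latency_ms[baseline] / latency_ms[ours]


def latency_reduction(baseline: str, ours: str = "GenDR-1") -> float:
    """Fraction of the baseline latency saved by `ours`."""
    return 1.0 - latency_ms[ours] / latency_ms[baseline]


def q_align_gain(before: str, after: str) -> float:
    """Q-Align improvement when moving from one strategy to another."""
    return round(q_align[after] - q_align[before], 3)


print(f"Speedup over DreamClear: {speedup('DreamClear-50'):.1f}x")
print(f"Latency saved vs OSEDiff: {latency_reduction('OSEDiff-1'):.1%}")
print(f"Latency saved vs InvSR:   {latency_reduction('InvSR-1'):.1%}")
print(f"SiD -> CiD gain:  {q_align_gain('SiD', 'CiD')}")
print(f"CiD -> CiDA gain: {q_align_gain('CiD', 'CiDA')}")
```

Note that the DreamClear latency is reported on 3×A100 and GenDR's on 1×A100 in the comparison table later in this note, so the speedup is a wall-clock ratio and not a per-GPU figure.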
Highlights & Insights
- Deep Insight: Systematically analyzes the divergence between T2I and SR objectives, providing a theoretical foundation for custom diffusion SR models.
- Systematic Solution: Optimizes across the base model (VAE16), distillation method (CiDA), and pipeline efficiency.
- Extreme Efficiency: 77ms one-step inference, about 25% and 33% lower latency than the previous one-step methods OSEDiff (103ms) and InvSR (115ms), respectively.
- Training Practicality: Efficiently trains three UNet variants (score, generator, discriminator) using LoRA and model sharing.
Limitations & Future Work
- Latent Space Exploration: The effectiveness of 16-channel VAE is verified, but larger spaces (32/64 channels) were not explored due to training costs.
- CiDA VRAM Requirements: Despite LoRA/DeepSpeed, CiDA requires significant VRAM, making it difficult to scale to large DiT models (e.g., FLUX).
- PSNR Trade-off: While leading in perceptual metrics (LIQE/MUSIQ), GenDR's PSNR on ImageNet-Test (24.14) is below every other method in Table 1 except InvSR (23.81), including the one-step OSEDiff (24.82), indicating a trade-off in pixel-level fidelity.
Related Work & Insights
| Dimension | OSEDiff (Wu et al., 2024b) | Ours (GenDR) |
|---|---|---|
| Base Model | SD2.1 (4-channel VAE, 0.9B) | SD2.1-VAE16 (16-channel VAE, 0.9B) |
| Mechanism | VSD + L1/MSE Regularization | CiDA: SR Priors + Adv + REPA |
| Inference Time | 103ms | 77ms |
| Q-Align | 4.306 | 4.453 |

| Dimension | DreamClear (Ai et al., 2025) | Ours (GenDR) |
|---|---|---|
| Base Model | PixArt-α (2.2B) | SD2.1-VAE16 (0.9B) |
| Steps | 50 | 1 |
| Function | 2 ControlNets + MLLM | Minimalist (Fixed Embedding) |
| Inference Time | 6892ms (3×A100) | 77ms (1×A100) |
| MUSIQ | 67.22 | 71.57 |
Rating (1-5)
- Novelty: 4 — Insight into T2I/SR divergence is novel; CiD's integration of SR priors into score distillation is an original contribution.
- Technical Depth: 4 — Well-modeled evolution from VSD to CiDA with rigorous ablation of design decisions.
- Experimental Thoroughness: 4 — Comprehensive evaluation across 13 metrics and real/synthetic datasets, though downstream task evaluation is absent.
- Writing Quality: 4 — High-quality presentation with clear motivation and logical progression.