Exploiting Diffusion Prior for Task-driven Image Restoration¶
Conference: ICCV 2025 · arXiv: 2507.22459 · Code: None · Area: Image Restoration · Keywords: Task-driven image restoration, diffusion prior, high-level vision tasks, partial diffusion, short-step denoising
TL;DR¶
This paper proposes EDTR, a method that leverages diffusion model priors via a pre-restoration + partial diffusion strategy combined with short-step denoising to effectively recover task-relevant details, achieving significant gains in classification, segmentation, and detection under complex degradation scenarios.
Background & Motivation¶
Real-world images are frequently affected by multiple degradation factors (downsampling, blur, noise, JPEG compression, etc.), causing severe performance drops in high-level vision tasks such as classification, segmentation, and detection. Although image restoration (IR) appears to be a natural preprocessing solution, research has shown that naively applying IR as a front-end pipeline does not effectively recover task-critical information.
Task-driven Image Restoration (TDIR) therefore emerges with the goal of performing image recovery guided by downstream task performance. However, existing TDIR methods face three major challenges:
Single-degradation limitation: Prior TDIR methods are typically designed for a single degradation type (e.g., super-resolution only or denoising only), making them ill-suited for real-world scenarios involving multiple concurrent degradations.
Insufficient cues under severe degradation: When images are heavily degraded by multiple complex factors, very few cues remain for restoration, making it difficult for conventional methods to recover task-relevant details.
Misuse of diffusion priors: Although Stable Diffusion provides a powerful natural image prior, the standard SD-based IR pipeline (starting from pure noise with 50 denoising steps) tends to generate visually plausible but task-irrelevant results—for example, generating incorrect eye shapes that lead to wrong bird species classification.
Two root causes are identified: (1) generation from pure noise prevents direct exploitation of useful cues in the LQ image; (2) long-step denoising introduces redundant details that dilute task-critical information.
Method¶
Overall Architecture¶
EDTR consists of three core components: pre-restoration with partial diffusion, short-step denoising, and a task-driven training loss. The framework uses Stable Diffusion 2.1 as the backbone, ControlNet as a trainable adapter, and SwinIR as the pixel-level pre-restoration network.
Key Designs¶
- Pre-restoration and Partial Diffusion:
- Function: The LQ image is first pre-restored at the pixel level and then injected directly into the diffusion process, rather than starting generation from pure noise.
- Mechanism: SwinIR first pre-restores the LQ image as \(z_{\text{pre-res}} = \mathcal{E}(\mathcal{R}_{\text{pix}}(I_{\text{LQ}}))\); mild noise is then added to the pre-restored latent to initialize the diffusion process: \(z_{t,\text{partial}} = \sqrt{\bar\alpha_t} \cdot z_{\text{pre-res}} + \sqrt{1-\bar\alpha_t} \cdot \epsilon\), where the partial timestep \(t_p = 200\) (far smaller than \(T=1000\)).
- Design Motivation: SD is not trained to handle complex degradations, so directly feeding degraded images yields poor results. Pre-restoration removes most degradation artifacts, enabling SD to supplement task-relevant high-frequency details from a relatively clean foundation. Partial diffusion preserves the useful information in the LQ image instead of relying entirely on SD's generative capacity as a pure-noise initialization would.
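The noising equation above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: `z_pre_res` stands in for the SwinIR-restored latent \(\mathcal{E}(\mathcal{R}_{\text{pix}}(I_{\text{LQ}}))\), and the `alpha_bar` schedule is an illustrative cosine curve rather than SD's exact one.

```python
import numpy as np

def partial_diffuse(z_pre_res, alpha_bar, t_p=200, eps=None):
    """Initialize diffusion from the pre-restored latent:
    z_{t_p,partial} = sqrt(abar_{t_p}) * z_pre_res + sqrt(1 - abar_{t_p}) * eps.
    t_p = 200 is far below T = 1000, so only mild noise is added."""
    if eps is None:
        eps = np.random.standard_normal(z_pre_res.shape)
    a = alpha_bar[t_p]
    return np.sqrt(a) * z_pre_res + np.sqrt(1.0 - a) * eps

# Illustrative cosine-like cumulative schedule (not SD's exact one).
T = 1000
t = np.arange(T)
alpha_bar = np.cos((t / T) * np.pi / 2) ** 2

z_pre_res = np.zeros((1, 4, 64, 64))   # stand-in for E(R_pix(I_LQ))
z_partial = partial_diffuse(z_pre_res, alpha_bar)
```

Because \(\bar\alpha_{200}\) is still close to 1, most of the pre-restored signal survives the noising, which is exactly why partial diffusion preserves LQ cues that a pure-noise start would discard.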
- Short-step Denoising:
- Function: At inference time, denoising is completed in very few steps (1 or 4), rather than the conventional 50 steps.
- Mechanism: For \(n\)-step denoising, a timestep schedule \(\mathcal{T} = [t_p, \lfloor t_p/n \cdot (n-1)\rfloor, \ldots, \lfloor t_p/n \rfloor]\) is constructed. For 1-step denoising this simplifies to: \(z_{\text{diff-res}} = \frac{z_{t_p,\text{partial}} - \sqrt{1-\bar\alpha_{t_p}} \epsilon_\theta(z_{t_p,\text{partial}}, t_p, z_{\text{pre-res}})}{\sqrt{\bar\alpha_{t_p}}}\)
- Design Motivation: Experiments reveal that increasing the number of denoising steps improves perceptual quality but actually degrades task performance. Each denoising step injects details from the diffusion prior; repeated injection generates task-irrelevant redundant textures that dilute task-critical information. Short-step denoising preserves the useful content from pre-restoration while minimally supplementing details through the diffusion prior.
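The 1-step formula is the closed-form inversion of the forward-noising equation. In the sketch below (names are illustrative) the network's prediction \(\epsilon_\theta\) is replaced by the true noise, which verifies the algebra: a perfect prediction recovers the pre-restored latent exactly in a single step.

```python
import numpy as np

def one_step_denoise(z_t_partial, eps_pred, alpha_bar_tp):
    """z_diff_res = (z_{t_p,partial} - sqrt(1 - abar_{t_p}) * eps_pred)
                    / sqrt(abar_{t_p})."""
    return (z_t_partial - np.sqrt(1.0 - alpha_bar_tp) * eps_pred) / np.sqrt(alpha_bar_tp)

# Round trip: noise the latent, then denoise with the true noise.
rng = np.random.default_rng(0)
z_pre_res = rng.standard_normal((1, 4, 8, 8))
eps = rng.standard_normal(z_pre_res.shape)
a_tp = 0.95                                  # illustrative abar at t_p
z_partial = np.sqrt(a_tp) * z_pre_res + np.sqrt(1.0 - a_tp) * eps
z_diff_res = one_step_denoise(z_partial, eps, a_tp)
```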
- Task-driven Training Loss:
- Function: Dedicated loss functions are designed to guide the diffusion prior toward recovering task-relevant details.
- Mechanism: The high-level feature (HLF) loss is defined as \(\mathcal{L}_{\text{HLF}} = \frac{1}{2}(\|\mathcal{H}^f(I_{\text{EDTR}}) - \mathcal{H}^f(I_{\text{HQ}})\|_1 + \|\mathcal{H}^f_{\text{HQ}}(I_{\text{EDTR}}) - \mathcal{H}^f_{\text{HQ}}(I_{\text{HQ}})\|_1)\), measuring restoration quality in the intermediate feature spaces of two task networks: the currently trained \(\mathcal{H}\) and an HQ-pretrained \(\mathcal{H}_{\text{HQ}}\).
- Design Motivation: The noise prediction loss \(\mathcal{L}_\epsilon\) is not designed for task-driven restoration; the HLF loss directly constrains restoration quality in a task-relevant feature space, realizing genuinely task-oriented image recovery.
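A minimal sketch of the HLF loss, with the two task networks' intermediate-feature extractors abstracted as callables. Whether the \(\|\cdot\|_1\) term is summed or averaged over feature elements is an implementation detail the paper's notation leaves open; a mean is used here.

```python
import numpy as np

def hlf_loss(feat_cur, feat_hq, img_edtr, img_hq):
    """L_HLF = 0.5 * ( |H^f(I_EDTR) - H^f(I_HQ)|_1
                     + |H^f_HQ(I_EDTR) - H^f_HQ(I_HQ)|_1 )
    feat_cur: feature extractor of the jointly trained task network H
    feat_hq:  feature extractor of the frozen HQ-pretrained network H_HQ"""
    l1 = lambda a, b: np.abs(a - b).mean()
    return 0.5 * (l1(feat_cur(img_edtr), feat_cur(img_hq))
                  + l1(feat_hq(img_edtr), feat_hq(img_hq)))

# Toy extractors: any fixed mapping serves for illustration.
f_cur = lambda x: x * 2.0
f_hq = lambda x: x + 1.0
loss = hlf_loss(f_cur, f_hq, np.ones((2, 3)), np.zeros((2, 3)))
```

The loss vanishes only when the restored image matches the HQ image in both feature spaces, which is what makes the supervision task-relevant rather than pixel-wise.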
Loss & Training¶
- Alternating training: At each iteration, EDTR is first updated with \(\mathcal{L}_\text{HLF}\), then the task network \(\mathcal{H}\) is updated with \(\mathcal{L}_\text{task} + \alpha \mathcal{L}_\text{FM}\).
- Task loss: \(\mathcal{L}_\text{task} = f_\text{task}(\mathcal{H}(I_{\text{EDTR} \oplus \text{HQ}}), y)\), jointly training on half-batch EDTR-restored images and half-batch HQ images to stabilize training.
- Feature matching loss: \(\mathcal{L}_\text{FM} = \|\mathcal{H}^f(I_{\text{EDTR} \oplus \text{HQ}}) - \mathcal{H}^f_\text{HQ}(I_\text{HQ})\|_1\), functioning as cross-quality knowledge distillation.
- Wavelet color correction: After decoding, a wavelet transform separates high- and low-frequency components; high frequencies are taken from the diffusion result and low frequencies from the pre-restored output: \(I_\text{EDTR} = \mathbf{H}(\mathcal{D}(z_\text{diff-res})) + \mathbf{L}(\mathcal{R}_\text{pix}(I_\text{LQ}))\).
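The band swap in the wavelet color correction can be illustrated with a one-level Haar approximation as the low-pass \(\mathbf{L}(\cdot)\) and \(\mathbf{H}(x) = x - \mathbf{L}(x)\). This is a simplification for illustration; the paper does not pin down the exact wavelet used here.

```python
import numpy as np

def haar_low(x):
    """One-level Haar approximation (LL band): 2x2 block average,
    upsampled back to the input resolution."""
    ll = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 4.0
    return ll.repeat(2, axis=0).repeat(2, axis=1)

def wavelet_color_correct(diff_res, pre_res):
    """I_EDTR = H(diff_res) + L(pre_res): high frequencies from the
    diffusion output, low frequencies (global color/brightness) from
    the pixel pre-restoration, with H(x) = x - L(x)."""
    return (diff_res - haar_low(diff_res)) + haar_low(pre_res)

rng = np.random.default_rng(1)
diff_res = rng.random((64, 64))   # stand-in for D(z_diff_res)
pre_res = rng.random((64, 64))    # stand-in for R_pix(I_LQ)
corrected = wavelet_color_correct(diff_res, pre_res)
```

By construction the low band of the output comes entirely from the pre-restored image, which anchors the colors while the diffusion result contributes only texture.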
Key Experimental Results¶
Main Results¶
Tasks: image classification (CUB200), semantic segmentation (VOC2012), object detection (VOC2012)
Degradation type Mixture-A: SR (s=8) + JPEG (q=75)
| Method | Classification Acc↑ (%) | Segmentation mIoU↑ (%) | Detection mAP↑ (%) |
|---|---|---|---|
| Oracle (HQ) | 82.5 | 67.0 | 36.9 |
| No restoration | 60.8 | 47.4 | 14.5 |
| SwinIR | 70.0 | 56.2 | 21.0 |
| SR4IR | 71.5 | 56.2 | 22.3 |
| EDTR-1 step | 74.4 | 64.1 | 31.2 |
| EDTR-4 step | 74.1 | 65.3 | 33.4 |
Ablation Study¶
Task performance on Mixture-B (a more complex degradation combination)
| Method | Classification Acc↑ (%) | Segmentation mIoU↑ (%) | Detection mAP↑ (%) |
|---|---|---|---|
| No restoration | 47.6 | 40.2 | 0.0 |
| SwinIR | 60.7 | 50.4 | 22.1 |
| SR4IR | 63.4 | 51.0 | 22.6 |
| EDTR-1 step | 68.8 | 60.4 | 30.6 |
| EDTR-4 step | 68.4 | 62.9 | 33.4 |
Perceptual quality metrics comparison (Mixture-A)
| Method | NIQE↓ | Q-Align↑ | PSNR↑ |
|---|---|---|---|
| SwinIR | 9.97 | 2.55 | 25.21 |
| SR4IR | 5.41 | 3.37 | 24.07 |
| EDTR-1 step | 4.68 | 3.53 | 23.63 |
| EDTR-4 step | 4.26 | 3.81 | 22.46 |
Key Findings¶
- EDTR achieves the most pronounced gains on detection: on Mixture-A, mAP improves from 14.5% (no restoration) to 33.4% (+18.9 points); on Mixture-B, from 0.0% to 33.4%.
- High PSNR does not imply high task performance: EDTR's PSNR is lower than SwinIR's, yet it leads substantially on all task metrics, demonstrating that pixel fidelity does not equate to task effectiveness.
- Short-step denoising is critical for task performance: 1-step and 4-step results are comparable but both substantially outperform conventional 50-step denoising.
- Perceptual quality (NIQE/Q-Align) and task performance can be improved simultaneously, challenging the assumption that a trade-off between the two is inevitable.
Highlights & Insights¶
- The PSNR–task performance paradox: This work clearly demonstrates, for the first time within a TDIR framework, the inconsistency between pixel-level metrics and task-level metrics, which carries important methodological implications.
- Elegant design of partial diffusion: The pipeline of pre-restoration → mild noise addition → short-step denoising simultaneously preserves useful information from the LQ image and leverages the diffusion prior to supplement high-frequency details.
- One denoising step suffices: This challenges the conventional wisdom that more diffusion steps yield better results; in the TDIR setting, a single step is sufficient.
- Generality: The same framework applies to three fundamentally different tasks—classification, segmentation, and detection.
Limitations & Future Work¶
- Reliance on a pretrained StableDiffusion model incurs substantial computational cost (SD inference + ControlNet + joint task network training).
- HQ images are required for training, making the method inapplicable to fully unsupervised real-world scenarios.
- Only three vision tasks are validated; effectiveness on finer-grained tasks (e.g., keypoint detection, instance segmentation) remains unknown.
- Degradation types must be predefined during training; generalization to unknown degradations has yet to be verified.
- The quality of the pre-restoration network SwinIR directly impacts final results.
Related Work & Insights¶
- SR4IR: Proposed the joint training paradigm and task-driven perceptual loss for TDIR, serving as the direct predecessor of this work.
- StableSR / DiffBIR: Demonstrated the powerful capability of SD as an image prior, without considering task-driven objectives.
- ControlNet: Provides a mechanism for conditioning a frozen SD model on external inputs.
- Insight: The generative power of diffusion models must be "tamed" to serve specific tasks; blindly pursuing perceptual quality can actually harm task performance.
Rating¶
- Novelty: ⭐⭐⭐⭐ First effective use of diffusion priors in TDIR; the partial diffusion + short-step denoising design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three tasks × two degradation types with comprehensive ablations; dataset diversity could be further expanded.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and the method pipeline is easy to follow, though notation is occasionally redundant.
- Value: ⭐⭐⭐⭐ Provides valuable guidance on how to correctly exploit diffusion priors; the detection performance gains are particularly notable.