Exploiting Diffusion Prior for Task-driven Image Restoration

Conference: ICCV 2025 arXiv: 2507.22459 Code: None Area: Image Restoration Keywords: Task-driven image restoration, diffusion prior, high-level vision tasks, partial diffusion, short-step denoising

TL;DR

This paper proposes EDTR, a method that exploits diffusion model priors for task-driven image restoration. By combining pre-restoration with partial diffusion and short-step denoising, it effectively recovers task-relevant details and achieves significant gains in classification, segmentation, and detection under complex degradation scenarios.

Background & Motivation

Real-world images are frequently affected by multiple degradation factors (downsampling, blur, noise, JPEG compression, etc.), causing severe performance drops in high-level vision tasks such as classification, segmentation, and detection. Although image restoration (IR) appears to be a natural preprocessing solution, research has shown that naively applying IR as a front-end pipeline does not effectively recover task-critical information.

Task-driven Image Restoration (TDIR) therefore emerges with the goal of performing image recovery guided by downstream task performance. However, existing TDIR methods face three major challenges:

Single-degradation limitation: Prior TDIR methods are typically designed for a single degradation type (e.g., super-resolution only or denoising only), making them ill-suited for real-world scenarios involving multiple concurrent degradations.

Insufficient cues under severe degradation: When images are heavily degraded by multiple complex factors, very few cues remain for restoration, making it difficult for conventional methods to recover task-relevant details.

Misuse of diffusion priors: Although Stable Diffusion provides a powerful natural image prior, the standard SD-based IR pipeline (starting from pure noise with 50 denoising steps) tends to generate visually plausible but task-irrelevant results—for example, generating incorrect eye shapes that lead to wrong bird species classification.

Two root causes are identified: (1) generation from pure noise prevents direct exploitation of useful cues in the LQ image; (2) long-step denoising introduces redundant details that dilute task-critical information.

Method

Overall Architecture

EDTR consists of three core components: pre-restoration with partial diffusion, short-step denoising, and a task-driven training loss. The framework uses Stable Diffusion 2.1 as the backbone, ControlNet as a trainable adapter, and SwinIR as the pixel-error pre-restoration network.

Key Designs

  1. Pre-restoration and Partial Diffusion:

    • Function: The LQ image is first pre-restored at the pixel level and then injected directly into the diffusion process, rather than starting generation from pure noise.
    • Mechanism: SwinIR first pre-restores the LQ image as \(z_{\text{pre-res}} = \mathcal{E}(\mathcal{R}_{\text{pix}}(I_{\text{LQ}}))\); mild noise is then added to the pre-restored latent to initialize the diffusion process: \(z_{t,\text{partial}} = \sqrt{\bar\alpha_t} \cdot z_{\text{pre-res}} + \sqrt{1-\bar\alpha_t} \cdot \epsilon\), where the partial timestep \(t_p = 200\) (far smaller than \(T=1000\)).
    • Design Motivation: SD is not trained to handle complex degradations, so directly feeding degraded images yields poor results. Pre-restoration removes most degradation artifacts, enabling SD to supplement task-relevant high-frequency details from a relatively clean foundation. Partial diffusion preserves the useful information in the LQ image instead of relying entirely on SD's generative capacity as a pure-noise initialization would.
  2. Short-step Denoising:

    • Function: At inference time, denoising is completed in very few steps (1 or 4), rather than the conventional 50 steps.
    • Mechanism: For \(n\)-step denoising, a timestep schedule \(\mathcal{T} = [t_p, \lfloor t_p/n \cdot (n-1)\rfloor, \ldots, \lfloor t_p/n \rfloor]\) is constructed. For 1-step denoising this simplifies to: \(z_{\text{diff-res}} = \frac{z_{t_p,\text{partial}} - \sqrt{1-\bar\alpha_{t_p}} \epsilon_\theta(z_{t_p,\text{partial}}, t_p, z_{\text{pre-res}})}{\sqrt{\bar\alpha_{t_p}}}\)
    • Design Motivation: Experiments reveal that increasing the number of denoising steps improves perceptual quality but actually degrades task performance. Each denoising step injects details from the diffusion prior; repeated injection generates task-irrelevant redundant textures that dilute task-critical information. Short-step denoising preserves the useful content from pre-restoration while minimally supplementing details through the diffusion prior.
  3. Task-driven Training Loss:

    • Function: Dedicated loss functions are designed to guide the diffusion prior toward recovering task-relevant details.
    • Mechanism: The high-level feature (HLF) loss is defined as \(\mathcal{L}_{\text{HLF}} = \frac{1}{2}(\|\mathcal{H}^f(I_{\text{EDTR}}) - \mathcal{H}^f(I_{\text{HQ}})\|_1 + \|\mathcal{H}^f_{\text{HQ}}(I_{\text{EDTR}}) - \mathcal{H}^f_{\text{HQ}}(I_{\text{HQ}})\|_1)\), measuring restoration quality in the intermediate feature spaces of two task networks: the currently trained \(\mathcal{H}\) and an HQ-pretrained \(\mathcal{H}_{\text{HQ}}\).
    • Design Motivation: The noise prediction loss \(\mathcal{L}_\epsilon\) is not designed for task-driven restoration; the HLF loss directly constrains restoration quality in a task-relevant feature space, realizing genuinely task-oriented image recovery.
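The partial diffusion and short-step denoising equations above can be sketched numerically. Below is a minimal NumPy illustration; the linear beta schedule, tensor shapes, and the oracle noise predictor are assumptions for demonstration, not the paper's actual Stable Diffusion implementation:

```python
import numpy as np

# DDPM-style noise schedule (linear betas are an assumption; EDTR uses
# Stable Diffusion 2.1's schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def partial_diffusion(z_pre_res, t_p, eps):
    # z_{t,partial} = sqrt(abar_t) * z_pre_res + sqrt(1 - abar_t) * eps
    a = alpha_bar[t_p - 1]
    return np.sqrt(a) * z_pre_res + np.sqrt(1.0 - a) * eps

def short_step_schedule(t_p, n):
    # T_schedule = [t_p, floor(t_p/n * (n-1)), ..., floor(t_p/n)]
    return [t_p * (n - k) // n for k in range(n)]

def one_step_denoise(z_t, t, eps_pred):
    # 1-step case: (z_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t)
    a = alpha_bar[t - 1]
    return (z_t - np.sqrt(1.0 - a) * eps_pred) / np.sqrt(a)

rng = np.random.default_rng(0)
z_pre = rng.standard_normal((4, 8, 8))   # stand-in for E(R_pix(I_LQ))
eps = rng.standard_normal(z_pre.shape)

t_p = 200                                 # partial timestep, far below T = 1000
z_t = partial_diffusion(z_pre, t_p, eps)

print(short_step_schedule(t_p, 4))        # [200, 150, 100, 50]
# With an oracle noise prediction, one step exactly recovers the
# pre-restored latent; in EDTR, eps_theta instead supplements prior detail.
z_rec = one_step_denoise(z_t, t_p, eps)
print(np.allclose(z_rec, z_pre))          # True
```

The oracle check makes the design intuition concrete: the single denoising step is an exact inverse of the mild noising when the predictor is perfect, so any difference introduced by the real network is precisely the detail contributed by the diffusion prior.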

Loss & Training

  • Alternating training: At each iteration, EDTR is first updated with \(\mathcal{L}_\text{HLF}\), then the task network \(\mathcal{H}\) is updated with \(\mathcal{L}_\text{task} + \alpha \mathcal{L}_\text{FM}\).
  • Task loss: \(\mathcal{L}_\text{task} = f_\text{task}(\mathcal{H}(I_{\text{EDTR} \oplus \text{HQ}}), y)\), jointly training on half-batch EDTR-restored images and half-batch HQ images to stabilize training.
  • Feature matching loss: \(\mathcal{L}_\text{FM} = \|\mathcal{H}^f(I_{\text{EDTR} \oplus \text{HQ}}) - \mathcal{H}^f_\text{HQ}(I_\text{HQ})\|_1\), functioning as cross-quality knowledge distillation.
  • Wavelet color correction: After decoding, a wavelet transform separates high- and low-frequency components; high frequencies are taken from the diffusion result and low frequencies from the pre-restored output: \(I_\text{EDTR} = \mathbf{H}(\mathcal{D}(z_\text{diff-res})) + \mathbf{L}(\mathcal{R}_\text{pix}(I_\text{LQ}))\).
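The wavelet color correction step can be sketched with a simple frequency split. The box-filter low-pass below is a stand-in for the wavelet decomposition in the paper, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def lowpass(img, k=4):
    # Block-average low-frequency band (a box-filter stand-in for the
    # wavelet low-frequency component L used in the paper).
    h, w = img.shape
    small = img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, k, axis=0), k, axis=1)

def color_correct(diff_res, pre_res):
    # I_EDTR = H(D(z_diff_res)) + L(R_pix(I_LQ)): keep the diffusion
    # output's high frequencies, but take the low frequencies (colors)
    # from the pre-restored image.
    return (diff_res - lowpass(diff_res)) + lowpass(pre_res)

rng = np.random.default_rng(0)
diff_out = rng.random((16, 16))   # stand-in for the decoded diffusion result
pre_out = rng.random((16, 16))    # stand-in for the pre-restored image

corrected = color_correct(diff_out, pre_out)
# Low-frequency content of the output now matches the pre-restored image.
print(np.allclose(lowpass(corrected), lowpass(pre_out)))  # True
```

Because the low-pass operator is linear and idempotent, the corrected image provably inherits its low frequencies from the pre-restored output, which is what anchors the colors while the diffusion result supplies texture.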

Key Experimental Results

Main Results

Tasks: image classification (CUB200), semantic segmentation (VOC2012), object detection (VOC2012)
Degradation type: Mixture-A, SR(s=8) + JPEG(q=75)

| Method | Classification Acc↑ (%) | Segmentation mIoU↑ (%) | Detection mAP↑ (%) |
|---|---|---|---|
| Oracle (HQ) | 82.5 | 67.0 | 36.9 |
| No restoration | 60.8 | 47.4 | 14.5 |
| SwinIR | 70.0 | 56.2 | 21.0 |
| SR4IR | 71.5 | 56.2 | 22.3 |
| EDTR-1 step | 74.4 | 64.1 | 31.2 |
| EDTR-4 step | 74.1 | 65.3 | 33.4 |

Ablation Study

Task performance on Mixture-B (more complex degradation)

| Method | Classification Acc↑ (%) | Segmentation mIoU↑ (%) | Detection mAP↑ (%) |
|---|---|---|---|
| No restoration | 47.6 | 40.2 | 0.0 |
| SwinIR | 60.7 | 50.4 | 22.1 |
| SR4IR | 63.4 | 51.0 | 22.6 |
| EDTR-1 step | 68.8 | 60.4 | 30.6 |
| EDTR-4 step | 68.4 | 62.9 | 33.4 |

Perceptual quality metrics comparison (Mixture-A)

| Method | NIQE↓ | Q-Align↑ | PSNR↑ |
|---|---|---|---|
| SwinIR | 9.97 | 2.55 | 25.21 |
| SR4IR | 5.41 | 3.37 | 24.07 |
| EDTR-1 step | 4.68 | 3.53 | 23.63 |
| EDTR-4 step | 4.26 | 3.81 | 22.46 |

Key Findings

  1. EDTR achieves the most pronounced gains on detection: on Mixture-A, mAP improves from 14.5% (no restoration) to 33.4%, a gain of 18.9 points; on Mixture-B, from 0.0% to 33.4%.
  2. PSNR and task performance diverge: EDTR's PSNR is lower than SwinIR's, yet it leads substantially on all task metrics, demonstrating that pixel fidelity does not equate to task effectiveness.
  3. Short-step denoising is critical for task performance: 1-step and 4-step results are comparable but both substantially outperform conventional 50-step denoising.
  4. Perceptual quality (NIQE/Q-Align) and task performance can be improved simultaneously, challenging the assumption that a trade-off between the two is inevitable.

Highlights & Insights

  • The PSNR–task performance paradox: This work clearly demonstrates, for the first time within a TDIR framework, the inconsistency between pixel-level metrics and task-level metrics, which carries important methodological implications.
  • Elegant design of partial diffusion: The pipeline of pre-restoration → mild noise addition → short-step denoising simultaneously preserves useful information from the LQ image and leverages the diffusion prior to supplement high-frequency details.
  • One denoising step suffices: This challenges the conventional wisdom that more diffusion steps yield better results; in the TDIR setting, a single step is sufficient.
  • Generality: The same framework applies to three fundamentally different tasks—classification, segmentation, and detection.

Limitations & Future Work

  1. Reliance on a pretrained StableDiffusion model incurs substantial computational cost (SD inference + ControlNet + joint task network training).
  2. HQ images are required for training, making the method inapplicable to fully unsupervised real-world scenarios.
  3. Only three vision tasks are validated; effectiveness on finer-grained tasks (e.g., keypoint detection, instance segmentation) remains unknown.
  4. Degradation types must be predefined during training; generalization to unknown degradations has yet to be verified.
  5. The quality of the pre-restoration network SwinIR directly impacts final results.

Related Work

  • SR4IR: Proposed the joint training paradigm and task-driven perceptual loss for TDIR, serving as the direct predecessor of this work.
  • StableSR / DiffBIR: Demonstrated the powerful capability of SD as an image prior, without considering task-driven objectives.
  • ControlNet: Provides a mechanism for conditioning a frozen SD model on external inputs.
  • Insight: The generative power of diffusion models must be "tamed" to serve specific tasks; blindly pursuing perceptual quality can actually harm task performance.

Rating

  • Novelty: ⭐⭐⭐⭐ First effective use of diffusion priors in TDIR; the partial diffusion + short-step denoising design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three tasks × two degradation types with comprehensive ablations; dataset diversity could be further expanded.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and the method pipeline is easy to follow, though notation is occasionally redundant.
  • Value: ⭐⭐⭐⭐ Provides valuable guidance on how to correctly exploit diffusion priors; the detection performance gains are particularly notable.