Exploiting Diffusion Prior for Task-driven Image Restoration

Conference: ICCV 2025 arXiv: 2507.22459 Code: None Area: Image Restoration Keywords: Task-driven image restoration, diffusion prior, high-level vision tasks, partial diffusion, short-step denoising

TL;DR

This paper proposes EDTR, a method that exploits diffusion model priors for task-driven image restoration. By combining pre-restoration with partial diffusion and short-step denoising, it effectively recovers task-relevant details and achieves significant gains in classification, segmentation, and detection under complex degradation scenarios.

Background & Motivation

Real-world images are frequently affected by multiple degradation factors (downsampling, blur, noise, JPEG compression, etc.), causing severe performance drops in high-level vision tasks such as classification, segmentation, and detection. Although image restoration (IR) appears to be a natural preprocessing solution, research has shown that naively applying IR as a front-end pipeline does not effectively recover task-critical information.

Task-driven Image Restoration (TDIR) therefore emerges with the goal of performing image recovery guided by downstream task performance. However, existing TDIR methods face three major challenges:

Single-degradation limitation: Prior TDIR methods are typically designed for a single degradation type (e.g., super-resolution only or denoising only), making them ill-suited for real-world scenarios involving multiple concurrent degradations.

Insufficient cues under severe degradation: When images are heavily degraded by multiple complex factors, very few cues remain for restoration, making it difficult for conventional methods to recover task-relevant details.

Misuse of diffusion priors: Although Stable Diffusion provides a powerful natural image prior, the standard SD-based IR pipeline (starting from pure noise with 50 denoising steps) tends to generate visually plausible but task-irrelevant results—for example, generating incorrect eye shapes that lead to wrong bird species classification.

Two root causes are identified: (1) generation from pure noise prevents direct exploitation of useful cues in the LQ image; (2) long-step denoising introduces redundant details that dilute task-critical information.

Method

Overall Architecture

EDTR consists of three core components: pre-restoration with partial diffusion, short-step denoising, and a task-driven training loss. The framework uses Stable Diffusion 2.1 as the backbone, ControlNet as a trainable adapter, and SwinIR as the pixel-error pre-restoration network.

Key Designs

  1. Pre-restoration and Partial Diffusion:

    • Function: The LQ image is first pre-restored at the pixel level and then injected directly into the diffusion process, rather than starting generation from pure noise.
    • Mechanism: SwinIR first pre-restores the LQ image as \(z_{\text{pre-res}} = \mathcal{E}(\mathcal{R}_{\text{pix}}(I_{\text{LQ}}))\); mild noise is then added to the pre-restored latent to initialize the diffusion process: \(z_{t,\text{partial}} = \sqrt{\bar\alpha_t} \cdot z_{\text{pre-res}} + \sqrt{1-\bar\alpha_t} \cdot \epsilon\), where the partial timestep \(t_p = 200\) (far smaller than \(T=1000\)).
    • Design Motivation: SD is not trained to handle complex degradations, so directly feeding degraded images yields poor results. Pre-restoration removes most degradation artifacts, enabling SD to supplement task-relevant high-frequency details from a relatively clean foundation. Partial diffusion preserves the useful information in the LQ image instead of relying entirely on SD's generative capacity as a pure-noise initialization would.
  2. Short-step Denoising:

    • Function: At inference time, denoising is completed in very few steps (1 or 4), rather than the conventional 50 steps.
    • Mechanism: For \(n\)-step denoising, a timestep schedule \(\mathcal{T} = [t_p, \lfloor t_p/n \cdot (n-1)\rfloor, \ldots, \lfloor t_p/n \rfloor]\) is constructed. For 1-step denoising this simplifies to: \(z_{\text{diff-res}} = \frac{z_{t_p,\text{partial}} - \sqrt{1-\bar\alpha_{t_p}} \epsilon_\theta(z_{t_p,\text{partial}}, t_p, z_{\text{pre-res}})}{\sqrt{\bar\alpha_{t_p}}}\)
    • Design Motivation: Experiments reveal that increasing the number of denoising steps improves perceptual quality but actually degrades task performance. Each denoising step injects details from the diffusion prior; repeated injection generates task-irrelevant redundant textures that dilute task-critical information. Short-step denoising preserves the useful content from pre-restoration while minimally supplementing details through the diffusion prior.
  3. Task-driven Training Loss:

    • Function: Dedicated loss functions are designed to guide the diffusion prior toward recovering task-relevant details.
    • Mechanism: The high-level feature (HLF) loss is defined as \(\mathcal{L}_{\text{HLF}} = \frac{1}{2}(\|\mathcal{H}^f(I_{\text{EDTR}}) - \mathcal{H}^f(I_{\text{HQ}})\|_1 + \|\mathcal{H}^f_{\text{HQ}}(I_{\text{EDTR}}) - \mathcal{H}^f_{\text{HQ}}(I_{\text{HQ}})\|_1)\), measuring restoration quality in the intermediate feature spaces of two task networks: the currently trained \(\mathcal{H}\) and an HQ-pretrained \(\mathcal{H}_{\text{HQ}}\).
    • Design Motivation: The noise prediction loss \(\mathcal{L}_\epsilon\) is not designed for task-driven restoration; the HLF loss directly constrains restoration quality in a task-relevant feature space, realizing genuinely task-oriented image recovery.
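The partial diffusion and short-step denoising equations above can be sketched numerically. Below is a minimal NumPy illustration; the linear beta schedule, tensor shapes, and the oracle noise predictor are assumptions for demonstration, not the paper's actual Stable Diffusion implementation:

```python
import numpy as np

# DDPM-style noise schedule (linear betas are an assumption; EDTR uses
# Stable Diffusion 2.1's schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def partial_diffusion(z_pre_res, t_p, eps):
    # z_{t,partial} = sqrt(abar_t) * z_pre_res + sqrt(1 - abar_t) * eps
    a = alpha_bar[t_p - 1]
    return np.sqrt(a) * z_pre_res + np.sqrt(1.0 - a) * eps

def short_step_schedule(t_p, n):
    # T_schedule = [t_p, floor(t_p/n * (n-1)), ..., floor(t_p/n)]
    return [t_p * (n - k) // n for k in range(n)]

def one_step_denoise(z_t, t, eps_pred):
    # 1-step case: (z_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t)
    a = alpha_bar[t - 1]
    return (z_t - np.sqrt(1.0 - a) * eps_pred) / np.sqrt(a)

rng = np.random.default_rng(0)
z_pre = rng.standard_normal((4, 8, 8))   # stand-in for E(R_pix(I_LQ))
eps = rng.standard_normal(z_pre.shape)

t_p = 200                                 # partial timestep, far below T = 1000
z_t = partial_diffusion(z_pre, t_p, eps)

print(short_step_schedule(t_p, 4))        # [200, 150, 100, 50]
# With an oracle noise prediction, one step exactly recovers the
# pre-restored latent; in EDTR, eps_theta instead supplements prior detail.
z_rec = one_step_denoise(z_t, t_p, eps)
print(np.allclose(z_rec, z_pre))          # True
```

The oracle check makes the design intuition concrete: the single denoising step is an exact inverse of the mild noising when the predictor is perfect, so any difference introduced by the real network is precisely the detail contributed by the diffusion prior.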

Loss & Training

  • Alternating training: At each iteration, EDTR is first updated with \(\mathcal{L}_\text{HLF}\), then the task network \(\mathcal{H}\) is updated with \(\mathcal{L}_\text{task} + \alpha \mathcal{L}_\text{FM}\).
  • Task loss: \(\mathcal{L}_\text{task} = f_\text{task}(\mathcal{H}(I_{\text{EDTR} \oplus \text{HQ}}), y)\), jointly training on half-batch EDTR-restored images and half-batch HQ images to stabilize training.
  • Feature matching loss: \(\mathcal{L}_\text{FM} = \|\mathcal{H}^f(I_{\text{EDTR} \oplus \text{HQ}}) - \mathcal{H}^f_\text{HQ}(I_\text{HQ})\|_1\), functioning as cross-quality knowledge distillation.
  • Wavelet color correction: After decoding, a wavelet transform separates high- and low-frequency components; high frequencies are taken from the diffusion result and low frequencies from the pre-restored output: \(I_\text{EDTR} = \mathbf{H}(\mathcal{D}(z_\text{diff-res})) + \mathbf{L}(\mathcal{R}_\text{pix}(I_\text{LQ}))\).
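The wavelet color correction step can be sketched with a simple frequency split. The box-filter low-pass below is a stand-in for the wavelet decomposition in the paper, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def lowpass(img, k=4):
    # Block-average low-frequency band (a box-filter stand-in for the
    # wavelet low-frequency component L used in the paper).
    h, w = img.shape
    small = img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, k, axis=0), k, axis=1)

def color_correct(diff_res, pre_res):
    # I_EDTR = H(D(z_diff_res)) + L(R_pix(I_LQ)): keep the diffusion
    # output's high frequencies, but take the low frequencies (colors)
    # from the pre-restored image.
    return (diff_res - lowpass(diff_res)) + lowpass(pre_res)

rng = np.random.default_rng(0)
diff_out = rng.random((16, 16))   # stand-in for the decoded diffusion result
pre_out = rng.random((16, 16))    # stand-in for the pre-restored image

corrected = color_correct(diff_out, pre_out)
# Low-frequency content of the output now matches the pre-restored image.
print(np.allclose(lowpass(corrected), lowpass(pre_out)))  # True
```

Because the low-pass operator is linear and idempotent, the corrected image provably inherits its low frequencies from the pre-restored output, which is what anchors the colors while the diffusion result supplies texture.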

Key Experimental Results

Main Results

Tasks: image classification (CUB200), semantic segmentation (VOC2012), object detection (VOC2012)
Degradation type: Mixture-A, SR(s=8) + JPEG(q=75)

| Method | Classification Acc↑ (%) | Segmentation mIoU↑ (%) | Detection mAP↑ (%) |
|---|---|---|---|
| Oracle (HQ) | 82.5 | 67.0 | 36.9 |
| No restoration | 60.8 | 47.4 | 14.5 |
| SwinIR | 70.0 | 56.2 | 21.0 |
| SR4IR | 71.5 | 56.2 | 22.3 |
| EDTR-1 step | 74.4 | 64.1 | 31.2 |
| EDTR-4 step | 74.1 | 65.3 | 33.4 |

Ablation Study

Task performance on Mixture-B (more complex degradation)

| Method | Classification Acc↑ (%) | Segmentation mIoU↑ (%) | Detection mAP↑ (%) |
|---|---|---|---|
| No restoration | 47.6 | 40.2 | 0.0 |
| SwinIR | 60.7 | 50.4 | 22.1 |
| SR4IR | 63.4 | 51.0 | 22.6 |
| EDTR-1 step | 68.8 | 60.4 | 30.6 |
| EDTR-4 step | 68.4 | 62.9 | 33.4 |

Perceptual quality metrics comparison (Mixture-A)

| Method | NIQE↓ | Q-Align↑ | PSNR↑ |
|---|---|---|---|
| SwinIR | 9.97 | 2.55 | 25.21 |
| SR4IR | 5.41 | 3.37 | 24.07 |
| EDTR-1 step | 4.68 | 3.53 | 23.63 |
| EDTR-4 step | 4.26 | 3.81 | 22.46 |

Key Findings

  1. EDTR achieves the most pronounced gains on detection: on Mixture-A, mAP improves from 14.5% (no restoration) to 33.4%, a gain of 18.9 points; on Mixture-B, from 0.0% to 33.4%.
  2. PSNR and task performance diverge: EDTR's PSNR is lower than SwinIR's, yet it leads substantially on all task metrics, demonstrating that pixel fidelity does not equate to task effectiveness.
  3. Short-step denoising is critical for task performance: 1-step and 4-step results are comparable but both substantially outperform conventional 50-step denoising.
  4. Perceptual quality (NIQE/Q-Align) and task performance can be improved simultaneously, challenging the assumption that a trade-off between the two is inevitable.

Highlights & Insights

  • The PSNR–task performance paradox: This work clearly demonstrates, for the first time within a TDIR framework, the inconsistency between pixel-level metrics and task-level metrics, which carries important methodological implications.
  • Elegant design of partial diffusion: The pipeline of pre-restoration → mild noise addition → short-step denoising simultaneously preserves useful information from the LQ image and leverages the diffusion prior to supplement high-frequency details.
  • One denoising step suffices: This challenges the conventional wisdom that more diffusion steps yield better results; in the TDIR setting, a single step is sufficient.
  • Generality: The same framework applies to three fundamentally different tasks—classification, segmentation, and detection.

Limitations & Future Work

  1. Reliance on a pretrained StableDiffusion model incurs substantial computational cost (SD inference + ControlNet + joint task network training).
  2. HQ images are required for training, making the method inapplicable to fully unsupervised real-world scenarios.
  3. Only three vision tasks are validated; effectiveness on finer-grained tasks (e.g., keypoint detection, instance segmentation) remains unknown.
  4. Degradation types must be predefined during training; generalization to unknown degradations has yet to be verified.
  5. The quality of the pre-restoration network SwinIR directly impacts final results.

Related Work

  • SR4IR: Proposed the joint training paradigm and task-driven perceptual loss for TDIR, serving as the direct predecessor of this work.
  • StableSR / DiffBIR: Demonstrated the powerful capability of SD as an image prior, without considering task-driven objectives.
  • ControlNet: Provides a mechanism for conditioning a frozen SD model on external inputs.
  • Insight: The generative power of diffusion models must be "tamed" to serve specific tasks; blindly pursuing perceptual quality can actually harm task performance.

Rating

  • Novelty: ⭐⭐⭐⭐ First effective use of diffusion priors in TDIR; the partial diffusion + short-step denoising design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three tasks × two degradation types with comprehensive ablations; dataset diversity could be further expanded.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and the method pipeline is easy to follow, though notation is occasionally redundant.
  • Value: ⭐⭐⭐⭐ Provides valuable guidance on how to correctly exploit diffusion priors; the detection performance gains are particularly notable.