# LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
- Conference: ICCV 2025
- arXiv: 2507.00790
- Code: https://github.com/AMAP-ML/LD-RPS
- Area: Image Generation
- Keywords: Zero-shot image restoration, posterior sampling, latent diffusion, recurrent refinement, multimodal prior
## TL;DR
LD-RPS proposes a zero-shot, dataset-free unified image restoration method that performs recurrent posterior sampling via a pretrained latent diffusion model. It leverages multimodal large language models for semantic priors and a learnable F-PAM module to align the degradation domain, achieving high-quality blind restoration across diverse degradation types.
## Background & Motivation
Unified Image Restoration (UIR) aims to handle multiple degradation types (noise, low-light, haze, colorization, etc.) within a single model and represents an important direction in low-level vision. Existing methods face three major challenges:
Task-specific methods lack generalization: Traditional methods (e.g., Zero-DCE++, AOD-Net) design networks for specific degradations and cannot generalize to other types.
Supervised unified methods are constrained to closed sets: Methods such as AirNet, PromptIR, and DiffUIR are trained on specific datasets and exhibit significant performance degradation when encountering unseen degradation types.
Existing posterior sampling methods are unstable: Methods like GDP rely on pixel-level diffusion and explicit degradation modeling (\(y = Ax + B\)), which is unsuitable for complex real-world degradations.
An ideal unified restoration solution should simultaneously satisfy: (1) unsupervised — no reliance on labeled data; (2) dataset-free — no need for training data collection; and (3) generalizable — capable of handling unseen degradation types.
The authors' core insight is twofold: latent space is more suitable for posterior sampling than pixel space, as latent representations filter out redundant pixel information and degradation noise; and recurrent sampling is more stable than single-pass sampling, as using the previous iteration's result as the next initialization progressively improves quality.
## Method

### Overall Architecture
The inference pipeline of LD-RPS comprises three core components:
- MLLM Semantic Prior Generation: A multimodal large language model (e.g., GPT-4V) generates textual descriptions from low-quality images, which serve as text embeddings to guide the diffusion model.
- F-PAM (Feature and Pixel Alignment Module): A lightweight learnable network that bridges the degraded image domain and the diffusion model's generation domain.
- Recurrent Posterior Sampling: Extends single-pass posterior sampling into a multi-round recurrent refinement process.
### Key Designs
1. Task-Blind Semantic Prior Generation
The method exploits the image understanding capability of MLLMs to extract semantic information from degraded images: a low-quality image and a hand-crafted prompt are fed to the MLLM → the MLLM generates a content description → the description is encoded as a text embedding \(c\) → the embedding guides the diffusion model toward the target content. This eliminates the need for manually specifying the degradation type.
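The prompt-to-embedding flow above can be sketched minimally. Both `mllm_describe` and `encode_text` are hypothetical stand-ins (in the paper, an MLLM such as GPT-4V produces the caption and the diffusion model's own text encoder produces the embedding \(c\)); the toy hash-based embedding exists only to make the pipeline executable.

```python
# Hedged sketch of the task-blind semantic-prior step.
# `mllm_describe` and `encode_text` are hypothetical stand-ins, not the paper's code.

PROMPT = ("Describe the content of this image in one sentence, "
          "ignoring any degradation such as noise, haze, or low light.")

def mllm_describe(image, prompt=PROMPT):
    """Stand-in for an MLLM caption call (e.g., GPT-4V) on the degraded image."""
    # A real system would send (image, prompt) to the MLLM here.
    return "a street with parked cars under trees at dusk"

def encode_text(description):
    """Stand-in for the diffusion model's text encoder."""
    # Toy embedding: bag-of-words hashed into a fixed-size vector.
    import hashlib
    dim = 8
    vec = [0.0] * dim
    for word in description.split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

degraded_image = None  # placeholder for the low-quality input
c = encode_text(mllm_describe(degraded_image))  # text embedding guiding sampling
```

The point is purely structural: the degradation type never enters the pipeline, only a content description does.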
2. F-PAM: Feature and Pixel Alignment Module
This is the central design for addressing the unique challenges of LD-RPS. Two gaps must be bridged:

- Spatial gap: dimensional mismatch between the latent space \(z\) and the image space \(x\)
- Domain gap: distributional discrepancy between the clean image domain and the degraded image domain
F-PAM structure: \(\psi[\tilde{z}_0, \tilde{z}_0'] = h_2(h_1(f[\tilde{z}_0, \tilde{z}_0'])) + p \odot h_1(f[\tilde{z}_0, \tilde{z}_0'])\)
where \(f\) is a frozen VAE decoder, \(h_1/h_2\) are learnable convolutional networks, and \(p\) is a learnable channel attention factor. F-PAM is jointly optimized with the reverse diffusion process using L2 loss + perceptual loss + GAN loss.
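The F-PAM formula can be sketched as a toy forward pass. For brevity this sketch feeds a single latent rather than the concatenation \([\tilde{z}_0, \tilde{z}_0']\), models \(f\) as a naive 2× upsampler rather than a real VAE decoder, and models \(h_1, h_2\) as random 1×1 convolutions; all of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 3  # channel count (toy)

def f(z):
    """Stand-in for the frozen VAE decoder: naive 2x spatial upsample."""
    return np.repeat(z, 2, axis=1).repeat(2, axis=2)

# Learnable parts, modelled here as fixed random 1x1 "convolutions".
W1 = rng.standard_normal((C, C)) * 0.1   # h1
W2 = rng.standard_normal((C, C)) * 0.1   # h2
p = rng.standard_normal((C, 1, 1)) * 0.1  # learnable channel attention factor

def conv1x1(W, x):
    """1x1 convolution as a per-pixel channel mixing."""
    return np.einsum("oc,chw->ohw", W, x)

def f_pam(z):
    """psi(z) = h2(h1(f(z))) + p * h1(f(z))  -- residual, channel-gated form."""
    u = conv1x1(W1, f(z))           # h1(f(z))
    return conv1x1(W2, u) + p * u   # h2(h1(f(z))) + p ⊙ h1(f(z))

z0 = rng.standard_normal((C, 4, 4))  # toy latent
out = f_pam(z0)                      # image-space output, (3, 8, 8)
```

The channel-gated residual means \(h_2\) learns a correction on top of a per-channel rescaling of the shared feature \(h_1(f(\cdot))\), which is consistent with bridging a distributional (domain) gap rather than re-synthesizing content.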
3. Two-Stage Posterior Sampling
The reverse diffusion process is divided into two stages:

- \(T \to t_1\) (early stage): only F-PAM is trained; the guidance term \(g = 0\), so the diffusion direction is not altered.
- \(t_1 \to 0\) (late stage): F-PAM and the posterior direction are jointly optimized, correcting the sampling trajectory via the gradient \(g = \nabla_{z_t} \log p(y \mid \hat{z}_0)\).
The posterior loss comprises:

- Distance loss \(L\): L2 + perceptual loss + GAN loss (degraded-to-degraded domain alignment)
- Quality loss \(Q\): brightness constraint + chrominance consistency constraint
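The two-stage schedule can be sketched as a toy guided DDPM loop. Everything here is a stand-in: `eps_model` is not a trained denoiser, the quadratic likelihood \(\log p(y\mid\hat z_0) = -\tfrac12\|\hat z_0 - y\|^2\) replaces the paper's composite posterior loss, and the guidance weight 0.1 is arbitrary. Only the structure — guidance off for \(t \ge t_1\), on below — follows the paper.

```python
import numpy as np

T, t1 = 50, 30                       # total steps; guidance switches on below t1
betas = np.linspace(1e-4, 0.02, T)   # toy linear noise schedule
alphas = 1.0 - betas
abar = np.cumprod(alphas)

rng = np.random.default_rng(0)
y = np.array([0.5, -0.2])            # toy "degraded observation" in latent space

def eps_model(z_t, t):
    """Stand-in for the pretrained latent diffusion denoiser."""
    return z_t * 0.1

def grad_log_posterior(z0_hat, y):
    """g = grad log p(y | z0_hat); toy L2 likelihood -(||z0_hat - y||^2)/2."""
    return -(z0_hat - y)

z = rng.standard_normal(2)
for t in reversed(range(T)):
    eps = eps_model(z, t)
    # Tweedie-style estimate of the clean latent from z_t.
    z0_hat = (z - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
    # Early stage (T -> t1): no guidance; late stage (t1 -> 0): posterior gradient.
    g = np.zeros_like(z) if t >= t1 else grad_log_posterior(z0_hat, y)
    mean = (z - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
    z = mean + 0.1 * g               # posterior-guided step (arbitrary weight)
    if t > 0:
        z = z + np.sqrt(betas[t]) * rng.standard_normal(2)
```

Leaving the early steps unguided lets the generative prior establish coarse structure before the posterior term starts pulling the trajectory toward the observation.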
4. Recurrent Refinement
The core idea is to re-encode and re-noise the restoration result \(x_0^{(i)}\) from iteration \(i\) to noise level \(\gamma T\), using it as the initialization for iteration \(i+1\). Each round starts from a lower noise level, yielding greater stability. The recurrence factor \(\gamma \in (0,1)\) controls the degree of re-noising.
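The re-noising step is the standard forward-diffusion marginal applied at level \(t^* = \gamma T\): \(z_{t^*} = \sqrt{\bar\alpha_{t^*}}\, z + \sqrt{1-\bar\alpha_{t^*}}\, \epsilon\). A minimal sketch, with the VAE encoder stubbed out and a toy schedule:

```python
import numpy as np

T = 50
gamma = 0.4                          # recurrence factor in (0, 1)
betas = np.linspace(1e-4, 0.02, T)   # toy noise schedule
abar = np.cumprod(1.0 - betas)

def encode(x):
    """Stand-in for the frozen VAE encoder."""
    return x

rng = np.random.default_rng(0)
x0_i = rng.standard_normal(4)        # restoration result of round i (toy)

# Re-noise the encoded result to level t* = gamma * T; this initializes round i+1.
t_star = int(gamma * T)
z = encode(x0_i)
eps = rng.standard_normal(z.shape)
z_init = np.sqrt(abar[t_star]) * z + np.sqrt(1 - abar[t_star]) * eps
# Round i+1 then runs the reverse process from t_star down to 0, starting at z_init.
```

Since \(t^* < T\), each round retains part of the previous round's signal instead of restarting from pure noise, which is where the added stability comes from.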
## Loss & Training
LD-RPS is a purely inference-time method: it requires no training of its own beyond the off-the-shelf pretrained latent diffusion model. Lightweight online optimization does, however, take place during inference:
- F-PAM training loss \(S_\psi\): L2 reconstruction + VGG perceptual + GAN adversarial
- Posterior guidance loss \(L_\text{total}\): distance loss (L2 + perceptual + GAN) + quality loss (brightness + chrominance)
- Type discriminator \(D_2\): distinguishes the residuals of "clean − degraded" pairs from those of "generated − degraded version" pairs
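The posterior guidance loss can be sketched as a sum of toy terms. The perceptual feature map, the specific brightness and chrominance formulas, and the mid-gray target are illustrative assumptions, and the GAN terms are omitted entirely; only the additive structure (distance loss + quality loss) follows the paper.

```python
import numpy as np

def l2(a, b):
    """Plain L2 reconstruction term."""
    return float(np.mean((a - b) ** 2))

def perceptual(a, b, feat=lambda x: x ** 2):
    """Stand-in for a VGG feature distance (toy feature map)."""
    return l2(feat(a), feat(b))

def brightness_loss(x, target=0.5):
    """Penalize mean brightness drifting from a mid-gray target (assumed form)."""
    return (float(x.mean()) - target) ** 2

def chrominance_loss(x_rgb):
    """Penalize channel means diverging -- a crude color-cast constraint (assumed form)."""
    means = x_rgb.mean(axis=(1, 2))
    return float(np.var(means))

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))   # restored image (toy)
y = rng.random((3, 8, 8))   # F-PAM-mapped degraded observation (toy)

# Distance loss (GAN term omitted) + quality loss.
total = l2(x, y) + perceptual(x, y) + brightness_loss(x) + chrominance_loss(x)
```

In the actual method this scalar is differentiated with respect to \(z_t\) to produce the guidance gradient \(g\).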
All experiments are conducted on NVIDIA H20 GPUs, and results are averaged over 3 random seeds.
## Key Experimental Results

### Main Results
Low-light Enhancement (LOLv1 dataset):
| Method | Definition (B/D/U) | PSNR↑ | SSIM↑ | LPIPS↓ | PI↓ | NIQE↓ |
|---|---|---|---|---|---|---|
| DiffUIR | ✓/✗/✗ | 21.36 | 0.907 | 0.125 | 4.68 | 5.95 |
| ZERO-IG | ✗/✓/✓ | 17.22 | 0.794 | 0.184 | 4.92 | 6.22 |
| GDP | ✗/✓/✓ | 16.52 | 0.690 | 0.261 | 4.16 | 5.73 |
| TAO | ✓/✓/✓ | 15.84 | 0.757 | 0.363 | 6.34 | 8.79 |
| LD-RPS | ✓/✓/✓ | 17.45 | 0.804 | 0.277 | 4.79 | 5.52 |
Dehazing (RESIDE-HSTS dataset):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| YOLY | 20.49 | 0.794 | 0.108 |
| GDP | 13.15 | 0.757 | 0.144 |
| TAO | 18.38 | 0.823 | 0.147 |
| LD-RPS | 21.45 | 0.813 | 0.177 |
### Ablation Study
Effect of recurrence count (LOLv1 / RESIDE / Kodak24):
| Recurrence Count | LOLv1 PSNR↑ | RESIDE PSNR↑ | Kodak24 PSNR↑ |
|---|---|---|---|
| 0 | 16.78 | 19.35 | 27.75 |
| 1 | 17.21 | 20.38 | 28.60 |
| 2 | 17.73 | 20.83 | 28.26 |
| 3 | 17.10 | 21.60 | 28.49 |
The optimal recurrence count correlates with the degree of coupling between degradation and semantics: stronger coupling (e.g., dehazing) requires more iterations.
Ablation of text guidance:
| Setting | LOLv1 PSNR↑ | RESIDE PSNR↑ | Kodak24 PSNR↑ |
|---|---|---|---|
| w/o Text | 16.03 | 19.63 | 28.13 |
| Full (w/ Text) | 17.73 (+1.70) | 21.60 (+1.97) | 28.60 (+0.47) |
Textual priors yield significant improvements across all tasks, most notably for dehazing (+1.97 PSNR).
## Key Findings
- LD-RPS surpasses all posterior sampling baselines in the zero-shot setting: It outperforms GDP and TAO on low-light enhancement, dehazing, and denoising.
- Recurrent refinement is effective but not monotonically beneficial: An optimal recurrence count exists; excessive iterations may degrade quality.
- Textual priors are a critical performance factor: Semantic descriptions generated by the MLLM provide essential directional guidance for the diffusion model.
- F-PAM addresses implicit degradation modeling: Compared to GDP's explicit modeling (\(y = Ax + B\)), F-PAM adapts to complex nonlinear degradations.
## Highlights & Insights
- Posterior sampling in latent space is a forward-looking idea: Compared to pixel space, latent space inherently suppresses degradation noise, making it naturally advantageous for restoration.
- MLLMs provide zero-shot semantic priors: The approach cleverly leverages large models' image understanding capabilities to compensate for the absence of degradation-type priors.
- Recurrent refinement is simple yet effective: Echoing the spirit of bootstrapping, it trades the instability of single-pass sampling for the stability of iterative refinement.
- Genuinely unified and zero-shot: Simultaneously satisfies task-blind, dataset-free, and unsupervised conditions.
## Limitations & Future Work
- Slow inference speed: Recurrent sampling combined with online F-PAM training results in long processing times per image.
- Color bias: Color shifts still occur in certain scenarios, requiring quality loss \(Q\) as a constraint.
- Dependence on MLLM quality: The quality of textual priors depends on the MLLM's ability to understand degraded images, which may fail under severe degradation.
- Unstable GAN discriminator training: Online discriminator training may introduce instability.
- Lack of evaluation on super-resolution and deblurring: Validation is limited to enhancement, dehazing, denoising, and colorization; spatial degradation types are not covered.
## Related Work & Insights
- GDP: A pixel-space diffusion posterior sampling method; the direct improvement target of LD-RPS.
- TAO: A test-time adaptive diffusion method and another posterior sampling baseline.
- DiffUIR / DA-CLIP: Supervised unified restoration methods, constrained to closed sets.
- AirNet / PromptIR: Degradation-aware unified restoration methods requiring paired training data.
- Insights: The combination of latent space + learnable degradation mapping + recurrent refinement constitutes a powerful paradigm for zero-shot restoration; MLLMs can serve as general-purpose semantic prior providers.
## Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐