LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Conference: ICCV 2025 | arXiv: 2507.00790 | Code: https://github.com/AMAP-ML/LD-RPS | Area: Image Generation | Keywords: Zero-shot image restoration, posterior sampling, latent diffusion, recurrent refinement, multimodal prior

TL;DR

LD-RPS proposes a zero-shot, dataset-free unified image restoration method that performs recurrent posterior sampling via a pretrained latent diffusion model. It leverages multimodal large language models for semantic priors and a learnable F-PAM module to align the degradation domain, achieving high-quality blind restoration across diverse degradation types.

Background & Motivation

Unified Image Restoration (UIR) aims to handle multiple degradation types (noise, low-light, haze, colorization, etc.) within a single model and represents an important direction in low-level vision. Existing methods face three major challenges:

Task-specific methods lack generalization: Traditional methods (e.g., Zero-DCE++, AOD-Net) design networks for specific degradations and cannot generalize to other types.

Supervised unified methods are constrained to closed sets: Methods such as AirNet, PromptIR, and DiffUIR are trained on specific datasets and exhibit significant performance degradation when encountering unseen degradation types.

Existing posterior sampling methods are unstable: Methods like GDP rely on pixel-level diffusion and explicit degradation modeling (\(y = Ax + B\)), which is unsuitable for complex real-world degradations.

An ideal unified restoration solution should simultaneously satisfy: (1) unsupervised — no reliance on labeled data; (2) dataset-free — no need for training data collection; and (3) generalizable — capable of handling unseen degradation types.

The authors' core insight is twofold: latent space is more suitable for posterior sampling than pixel space, as latent representations filter out redundant pixel information and degradation noise; and recurrent sampling is more stable than single-pass sampling, as using the previous iteration's result as the next initialization progressively improves quality.

Method

Overall Architecture

The inference pipeline of LD-RPS comprises three core components:

  1. MLLM Semantic Prior Generation: A multimodal large language model (e.g., GPT-4V) generates textual descriptions from low-quality images, which serve as text embeddings to guide the diffusion model.
  2. F-PAM (Feature and Pixel Alignment Module): A lightweight learnable network that bridges the degraded image domain and the diffusion model's generation domain.
  3. Recurrent Posterior Sampling: Extends single-pass posterior sampling into a multi-round recurrent refinement process.

Key Designs

1. Task-Blind Semantic Prior Generation

The method exploits the image understanding capability of MLLMs to extract semantic information from degraded images: a low-quality image and a hand-crafted prompt are fed to the MLLM → the MLLM generates a content description → the description is encoded as a text embedding \(c\) → the embedding guides the diffusion model toward the target content. This eliminates the need for manually specifying the degradation type.
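
A minimal sketch of this pipeline, assuming a Stable-Diffusion-style CLIP text encoder; `describe_image` and the prompt wording are hypothetical stand-ins for the paper's MLLM call:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

PROMPT = ("Describe the content of this image in one short sentence, "
          "ignoring any degradation such as noise, haze, or low light.")

def describe_image(image_path: str) -> str:
    # Hypothetical stand-in: in practice, send the image plus PROMPT to an
    # MLLM (e.g., GPT-4V) and return its content description.
    return "a dimly lit street with parked cars and storefronts"

# Text encoder of a Stable-Diffusion-style latent diffusion model
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

caption = describe_image("degraded.png")
tokens = tokenizer(caption, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    c = text_encoder(tokens.input_ids).last_hidden_state  # text embedding c
```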

2. F-PAM: Feature and Pixel Alignment Module

This is the central design for addressing the unique challenges of LD-RPS. Two gaps must be bridged:

  • Spatial gap: a dimensional mismatch between the latent space \(z\) and the image space \(x\)
  • Domain gap: a distributional discrepancy between the clean image domain and the degraded image domain

F-PAM structure: \(\psi[\tilde{z}_0, \tilde{z}_0'] = h_2(h_1(f[\tilde{z}_0, \tilde{z}_0'])) + p \odot h_1(f[\tilde{z}_0, \tilde{z}_0'])\)

where \(f\) is a frozen VAE decoder, \(h_1/h_2\) are learnable convolutional networks, and \(p\) is a learnable channel attention factor. F-PAM is jointly optimized with the reverse diffusion process using L2 loss + perceptual loss + GAN loss.
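
A minimal PyTorch sketch of this structure, reading \(f[\cdot,\cdot]\) as decoding both latents and concatenating them on channels; the channel widths and kernel sizes are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FPAM(nn.Module):
    """Sketch of ψ[z, z'] = h2(h1(f[z, z'])) + p ⊙ h1(f[z, z'])."""

    def __init__(self, vae_decoder: nn.Module, img_ch: int = 3, hidden: int = 64):
        super().__init__()
        self.f = vae_decoder                        # frozen VAE decoder f
        for param in self.f.parameters():
            param.requires_grad_(False)
        self.h1 = nn.Sequential(                    # learnable conv net h1
            nn.Conv2d(2 * img_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1))
        self.h2 = nn.Sequential(                    # learnable conv net h2
            nn.Conv2d(img_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1))
        self.p = nn.Parameter(torch.ones(1, img_ch, 1, 1))  # channel attention p

    def forward(self, z0: torch.Tensor, z0_prime: torch.Tensor) -> torch.Tensor:
        # Assumption: f[·,·] decodes each latent and concatenates on channels
        x = torch.cat([self.f(z0), self.f(z0_prime)], dim=1)
        a = self.h1(x)
        return self.h2(a) + self.p * a              # h2(h1(·)) + p ⊙ h1(·)
```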

3. Two-Stage Posterior Sampling

The reverse diffusion process is divided into two stages:

  • \(T \to t_1\) (early stage): only F-PAM is trained; \(g = 0\), with no intervention in the diffusion direction.
  • \(t_1 \to 0\) (late stage): F-PAM and the posterior direction are jointly optimized, correcting the sampling trajectory via the gradient \(g = \nabla_{z_t} \log p(y|\hat{z}_0)\).

The posterior loss includes:

  • Distance loss \(L\): L2 + perceptual loss + GAN loss (degraded-to-degraded domain alignment)
  • Quality loss \(Q\): brightness constraint + chrominance consistency constraint
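
A simplified sketch of the stage switch and the guidance gradient, under stated assumptions: `z0_from_zt` is a hypothetical callable returning the one-step clean-latent estimate \(\hat{z}_0\), `fpam` is any callable mapping that estimate to the degraded image domain (e.g., a single-argument wrapper around the F-PAM sketch above), and only the L2 part of \(L\) plus a brightness penalty from \(Q\) stand in for the full losses:

```python
import torch
import torch.nn.functional as F

def distance_loss(y_hat, y):
    # Stand-in for L (the paper adds perceptual and GAN terms)
    return F.mse_loss(y_hat, y)

def quality_loss(y_hat, target_brightness=0.5):
    # Stand-in for Q (chrominance consistency term omitted)
    return (y_hat.mean() - target_brightness).abs()

def posterior_gradient(z_t, t, t1, z0_from_zt, fpam, y):
    """Late-stage guidance: the sampler steps against a scaled version of
    this gradient, i.e. it follows g = ∇_{z_t} log p(y | ẑ0)."""
    if t > t1:                                # early stage T -> t1: g = 0
        return torch.zeros_like(z_t)
    z_t = z_t.detach().requires_grad_(True)
    z0_hat = z0_from_zt(z_t, t)               # one-step estimate ẑ0
    y_hat = fpam(z0_hat)                      # map into the degraded domain
    loss = distance_loss(y_hat, y) + quality_loss(y_hat)
    return torch.autograd.grad(loss, z_t)[0]
```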

4. Recurrent Refinement

The core idea is to re-encode and re-noise the restoration result \(x_0^{(i)}\) from iteration \(i\) to noise level \(\gamma T\), using it as the initialization for iteration \(i+1\). Each round starts from a lower noise level, yielding greater stability. The recurrence factor \(\gamma \in (0,1)\) controls the degree of re-noising.
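
A minimal sketch of the recurrence, assuming hypothetical callables: `sample_from(z, t_start)` runs a guided reverse process such as the one above, `encode` is the frozen VAE encoder, and `alpha_bar(t)` exposes the scheduler's cumulative \(\bar{\alpha}\) schedule:

```python
import torch

def recurrent_restore(sample_from, encode, alpha_bar, T=1000, gamma=0.5, rounds=3):
    """Each round re-encodes x0 from the previous round, re-noises it to
    step γT, and samples again from that lower noise level."""
    z_init, t_start = torch.randn(1, 4, 64, 64), T   # round 0: pure noise
    for _ in range(rounds + 1):
        x0 = sample_from(z_init, t_start)            # guided posterior sampling
        t_start = int(gamma * T)                     # later rounds start at γT
        z0 = encode(x0)                              # re-encode the result
        a = alpha_bar(t_start)                       # DDPM forward re-noising
        z_init = a**0.5 * z0 + (1 - a)**0.5 * torch.randn_like(z0)
    return x0
```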

Loss & Training

LD-RPS is a purely inference-time method: aside from the frozen pretrained latent diffusion model, nothing is trained offline. Online optimization does, however, occur during inference:

  • F-PAM training loss \(S_\psi\): L2 reconstruction + VGG perceptual + GAN adversarial
  • Posterior guidance loss \(L_\text{total}\): distance loss (L2 + perceptual + GAN) + quality loss (brightness + chrominance)
  • Type discriminator \(D_2\): distinguishes residuals between "clean–degraded" and "generated–degraded version" pairs (a minimal sketch follows this list)
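
A minimal sketch of such a residual discriminator; the architecture is an illustrative assumption, not the paper's exact network:

```python
import torch
import torch.nn as nn

class TypeDiscriminator(nn.Module):
    """D2 scores the residual between an image pair, not the raw images."""

    def __init__(self, img_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.net(a - b)   # classify the pair via its residual
```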

All experiments are conducted on NVIDIA H20 GPUs, and results are averaged over 3 random seeds.

Key Experimental Results

Main Results

Low-light Enhancement (LOLv1 dataset; B/D/U = task-blind / dataset-free / unsupervised):

| Method | B/D/U | PSNR↑ | SSIM↑ | LPIPS↓ | PI↓ | NIQE↓ |
|--------|-------|-------|-------|--------|-----|-------|
| DiffUIR | ✓/✗/✗ | 21.36 | 0.907 | 0.125 | 4.68 | 5.95 |
| ZERO-IG | ✗/✓/✓ | 17.22 | 0.794 | 0.184 | 4.92 | 6.22 |
| GDP | ✗/✓/✓ | 16.52 | 0.690 | 0.261 | 4.16 | 5.73 |
| TAO | ✓/✓/✓ | 15.84 | 0.757 | 0.363 | 6.34 | 8.79 |
| LD-RPS | ✓/✓/✓ | 17.45 | 0.804 | 0.277 | 4.79 | 5.52 |

Dehazing (RESIDE-HSTS dataset):

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|--------|-------|-------|--------|
| YOLY | 20.49 | 0.794 | 0.108 |
| GDP | 13.15 | 0.757 | 0.144 |
| TAO | 18.38 | 0.823 | 0.147 |
| LD-RPS | 21.45 | 0.813 | 0.177 |

Ablation Study

Effect of recurrence count (LOLv1 / RESIDE / Kodak24):

| Recurrence Count | LOLv1 PSNR↑ | RESIDE PSNR↑ | Kodak24 PSNR↑ |
|------------------|-------------|--------------|---------------|
| 0 | 16.78 | 19.35 | 27.75 |
| 1 | 17.21 | 20.38 | 28.60 |
| 2 | 17.73 | 20.83 | 28.26 |
| 3 | 17.10 | 21.60 | 28.49 |

The optimal recurrence count correlates with the degree of coupling between degradation and semantics: stronger coupling (e.g., dehazing) requires more iterations.

Ablation of text guidance:

| Setting | LOLv1 PSNR↑ | RESIDE PSNR↑ | Kodak24 PSNR↑ |
|---------|-------------|--------------|---------------|
| w/o Text | 16.03 | 19.63 | 28.13 |
| Full (w/ Text) | 17.73 (+1.70) | 21.60 (+1.97) | 28.60 (+0.47) |

Textual priors yield significant improvements across all tasks, most notably for dehazing (+1.97 PSNR).

Key Findings

  1. LD-RPS surpasses all posterior sampling baselines in the zero-shot setting: It outperforms GDP and TAO on low-light enhancement, dehazing, and denoising.
  2. Recurrent refinement is effective but not monotonically beneficial: An optimal recurrence count exists; excessive iterations may degrade quality.
  3. Textual priors are a critical performance factor: Semantic descriptions generated by the MLLM provide essential directional guidance for the diffusion model.
  4. F-PAM models degradation implicitly: Compared to GDP's explicit modeling (\(y = Ax + B\)), F-PAM adapts to complex nonlinear degradations.

Highlights & Insights

  1. Posterior sampling in latent space is a forward-looking idea: Compared to pixel space, latent space inherently suppresses degradation noise, making it naturally advantageous for restoration.
  2. MLLMs provide zero-shot semantic priors: The approach cleverly leverages large models' image understanding capabilities to compensate for the absence of degradation-type priors.
  3. Recurrent refinement is simple yet effective: A bootstrapping-style idea, it converts the instability of single-pass sampling into the stability of iterative refinement.
  4. Genuinely unified and zero-shot: Simultaneously satisfies task-blind, dataset-free, and unsupervised conditions.

Limitations & Future Work

  1. Slow inference speed: Recurrent sampling combined with online F-PAM training results in long processing times per image.
  2. Color bias: Color shifts still occur in certain scenarios, requiring quality loss \(Q\) as a constraint.
  3. Dependence on MLLM quality: The quality of textual priors depends on the MLLM's ability to understand degraded images, which may fail under severe degradation.
  4. Unstable GAN discriminator training: Online discriminator training may introduce instability.
  5. Lack of evaluation on super-resolution and deblurring: Validation is limited to enhancement, dehazing, denoising, and colorization; spatial degradation types are not covered.

Related Methods & Notes

  • GDP: A pixel-space diffusion posterior sampling method; the direct improvement target of LD-RPS.
  • TAO: A test-time adaptive diffusion method and another posterior sampling baseline.
  • DiffUIR / DA-CLIP: Supervised unified restoration methods, constrained to closed sets.
  • AirNet / PromptIR: Degradation-aware unified restoration methods requiring paired training data.
  • Insights: The combination of latent space + learnable degradation mapping + recurrent refinement constitutes a powerful paradigm for zero-shot restoration; MLLMs can serve as general-purpose semantic prior providers.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐