
🖼️ Image Restoration

📷 CVPR2025 · 41 paper notes

📌 Same area in other venues: 📷 CVPR2026 (135) · 🔬 ICLR2026 (61) · 🧪 ICML2026 (21) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (26) · 📹 ICCV2025 (31)

🔥 Top topics: Image Restoration ×15 · Super-Resolution ×6 · Diffusion Models ×6 · Adversarial Robustness ×3 · Self-Supervised Learning ×2

A Flag Decomposition for Hierarchical Datasets: This paper proposes Flag Decomposition (FD), an algorithm that decomposes hierarchically structured data into flag manifold representations (Stiefel coordinates) while preserving hierarchical relationships. It demonstrates advantages over standard methods like SVD in denoising, clustering, and few-shot learning tasks.
A Physics-Informed Blur Learning Framework for Imaging Systems: A physics-informed PSF learning framework is proposed, designing a new wavefront basis (where each basis only affects a single SFR direction) to eliminate gradient conflicts. Combined with curriculum learning (from center to periphery), it accurately estimates the spatially-varying PSF of imaging systems without requiring lens parameters.
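
The Flag Decomposition entry above names SVD as the standard baseline for denoising. For readers unfamiliar with that baseline, here is a minimal sketch of truncated-SVD denoising on a synthetic low-rank matrix. This is not the paper's Flag Decomposition algorithm; the function name `truncated_svd_denoise`, the matrix sizes, the rank, and the noise level are all illustrative assumptions.

```python
import numpy as np


def truncated_svd_denoise(noisy, rank):
    """Return the best rank-`rank` approximation of a 2-D data matrix.

    Hypothetical helper for illustration: it keeps the `rank` largest
    singular components and discards the rest, which removes most of the
    noise when the clean data is (approximately) low rank.
    """
    u, s, vt = np.linalg.svd(noisy, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Synthetic example (assumed setup): a rank-3 matrix plus Gaussian noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

denoised = truncated_svd_denoise(noisy, rank=3)

# Frobenius-norm errors before and after denoising.
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
print(f"error before: {err_noisy:.3f}, after: {err_denoised:.3f}")
```

The sketch treats the data as one flat matrix and ignores any hierarchy among its columns, which is the structure the Flag Decomposition entry says the proposed method preserves.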

EQ-Reg: A Regularization-Guided Equivariant Approach for Image Restoration

AdcSR: Adversarial Diffusion Compression for Real-World Image Super-Resolution: An Adversarial Diffusion Compression (ADC) framework is proposed to distill the one-step diffusion model OSEDiff into a streamlined diffusion-GAN hybrid model. This achieves a 73% reduction in inference time, a 78% reduction in computational cost, and a 74% reduction in parameters while maintaining generative quality, reaching real-time super-resolution at 34.79 FPS.
Augmenting Perceptual Super-Resolution via Image Quality Predictors: No-reference image quality assessment (NR-IQA) models are leveraged to replace human annotations. By improving perceptual super-resolution quality through weighted sampling and direct optimization, the proposed method outperforms state-of-the-art methods that rely on human feedback, without requiring any human-labeled data.
Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable: This paper revisits classic video denoising methods and integrates them with modern ML tools to achieve video denoising that is robust, fast, and controllable in noise level.
Complexity Experts are Task-Discriminative Learners for Any Image Restoration: MoCE-IR is proposed to replace the uniform architecture of traditional MoEs with "complexity experts" possessing varying computational complexities and receptive field sizes. Complemented by a spring-like routing mechanism biased towards low complexity, it unexpectedly achieves task-discriminative allocation—different degradation types are automatically routed to experts of appropriate complexity, allowing irrelevant experts to be bypassed during inference.
DarkIR: Robust Low-Light Image Restoration: DarkIR proposes an efficient CNN-based multi-task low-light image restoration method. The encoder uses SpAM+FreMLP (frequency magnitude enhancement) to handle illumination, while the decoder utilizes Di-SpAM (dilated spatial attention) to handle blur. With an asymmetric design, it achieves 27.30dB PSNR on LOLBlur with only 3.31M parameters.
Degradation-Aware Feature Perturbation for All-in-One Image Restoration: This paper proposes the DFPIR framework, which adapts the feature space between the encoder and decoder to fit a unified parameter space through two mechanisms: degradation type-guided channel shuffle perturbation and selective attention mask perturbation. It achieves state-of-the-art (SOTA) performance across five distinct tasks, including denoising, dehazing, deraining, deblurring, and low-light enhancement.
Detail-Preserving Latent Diffusion for Stable Shadow Removal: This paper proposes a two-stage Stable Diffusion fine-tuning scheme for shadow removal: In the first stage, the denoiser is fine-tuned in the latent space to perform primary shadow removal. In the second stage, a shadow-aware Detail Injection module extracts features from the VAE encoder to modulate the decoder, recovering the high-frequency details lost in the first stage and achieving high-quality and highly generalizable shadow removal.
DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables: This work proposes DnLUT, an ultra-efficient color image denoising framework based on lookup tables (LUTs). By employing a Pairwise Channel Mixer (PCM) to capture inter-channel correlation and an L-shaped convolution kernel to expand the receptive field, DnLUT achieves state-of-the-art LUT denoising performance with only 500KB of storage and 0.1% of the energy consumption of DnCNN.
DPIR: Dual Prompting Image Restoration with Diffusion Transformers: This paper proposes DPIR, the first image restoration method based on the Diffusion Transformer (SD3). Using a lightweight low-quality image conditioning branch and a dual visual-textual prompting control branch, DPIR improves restoration quality and fidelity in terms of both global context and local appearance.
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation: This paper proposes an Audio-Pose Dynamic Harmonization (APDH) strategy to progressively shift control from full-body poses to audio—gradually removing keypoints (retaining hands) while expanding the audio control scope (from lips to the full body). This secures high-quality semi-body animation driven only by audio, a reference image, and hand poses.
Efficient Diffusion as Low Light Enhancer (ReDDiT): ReDDiT is proposed to distill diffusion-based low-light enhancement from 10+ steps down to 2-4 steps. By correcting fitting errors via linear extrapolation and refining trajectories with Retinex-decomposed reflectance to bridge the inference gap, it achieves state-of-the-art (SOTA) performance across 10 benchmarks in just 4 steps.
Efficient Visual State Space Model for Image Deblurring: This paper proposes EVSSM, which efficiently captures non-local information by applying alternating geometric transformations (transpose/flip) before unidirectional SSM scanning, and designs an efficient discriminative frequency-domain FFN (EDFFN) to enhance local details. It outperforms existing SSM methods and achieves SOTA on image deblurring tasks with only 1/4 of the computational cost.
FiRe: Fixed-points of Restoration Priors for Solving Inverse Problems: This paper proposes the FiRe framework, which formulates explicit image priors using fixed-point theory by composing general-purpose image restoration models (e.g., deblurring, super-resolution, inpainting) with their training degradation operators. This generalizes traditional Plug-and-Play (PnP) beyond denoiser-only priors and supports the ensemble of multiple restoration models, significantly outperforming existing PnP and diffusion-based approaches on various inverse problems.
Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise: This paper proposes Generalized R2R (GR2R), generalizing the original self-supervised denoising framework Recorrupted-to-Recorrupted (R2R) from Gaussian noise to natural exponential family (NEF) distributions—including Poisson, Gamma, and Binomial noise. It proves that the GR2R loss is an unbiased estimator of supervised loss, with SURE being its special case, achieving performance close to supervised learning in applications like low-light imaging and SAR.
Gyro-based Neural Single Image Deblurring: This paper proposes GyroDeblurNet, which represents complex hand shake through a novel camera motion field embedding. It features a gyro refinement module that utilizes image blur information to correct gyro errors, and a gyro deblurring module that removes blur using the corrected motion information. Combined with a curriculum learning strategy, GyroDeblurNet significantly outperforms existing methods on both synthetic and real-world datasets.
HVI: A New Color Space for Low-light Image Enhancement: This paper proposes a new color space, HVI (Horizontal/Vertical-Intensity), which eliminates red artifacts through polarized HS mapping and compresses black artifacts in dark regions using a learnable intensity component. Combined with the decoupling network CIDNet, it outperforms existing state-of-the-art low-light enhancement methods across 10 datasets.
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations: INFP proposes a unified, audio-driven interactive head generation framework. By utilizing dual-track audio (agent + conversational partner), the framework naturally drives the agent to switch between speaking and listening states without manual role assignment or explicit role switching, while introducing the large-scale DyConv dataset to support research in this field.
Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing: IPC-Dehaze proposes an iterative Predictor-Critic decoding framework based on a VQGAN codebook prior. By utilizing a Code-Critic to evaluate the inter-relations among codebook sequences to determine which codes should be retained or resampled, the framework achieves progressive, easy-to-hard dehazing from clear regions to dense haze regions, significantly surpassing state-of-the-art methods in real-world scenarios.
MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration: This work proposes MaIR. Its key innovations include a Nested S-shaped Scanning (NSS) strategy that preserves locality via stripe partitioning and continuity via S-shaped paths, and a Sequence Shuffle Attention (SSA) module that intelligently aggregates sequences from different scanning directions using channel-level attention. MaIR achieves state-of-the-art (SOTA) performance across 14 datasets in four major tasks: super-resolution, denoising, deblurring, and dehazing.
MambaIRv2: Attentive State Space Restoration: This work proposes MambaIRv2, which injects learnable prompts into the output matrix \(\mathbf{C}\) of Mamba via the Attentive State-space Equation (ASE) to enable attention-like non-causal global querying. It also introduces Semantic Guided Neighboring (SGN) to rearrange sequences according to semantic labels, alleviating long-range decay. Requiring only a single-direction scan, it outperforms multi-directional methods, surpassing SRFormer by 0.35dB on lightweight SR with 9.3% fewer parameters.
One-Step Event-Driven High-Speed Autofocus: An Event Laplacian Product (ELP) focus detection function is proposed, which combines event data and intensity Laplacian information to reformulate the focus search as a detection task. This achieves event-driven one-step autofocus for the first time, reducing focus time by two-thirds and decreasing autofocus errors by a factor of 22-24.
PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution: PIDSR proposes a framework for joint complementary optimization of polarized image demosaicing (PID) and polarization image super-resolution (PISR). Utilizing a two-stage recurrent pipeline (spatial-physical coherent reconstruction + polarization-aware resolution enhancement) and a Stokes-assisted network, it directly reconstructs high-quality high-resolution polarization images from CPFA raw images, significantly reducing errors in DoP and AoP.
Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach: PiSA-SR is proposed, which decouples pixel-level regression and semantic-level enhancement into two independent weight spaces via a dual-LoRA module. This achieves single-step diffusion for high-quality super-resolution and supports flexible adjustment of fidelity and perceptual quality at inference time using two guidance scales.
PolarFree: Polarization-based Reflection-Free Imaging: A large-scale RGB-polarization dataset, PolaRGB, consisting of 6,500 pairs is constructed, and a two-stage network, PolarFree, is proposed. The network first employs a conditional diffusion model to generate a reflection-free prior and then uses a de-reflection backbone network to separate the transmission layer. This approach outperforms previous methods by approximately 2dB PSNR in polarization-guided reflection removal.
POLISH'ing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction with Application to Strong Lens Discovery: Based on the POLISH framework, POLISH+/++ is proposed with two key improvements—patch-wise training + stitched inference and arcsinh non-linear transformation—enabling deep learning methods to handle wide-field (\(12,960 \times 12,960\) pixels) and high-dynamic-range (\(\sim 10^6\)) radio interferometric imaging for the first time, while demonstrating a \(10\times\) potential increase in strong gravitational lensing discovery through super-resolution.
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models: NaviBridger introduces Denoising Diffusion Bridge Models (DDBM) to visual navigation tasks, replacing Gaussian noise with information-rich prior actions as the denoising starting point. It theoretically proves that a source distribution closer to the target distribution yields a lower error upper bound, and designs three prior strategies (Gaussian, rule-based, and learning-based) to accelerate inference and surpass baselines in both indoor/outdoor simulations and real-world scenarios.
Progressive Focused Transformer for Single Image Super-Resolution: PFT proposes a Progressive Focused Attention (PFA) mechanism, which transfers the Hadamard product of attention maps between adjacent Transformer layers to filter out irrelevant tokens layer-by-layer and enhance the weights of key tokens, achieving state-of-the-art performance on super-resolution tasks while significantly reducing computational overhead.
Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging: The authors propose ProxUnroll, which trains HQS/ADMM unrolling networks by designing a proximal trajectory (PT) loss function. This forces the deep image restorer (DIR) within the network to approximate the proximal operator of an ideal regularization, thereby equipping the unrolling network with both the flexibility of PnP algorithms (a single model for arbitrary compression ratios) and the high accuracy and fast speed of unrolling networks.
QMambaBSR: Burst Image Super-Resolution with Query State Space Model: QMambaBSR is proposed to achieve joint sub-pixel extraction and noise suppression through inter-frame query and intra-frame scanning using the Query State Space Model (QSSM). Combined with an adaptive upsampling module, it achieves SOTA performance on both synthetic and real burst image super-resolution tasks.
Reversible Decoupling Network for Single Image Reflection Removal: RDNet proposes a single image reflection removal method based on a reversible decoupling architecture. A multi-column reversible encoder ensures lossless transmission of multi-scale semantic information during forward propagation, and a transmission-rate-aware prompt generator adaptively handles varying reflection intensities. It outperforms state-of-the-art methods on five benchmark datasets and won the NTIRE 2025 challenge.
Rotation-Equivariant Self-Supervised Method in Image Denoising: This work introduces rotation-equivariant convolutions to self-supervised image denoising for the first time. It rigorously analyzes the impact of up/downsampling operators on equivariance, provides the equivariant error bounds of the complete U-Net architecture, and proposes an adaptive rotation-equivariant network, AdaReNet. Through a learning-based mask fusion module, AdaReNet automatically determines which regions of an image are better suited for the rotation-equivariant network, achieving consistent performance improvements across three classic self-supervised methods: N2N, N2V, and R2R.
SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow Removal: This work proposes the SoftShadow framework, which replaces traditional binary hard masks with continuous grayscale soft masks to represent shadow regions. It predicts soft masks via SAM+LoRA and introduces a penumbra formation constraint loss to jointly train the detection and shadow removal networks, achieving SOTA performance on four datasets (SRD, ISTD+, LRSS, UIUC) without requiring external mask inputs.
Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images: DehazeXL proposes an end-to-end framework for large image dehazing. By splitting the input image into fixed-size patches and encoding them into tokens, it leverages an efficient global attention module to fuse contextual information. This enables inference on 10240×10240 images with only 21GB of VRAM and achieves state-of-the-art (SOTA) performance on a self-built 8K dehazing dataset.
OptiFusion: Towards Universal Computational Aberration Correction in Photographic Cameras: By extending OptiFusion to automatically design 120 diverse lenses, this work proposes the ODE comprehensive evaluation metric and a large-scale benchmark. Systematically comparing 24 algorithms, it reveals that CNN models provide the best speed-accuracy trade-off for aberration correction, counter-intuitively outperforming Transformers.
URWKV: Unified RWKV Model with Multi-State Perspective for Low-Light Image Restoration: This paper proposes the URWKV model, which introduces a multi-state (intra-stage and inter-stage) perspective into the RWKV architecture. Through Lightness-Adaptive Normalization (LAN), State-aware Quad-directional Token Shift (SQ-Shift), and State-aware Selective Fusion (SSF) modules, a unified model is developed to handle the dynamically coupled degradations (noise, low-light distortion, and motion blur) in low-light images. With only 2.25M parameters, the proposed model comprehensively outperforms existing methods across 8 benchmark datasets.
Variational Garrote for Sparse Inverse Problems: This paper systematically compares the performance of \(\ell_1\) regularization (LASSO) and Variational Garrote (VG, a probabilistic approximation of \(\ell_0\)) across three inverse problems: signal resampling, denoising, and sparse-view CT reconstruction. It is demonstrated that VG typically achieves lower generalization errors in highly underdetermined regimes (such as low sampling rates or highly sparse angles) because the spike-and-slab prior aligns better with the true sparse distribution.
Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks: This paper proposes VLU-Net, the first All-in-One Deep Unfolding Network (DUN) framework. It utilizes a fine-tuned CLIP model to automatically detect degradation types and guide the gradient descent module. Combined with a hierarchical feature unfolding structure, VLU-Net outperforms the state-of-the-art end-to-end method by 3.74dB on image dehazing.
Visual-Instructed Degradation Diffusion for All-in-One Image Restoration: Defusion proposes replacing textual instructions with "visual instructions" to guide all-in-one image restoration. By applying degradation effects to standardized visual elements, it constructs visual degradation descriptions. It performs diffusion denoising in the degradation space (instead of the image space), surpassing both task-specific and all-in-one methods across 8 restoration tasks.
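
Several summaries above report results in dB of PSNR (for example 27.30dB on LOLBlur, or gains of 0.35dB, 2dB, and 3.74dB). As a reading aid, here is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE). The function name `psnr` and the `max_value=1.0` default (images scaled to [0, 1]) are assumptions for illustration; individual papers may compute the metric on a specific channel, crop, or value range.

```python
import numpy as np


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    `max_value` is the assumed peak pixel value (1.0 for [0, 1] images,
    255 for 8-bit images). Returns infinity for identical inputs.
    """
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy check: a constant error of 0.1 gives MSE = 0.01, i.e. 20 dB.
reference = np.zeros((4, 4))
restored = np.full((4, 4), 0.1)
print(round(psnr(reference, restored), 6))
```

Because the scale is logarithmic, a gain of x dB at a fixed peak value corresponds to an MSE that is lower by a factor of 10^(x/10). Under that reading, a 3.74dB gain means roughly 2.4 times lower MSE.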