🖼️ Image Restoration

📷 CVPR2026 · 107 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (61) · 🧪 ICML2026 (21) · 🤖 AAAI2026 (10) · 🧠 NeurIPS2025 (26) · 📹 ICCV2025 (31)

🔥 Top topics: Image Restoration ×31 · Super-Resolution ×22 · Diffusion Models ×21 · Adversarial Robustness ×6 · Self-Supervised Learning ×6

2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition

This paper proposes a "two shots are enough" sensor noise synthesis method that requires only one noisy image and one dark frame per ISO. It synthesizes signal-independent noise as a texture using random phase sampling in the Fourier domain, complemented by iterative histogram matching to correct marginal distributions. This allows the generation of an effectively unlimited variety of training pairs without large-scale paired datasets, enabling denoising networks to achieve SOTA performance among physics-based methods on several low-light benchmarks.
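
The two ingredients named above, random phase sampling and histogram matching, can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the alternating loop, and the choice of three iterations are all hypothetical.

```python
import numpy as np

def match_histogram(sample, reference):
    """Give `sample` exactly the value distribution of `reference` (rank mapping)."""
    order = np.argsort(sample, axis=None)
    matched = np.empty(sample.size, dtype=float)
    matched[order] = np.sort(reference, axis=None)
    return matched.reshape(sample.shape)

def synthesize_noise(dark_frame, rng, iterations=3):
    """Draw a new noise texture with the dark frame's spectrum magnitude and histogram.

    Assumption: the dark frame is treated as one sample of signal-independent noise.
    """
    dark = np.asarray(dark_frame, dtype=float)
    mean = dark.mean()
    magnitude = np.abs(np.fft.fft2(dark - mean))
    # The phase of the FFT of real white noise is a valid random phase:
    # it is antisymmetric, so the inverse FFT below is (numerically) real.
    phase = np.angle(np.fft.fft2(rng.standard_normal(dark.shape)))
    for _ in range(iterations):
        sample = np.fft.ifft2(magnitude * np.exp(1j * phase)).real + mean
        sample = match_histogram(sample, dark)        # correct the marginal
        phase = np.angle(np.fft.fft2(sample - mean))  # keep phase, re-impose magnitude
    return sample

rng = np.random.default_rng(0)
dark_frame = rng.standard_normal((16, 16))  # stand-in for a measured dark frame
new_noise = synthesize_noise(dark_frame, rng)
```

Each call with a fresh random phase yields a different texture, which is what makes the supply of training pairs effectively unlimited.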

AceTone: Bridging Words and Colors for Conditional Image Grading

AceTone is presented as the first unified framework for multimodal conditional color grading, accepting either text or reference images as conditions. It compresses each 3D LUT into 64 discrete tokens via a VQ-VAE, trains a VLM to predict LUT token sequences, and uses GRPO reinforcement learning to align color similarity and aesthetic preferences, achieving a 50% LPIPS improvement in style transfer and instruction-based grading.

Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

The IQPIR framework is proposed, which introduces Image Quality Priors (IQP) from pre-trained NR-IQA models as conditioning signals. Through three mechanisms—a quality-conditioned Transformer, a dual-codebook structure, and quality optimization in discrete representation space—the model guides the restoration process toward the highest perceptual quality, comprehensively outperforming SOTA on tasks such as blind face restoration.

Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

This paper challenges the convention that Infrared and Visible Image Fusion (IVIF) must be trained on "strictly aligned paired data." It proposes the Arbitrarily Paired Training Paradigm (APTP)—freely recombining \(N\) pairs of base data into \(N^2\) cross-modal pairs, equipped with a set of adaptively weighted pixel-level self-supervised losses. Trained on only 150 pairs of content-inconsistent data, it approaches the fusion performance of models trained on 100 times the amount of strictly paired data.
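
The recombination of \(N\) base pairs into \(N^2\) cross-modal pairs is a Cartesian product. A minimal sketch, with hypothetical file names and without the adaptively weighted losses:

```python
from itertools import product

def arbitrary_pairs(infrared, visible):
    """Recombine N (infrared, visible) pairs into all N*N cross-modal pairs."""
    return list(product(infrared, visible))

# Hypothetical placeholders for three base pairs.
infrared = ["ir_0", "ir_1", "ir_2"]
visible = ["vis_0", "vis_1", "vis_2"]
pairs = arbitrary_pairs(infrared, visible)
print(len(pairs))  # 9
```

With the 150 base pairs mentioned above, the same product gives 150 × 150 = 22,500 training pairs.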

Beyond the Ground Truth: Enhanced Supervision for Image Restoration

This paper proposes enhancing the perceptual quality of sub-optimal GT images in existing datasets through super-resolution combined with frequency-adaptive mixing. It introduces a lightweight ORNet refinement module that can be trained to improve the perceptual quality of outputs from pre-trained restoration models without architectural modifications.

BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting

Starting from a single blurry EHT black hole image, BHCast performs super-resolution and long-term autoregressive prediction (stable for 100 steps) via a U-Net dynamics surrogate model. Physical features (rotation speed, pitch angle, etc.) are extracted from the predicted plasma dynamics, and black hole spin and inclination are inferred using XGBoost, demonstrating effectiveness on real M87* observational images.

Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement

This work integrates "low-light to normal-light" enhancement and "normal-light to low-light" degradation into a single symmetric diffusion bridge. By training a shared U-Net with a bidirectional consistency constraint as implicit regularization, the model significantly outperforms existing SOTA in fidelity (PSNR/LPIPS).

BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

To address the issues of event streams being contaminated by BA noise and the separation of denoising from enhancement in event-aided low-light enhancement, BiEvLight reformulates event denoising from a static preprocessing step into a task-aware bi-level optimization problem. This allows the enhancement gain from the lower level to calibrate the upper-level denoising, supplemented by a spatially adaptive denoising prior guided by image gradients. It achieves an average gain of 1.30dB PSNR / 0.047 SSIM on the real-world SDE dataset.

BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery

To address the severe performance drop when large diffusion models trained on synthetic data are applied to real scenes, BiProLoRA first calibrates the VAE path to the real degradation distribution via self-supervised distribution fidelity learning. It then formulates "LoRA for structure recovery and Prompt for degradation-aware modulation" as a bilevel (hyperparameter optimization) problem for joint training. Using real data equivalent to only 10% of the synthetic data volume, it surpasses SOTA across five no-reference metrics in low-light, dehazing, and underwater tasks.

BluRef: Unsupervised Image Deblurring with Dense-Matching References

This paper proposes BluRef, the first unsupervised framework that uses unpaired sharp reference images, via dense matching, to generate pseudo ground truth for training deblurring networks, achieving performance close to or even surpassing that of supervised methods.

Bridging the Perception Gap in Image Super-Resolution Evaluation

A large-scale user study reveals that existing SR evaluation metrics (PSNR, SSIM, LPIPS, etc.) are severely inconsistent with human perception. After analyzing these inherent defects, a minimalist yet effective framework called Relative Quality Index (RQI) is proposed. By learning the relative quality difference between image pairs, RQI achieves more reliable SR evaluation and can serve as a loss function to guide SR training.

CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation

CanonCGT decomposes "reference-based color grading" into two steps—first using a canonicalizer to "wash" the input image into a style-neutral "canonical pivot," and then using a grader to apply the tone of the reference image. Combined with a two-phase supervised and self-supervised training strategy (DP-CGT), PSNR on 6 datasets improved from the second-best 18.62 to 28.99, resulting in significantly more stable and natural results.

CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness

CASR decomposes "arbitrarily large-scale super-resolution (SR)" into a sequence of small-scale upsampling cycles that "always fall within the training distribution." Using a single model iteratively with two specific modules—the Superpixel Structure Alignment Module (SSAM) to suppress distribution drift during cycles, and the Self-Similarity Aware Refinement Module (SARM) to ensure texture consistency in patch-based reconstruction—it significantly outperforms existing arbitrary-scale methods in perceptual metrics like LPIPS and MUSIQ under extreme \(\times 8\) to \(\times 30\) magnification.
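
One way to picture the decomposition is as a schedule of equal per-cycle factors, each capped at an in-distribution limit. The sketch below is an assumption for illustration only: the summary does not state how CASR chooses its steps, and the cap of ×4 and the function name are hypothetical.

```python
import math

def cycle_schedule(target_scale, max_step_scale=4.0):
    """Split one large magnification into equal steps, each <= max_step_scale."""
    if target_scale < 1.0 or max_step_scale <= 1.0:
        raise ValueError("need target_scale >= 1 and max_step_scale > 1")
    cycles = 1
    # Smallest number of cycles whose combined reach covers the target.
    while max_step_scale ** cycles < target_scale * (1.0 - 1e-9):
        cycles += 1
    step = target_scale ** (1.0 / cycles)
    return [step] * cycles

schedule = cycle_schedule(30.0)
print(len(schedule), round(schedule[0], 3))
```

For a ×30 target this gives three cycles of roughly ×3.1 each, so every individual pass stays below the assumed ×4 limit.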

ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

ColorFLUX decouples "structure preservation" and "color completion" into two mutually frozen training stages. This allows the FLUX generative diffusion model to learn accurate semantic colorization without interference from structural tasks. Subsequently, a coarse-to-fine progressive DPO post-training corrects fading unique to old photos, surpassing existing open-source and closed-source commercial models on both synthetic and real old photos.

Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising

This paper (CANC) observes that, after adding synthetic noise to a noisy image and applying the Noisier2Noise correction, the variance of the denoised output is a convex function of the ratio \(k\) (synthetic noise variance / real noise variance), reaching its minimum at \(k=1\), where the synthetic noise exactly matches the real noise. Using a network conditioned on the synthetic noise variance together with a ternary search, it locates this minimum and thereby estimates the noise level \(\sigma_N\) without clean images or prior knowledge of the noise level. The estimate is then used to synthesize pairs for supervised training, enabling self-supervised denoising to match or even slightly exceed the performance of "noise-level-known" supervised models.
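
Ternary search is a natural fit for a convex curve because comparing two interior points is enough to discard a third of the interval. The sketch below uses a toy quadratic as a stand-in for the measured output variance; in the method described above, each evaluation would instead require a network forward pass.

```python
def ternary_search_min(f, lo, hi, tol=1e-4):
    """Locate the minimizer of a convex (unimodal) function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2   # the minimum cannot lie in (m2, hi]
        else:
            lo = m1   # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2

def toy_variance(k):
    """Hypothetical convex curve with its minimum at k = 1."""
    return (k - 1.0) ** 2 + 0.3

k_star = ternary_search_min(toy_variance, 0.1, 4.0)
```

The interval shrinks by a factor of 2/3 per iteration, so the number of evaluations grows only logarithmically with the required precision.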

Coordinate Denoising for Non-Equilibrium Molecular Representation Learning

Addressing the flaw that the conclusion "coordinate denoising is equivalent to force field learning" only holds in equilibrium, this paper derives a denoising target NDeM valid for any conformation using second-order finite differences of the potential energy surface. This is implemented as a plug-and-play auxiliary task without pre-training, consistently improving force prediction accuracy for various equivariant GNNs on MD17, QM9, and OC20.

Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration

To address the performance drop of All-in-One Image Restoration (AiOIR) models when test-time degradation distributions deviate from training data, this paper proposes DCTTA. It utilizes a diffusion degradation generator at test time to learn the mapping from "pseudo-clean images to degraded images," constructing "degradation–re-degradation" self-supervised pairs. The model is fine-tuned online based on restoration consistency, while updating only degradation-sensitive parameters to preserve pre-trained knowledge. This approach achieves a PSNR gain of up to +4.57 dB on the Rain100H dataset.

Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios

Aiming at multimodal image fusion where source images are commonly degraded by noise, blur, or low resolution in real-world scenarios, this paper transforms the diffusion model from "explicit noise prediction" to "direct regression of the fused image." It incorporates a "Joint Observation Correction" step within DDIM sampling, which integrates dual degradation constraints and fusion constraints into a single matrix. This allows simultaneous restoration and fusion within a few sampling steps, significantly outperforming "restoration-then-fusion" cascaded schemes across various degradation scenarios on M3FD and Harvard datasets.

DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging

Addressing the pain points of high-resolution video Snapshot Compressive Imaging (SCI), where "full-frame reconstruction consumes excessive memory while backgrounds dominate but lack information," DetectSCI proposes a workflow that performs object detection directly on encoded measurements and reconstructs only the Regions of Interest (ROI) based on detected boxes. Its detector utilizes weight-sharing Mamba-Implicit modules to counter spatio-temporal aliasing and Frequency Mamba to recover suppressed high-frequency details, achieving 80.9 AP on a modified SportsMOT SCI dataset, outperforming the best CNN detector by \(\ge 2.8\) AP and the best Transformer detector by \(\ge 4.1\) AP.

PNG: Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

PNG proposes using learnable Global/Local Prompt components to automatically extract noise features from real noise (replacing metadata like ISO/camera model). By encoding noise into a latent space via a Prompt AutoEncoder and using Prompt DiT (based on consistency models) for single-step latent code generation, it achieves metadata-free real sRGB noise synthesis. Downstream DnCNN denoising on SIDD lags behind real data by only 0.08dB.

Disentangled Textual Priors for Diffusion-based Image Super-Resolution

DTPSR is proposed to achieve superior perceptual quality in diffusion-based super-resolution by disentangling textual priors across two dimensions: spatial hierarchy (global/local) and frequency semantics (low/high), integrated through a decoupled cross-attention pipeline and a multi-branch CFG strategy.

Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus

This paper operationalizes the hypothesis that "hazy image features across different perceptual domains (spatial, frequency, non-local, diffusion, compressive sensing) share the same scattering semantic core" into a Cross-domain Invariant Manifold (CIM). Using contrastive learning driven by consensus density, multi-domain features are aligned into a unified latent space. Additionally, a physically-guided HSV disentanglement network is integrated to specifically decouple color channel interference caused by haze. This approach simultaneously addresses "haze feature misjudgment" and "color distortion," achieving SOTA performance on multiple real/synthetic benchmarks with the fastest inference speed (0.062s).

DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution

DNF-SR feeds "noisy LR + original LR" dual-paths into an image editing diffusion model (Flux-Kontext) for one-step super-resolution at an intermediate timestep. It further employs Negative-aware Feature Fine-Tuning (NF²T), which moves preference optimization from the latent space to the image/feature space, achieving state-of-the-art results in no-reference metrics across four real-world SR benchmarks.

DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images

DPGF-Net utilizes the dual encoders of Re-IQA to extract a "distortion prior Qmap" and a "content prior Cmap" to decouple rendering distortions from semantic content. Combined with a single text template and a dual-path adaptive fusion of "local TCPGA + global FIM," the model assesses both "perceptual quality" and "text-image alignment" within a unified CLIP framework, achieving 11 first-place results across 12 metrics on three AGIQA datasets and cross-dataset benchmarks.

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

DreamSR utilizes a dual-branch "global + local" MM-ControlNet to inject patch-level text prompts into a FLUX-based (DiT) super-resolution model. Combined with a one-step restoration LoRA and receptive-field enhanced training, it specifically addresses the "over-generation caused by local patch and global prompt semantic mismatch" during patch-wise inference for ultra-high-definition (\(\ge\) 4K) images, achieving SOTA on no-reference metrics across multiple real-world datasets.

Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution

This paper proposes LapNet, which integrates a "row/column dual-graph Laplacian prior + deep implicit prior" into a unified variational model. Through ADMM, closed-form updates are derived and unfolded into an interpretable multi-stage network. It reduces graph construction complexity from \(O(H^3W^3)\) to \(O(H^3+W^3)\) while achieving SOTA performance in Guided Depth Super-Resolution (GDSR) with only 3.84M parameters.

Dynamic Exposure Burst Image Restoration

DEBIR integrates "predicting optimal exposure time for each burst frame" as a learnable module into the burst restoration pipeline for the first time. BAENet predicts the exposure time for each frame based on preview images, gain, and motion magnitude. A burst simulator differentiable with respect to exposure time connects it with the restoration network for end-to-end training. In low-light scenarios, the restoration PSNR is 0.28 dB higher than that of fixed exposure settings, and the effectiveness is validated on a real dual-camera system.

Edit-aware RAW Reconstruction

Addressing the mismatch where the true objective of RAW reconstruction is downstream post-editing while existing methods only optimize pixel-wise RAW fidelity, this paper proposes a plug-and-play edit-aware loss. By utilizing a differentiable, modular, and randomly parameterized simplified ISP to render both ground truth and reconstructed RAW to sRGB for error calculation, the reconstruction results become more robust under various rendering styles and edits, achieving an sRGB PSNR gain of up to 1.5–2 dB under multiple editing conditions.
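
The idea of an edit-aware loss can be sketched as: sample random rendering parameters, push both RAW images through the same simplified ISP, and compare in sRGB. Everything specific below is an assumption for illustration (the three ISP stages, the parameter ranges, the L1 error, and a demosaiced `(H, W, 3)` input in `[0, 1]`). NumPy is used only for readability; a trainable version would need an autodiff framework so that the ISP is differentiable.

```python
import numpy as np

def random_isp_params(rng):
    """Hypothetical parameter ranges for a three-stage toy ISP."""
    return {
        "gain": rng.uniform(0.5, 2.0),             # exposure / digital gain
        "wb": rng.uniform(0.8, 1.25, size=3),      # per-channel white balance
        "gamma": rng.uniform(1.8, 2.6),            # tone curve exponent
    }

def simple_isp(raw_rgb, params):
    """Render linear RAW (H, W, 3) to a display-referred image."""
    img = np.clip(raw_rgb * params["gain"] * params["wb"], 0.0, 1.0)
    return img ** (1.0 / params["gamma"])

def edit_aware_loss(raw_pred, raw_gt, rng, n_samples=4):
    """Mean L1 error after rendering both images with the same random ISPs."""
    losses = []
    for _ in range(n_samples):
        params = random_isp_params(rng)
        diff = simple_isp(raw_pred, params) - simple_isp(raw_gt, params)
        losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
raw_gt = rng.uniform(0.0, 1.0, size=(8, 8, 3))
raw_pred = np.clip(raw_gt + 0.05 * rng.standard_normal(raw_gt.shape), 0.0, 1.0)
loss = edit_aware_loss(raw_pred, raw_gt, rng)
```

Because the parameters are resampled on every call, a reconstruction cannot lower the loss by fitting a single rendering style.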

Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

For the super-resolution task involving "low-resolution hyperspectral image (LR HSI) + one unregistered high-resolution reference image," this paper utilizes spectral unmixing to decouple spatial and spectral information. This allows the network to focus solely on enhancing the unmixed abundance maps (rather than performing direct spatial-spectral coupled fusion, which is susceptible to misalignment interference). Combined with coarse-to-fine deformable aggregation, spatial-channel abundance cross-attention, and modulated fusion modules, the method achieves SOTA performance on ICVL/REAL datasets with approximately half the parameters (PSNR 41.84/42.05 dB at \(\times 4\)).

Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations

To address the pervasive issue in existing event-based deblurring where "fixed-weight kernels" are used to aggregate sparse events into event frames, failing to adapt to local motion variations, this paper proposes a learnable 3D Gaussian event representation. It adaptively samples spatiotemporal coordinates based on blurry image content and event density, aggregates events using 3D Gaussian kernels, and employs a two-stage fusion network (local detail enhancement + 1D Gaussian global alignment). The method consistently outperforms SOTA on GoPro, HS-ERGB, and REBlur datasets in terms of PSNR.

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

EIC-LIE lets event signals (providing HDR details) and image illumination priors (providing global brightness) reinforce each other through a bidirectional interaction module featuring "forward aggregation + backward injection with reused attention matrices." It uses image illumination statistics to drive a dynamic event filter for noise suppression and introduces RLE, the first \(1024 \times 768\) high-resolution real-world event low-light dataset. It outperforms SOTA by over 1.24dB PSNR across five datasets.

EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

This paper proposes EVLF, a plug-and-play method for early vision-language fusion at the encoder-backbone interface, addressing the text over-dominance and visual fidelity degradation caused by late semantic injection in diffusion-based dataset distillation.

ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction

ExpoCM models single-image HDR reconstruction as an exposure-aware consistency model trajectory. By using soft exposure masks to categorize LDR regions into overexposed, underexposed, and normal, distinct PF-ODE consistency trajectories are designed for each (hallucinating details from pure noise for overexposed regions, injecting low-frequency priors for underexposed regions, and using the input directly for normal regions). Combined with an exposure-weighted luminance-chrominance loss in the CIE L*a*b* space, it achieves SOTA fidelity via distillation-free, single-step inference, operating over 400x faster than DDPM.

FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

FAPE-IR utilizes a frozen multimodal large language model (Qwen2.5-VL) as a "planner" to interpret degraded images and generate frequency-aware restoration plans. An execution stage utilizing a LoRA-MoE within a diffusion framework dynamically schedules high- and low-frequency experts based on these plans. Combined with adversarial training and frequency regularization, the method achieves SOTA performance across seven restoration tasks and exhibits strong zero-shot generalization to unseen composite degradations.

FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

This work proposes a fine-grained perceptual reward model, FinPercep-RM, and a Co-evolutionary Curriculum Learning (CCL) strategy to address reward hacking and training instability in RLHF-based real-world super-resolution. It achieves local defect awareness by simultaneously outputting global quality scores and spatial degradation heatmaps.

FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model

FoundIR-v2 discovers that the "training data mixture ratio of different restoration tasks" is a key variable determining all-in-one image restoration performance. Consequently, it employs a dual-scheduling scheme—Dynamic Equilibrium Scheduling (dynamic ratio adjustment) and an MoE-driven Diffusion Scheduler (task-adaptive generative prior allocation)—for generative pre-training on SDXL. A single model covers 50+ sub-tasks and outperforms existing SOTA on multiple benchmarks.

From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing

EvDehaze introduces event cameras to the dehazing task for the first time, remodeling dehazing as "event-conditioned image generation." By injecting high dynamic range (HDR) edge/contrast cues from events into latent space DDIM diffusion via cross-attention, it generates more realistic and clear dehazed images without requiring real-world paired supervision. It includes the first real-world RGB-Event drone dataset for foggy conditions.

Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery

This work integrates Gaussian Splatting from 3D reconstruction into the t-SVD framework: 2D Gaussian Splatting is used to generate the latent tensor, while 1D Gaussian Splatting generates the transform matrix. This results in GSLR, a continuous, compact representation capable of capturing local high-frequency details. Based on this, an unsupervised multi-dimensional image recovery model is established, outperforming SOTA methods in PSNR/SSIM across random, tubal, and slice-wise missing patterns.

GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution

To address the issue where the deterministic output of one-step Real-ISR prevents preference optimization, this paper first employs "controllable noise injection + unequal timesteps" to enable one-step models to generate diverse candidates. It then merges DPO's pixel-level constraints with GRPO's intra-group relative advantage into GDPO, integrated with an attribute-aware reward function dynamically weighted by image smoothness/texture proportions. This approach enhances both fidelity and perceptual quality without increasing any inference overhead.
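
The "intra-group relative advantage" borrowed from GRPO scores each candidate against the other candidates generated for the same input. A minimal sketch, assuming standardization by the group mean and the population standard deviation (implementations differ on this detail); the reward values are made up, and the DPO-style constraint and attribute-aware reward are not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four super-resolved candidates of one LR image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates above the group mean receive positive advantages and those below receive negative ones, which is why the one-step model must first be made to produce diverse candidates.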

GSNR: Graph Smooth Null-Space Representation for Inverse Problems

The authors propose Graph Smooth Null-Space Representation (GSNR), which utilizes spectral graph theory to construct a null-space constrained Laplacian and selects the \(p\) smoothest spectral modes as projection bases. This provides structured null-space constraints for inverse problem solvers such as PnP, DIP, and diffusion models, achieving improvements of up to 4.3dB PSNR in deblurring, compressed sensing, demosaicing, and super-resolution.
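
A rough sketch of "the \(p\) smoothest modes inside the null space" follows. It is one plausible reading rather than the paper's construction: it takes an orthonormal null-space basis of the forward operator from the SVD, restricts a graph Laplacian to that subspace, and keeps the eigenvectors with the smallest eigenvalues. The path-graph Laplacian, the random operator, and the function names are assumptions.

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of a path graph on n nodes (a stand-in for an image graph)."""
    W = np.eye(n, k=1) + np.eye(n, k=-1)
    return np.diag(W.sum(axis=1)) - W

def graph_smooth_null_basis(A, L, p, tol=1e-10):
    """Return p orthonormal null-space vectors of A that are smoothest under L."""
    _, s, vt = np.linalg.svd(A)            # full SVD: vt is (n, n)
    rank = int(np.sum(s > tol))
    N = vt[rank:].T                        # orthonormal basis of null(A)
    L_null = N.T @ L @ N                   # Laplacian restricted to null(A)
    _, eigvecs = np.linalg.eigh(L_null)    # eigenvalues in ascending order
    return N @ eigvecs[:, :p]

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))            # toy forward operator, 3 measurements
basis = graph_smooth_null_basis(A, path_laplacian(8), p=2)
```

Any combination of the returned columns is invisible to the measurements, so a solver can adjust those components without changing data consistency.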

Gyro-based Deep Video Deblurring

GyroDVD is the first learning-based framework for "gyro-assisted video deblurring." It utilizes a decomposed camera motion model to split per-pixel movement into rotational (measured by gyro) and translational (estimated via optical flow) components to construct per-pixel blur kernels. These kernels guide an image encoder and video decoder to restore clean video. It significantly outperforms all prior gyro-based image/video deblurring methods on the large-scale real-world dataset, GyroVD.

HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter

A rapidly rotating spoke-patterned neutral density filter (SpokeND) is placed in front of a spike camera, allowing each pixel to periodically sample light intensity at multiple attenuation levels. A two-stage ReST-Net (ReGain for spatial de-attenuation + ReFine for temporal flicker suppression) is then used to reconstruct high frame rate (HFR, up to 2000 FPS) and high dynamic range (HDR) video from these "multi-attenuated spikes."

Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework

This paper introduces human electroencephalogram (EEG) cognitive signals into multi-exposure fusion (MEF): first by constructing Cog-Expo, the first paired MEF-EEG dataset, and then employing "bi-level optimization" to distill cognitive knowledge from an EEG-guided Teacher to a Student that only uses images and requires no EEG during inference, achieving SOTA results on MEF benchmarks with fusion results more aligned with human perception.

Hybrid Agents for Image Restoration

To address the pain points of "non-experts being unable to select the right tools" and "sequential restoration causing error propagation" in real-world image restoration, HybridAgent is proposed. It employs a triad of "Fast, Slow, and Feedback" agents for collaborative scheduling, working with a suite of single and mixed degradation restoration tools trained in three stages. The system routes simple instructions through a lightweight fast path while processing complex degradations via an MLLM-based slow path with closed-loop feedback, achieving both efficiency and stability in automated image restoration.

IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution

IAFMNet quantifies the "uneven information distribution across image regions" into an Information Density Map (IDM) using information theory. This IDM drives a dual-branch network featuring sparse convolution and affine modulation, concentrating computational power on "difficult-to-reconstruct, information-dense" areas like textures and edges, achieving superior reconstruction quality with lower FLOPs compared to other efficient SR methods of similar scales.

IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution

IFCSR shifts "fidelity vs. realism" adjustment from the latent space of diffusion models to the image space. By generating a high-fidelity image and a highly realistic image using two specialized networks, users can linearly blend the two in the image space with a single parameter \(\gamma\). This allows for arbitrary sliding across the fidelity-realism spectrum without any additional network inference.
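
The image-space blend itself is a single linear interpolation. In the sketch below the direction of \(\gamma\) (0 for the high-fidelity output, 1 for the highly realistic output) is an assumption, as are the function and variable names.

```python
import numpy as np

def blend(fidelity_img, realism_img, gamma):
    """Linearly interpolate two restored images in image space."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - gamma) * fidelity_img + gamma * realism_img

# Toy stand-ins for the two network outputs.
fidelity_img = np.zeros((4, 4, 3))
realism_img = np.ones((4, 4, 3))
mixed = blend(fidelity_img, realism_img, gamma=0.25)
```

Because both images are computed once, moving the slider only re-evaluates this interpolation, which is why no further network inference is needed.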

Flickerformer: A Duet of Periodicity and Directionality for Burst Flicker Removal

This work reveals that flicker artifacts possess two inherent physical properties: periodicity and directionality. It designs the Flickerformer with three modules (PFM/AFFN/WDAM) to model inter-frame/intra-frame periodicity and directionality respectively. With only 3.92M parameters, it achieves 31.226dB PSNR on the BurstDeflicker benchmark, surpassing the runner-up AST by +0.580dB using only 19.70% of its parameters.

Language-Guided One-Step Diffusion Model for Nighttime Flare Removal

Addressing the issue where nighttime flares from strong light sources occlude local areas and existing methods lack semantic priors for these regions—leading to artifacts or lost details—this paper introduces Flare-VLM, the first flare-specific Vision-Language Model. Flare-VLM outputs structured descriptions to guide a one-step diffusion model for reconstruction in a single forward pass. Furthermore, it proposes Semantic-Aware Distribution Distillation (SADD) to concentrate noise in flare regions and an instruction-driven data synthesis pipeline for realistic training data. This approach outperforms existing methods in both restoration quality and downstream detection.

Learned Image Compression via Sparse Attention and Adaptive Frequency

SAAF uses a "spatial-frequency dual-path" transform network for learned image compression. The spatial path employs Cross-Sparse Window Attention (CSWA) to efficiently model long-range dependencies with minimal global tokens, while the frequency path replaces fixed wavelet transforms with content-adaptive frequency reweighting (AFB). A Denoising-as-Regularizer (DaR), active only during training, smooths the latent space. SAAF achieves SOTA BD-rate on Kodak/CLIC/Tecnick benchmarks with the lowest latency (67 ms).

Learning Personalized Photographic Style from Pairwise User Preferences

This paper defines a new task called PPS (Personalized Photographic Style), which involves learning personalized aesthetic preferences from pairwise user judgments and applying them to new photos. The authors provide the PPSD dataset (approximately 60,000 preference judgments from 767 users), three baseline models, and a specialized evaluation metric, CQS, which balances "fidelity" and "preference alignment." The study demonstrates that it is feasible to learn individual aesthetics from purely comparative signals.

LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising

The "blind-spot" idea from single-image denoising is extended to a "blind-view" scheme for light fields (LF). By masking a subset of views and reconstructing them using the multi-view consistency of the remaining views, the network is trained without clean images. A Geometric Invariant Mask (GIM) enables a single weight-shared network to denoise all views. The method matches or surpasses supervised counterparts on synthetic, real-world, and microscopic light fields.

Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling

Addressing the issue in Blind Image Quality Assessment (BIQA) where "blindly merging all layer features introduces noise," Life-IQA uses only the features from the two deepest layers of the backbone for quality decoding. It employs a GCN-enhanced query topology to treat stage 4 features as queries and stage 3 features as keys/values for cross-layer interaction. An MoE head placed after this interaction then decouples features by distortion type, achieving SOTA performance on seven BIQA benchmarks with approximately 95M parameters.

LightRR: A Lightweight Network for Single Image Reflection Removal

To address the issues of excessive size and slow speed in Single Image Reflection Removal (SIRR) models, LightRR employs wavelet frequency division to process low-frequency components (where reflections are concentrated) using a Mamba State Space Model, while high-frequency components pass through a lightweight bypass. During training, a knowledge distillation strategy allows a small encoder to learn from large pre-trained models and then discard them for inference. LightRR achieves near-SOTA performance using only 3.01% of the parameters and 5.22% of the FLOPs compared to RDNet.
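
The frequency split that LightRR relies on can be illustrated with a one-level 2D Haar transform, which separates an image into one low-frequency band and three detail bands. The choice of the Haar wavelet and the band naming are assumptions here; the summary does not say which wavelet the paper uses, and the Mamba and bypass branches are not shown.

```python
import numpy as np

def haar_split(img):
    """One-level orthonormal 2D Haar transform of an (H, W) image with even H, W."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    low = (a + b + c + d) / 2                      # low-frequency band
    details = ((a - b + c - d) / 2,                # horizontal differences
               (a + b - c - d) / 2,                # vertical differences
               (a - b - c + d) / 2)                # diagonal differences
    return low, details

def haar_merge(low, details):
    """Inverse of haar_split."""
    d1, d2, d3 = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=float)
    out[0::2, 0::2] = (low + d1 + d2 + d3) / 2
    out[0::2, 1::2] = (low - d1 + d2 - d3) / 2
    out[1::2, 0::2] = (low + d1 - d2 - d3) / 2
    out[1::2, 1::2] = (low - d1 - d2 + d3) / 2
    return out

img = np.arange(16.0).reshape(4, 4)
low, details = haar_split(img)
restored = haar_merge(low, details)
```

The low band has a quarter of the pixels, so routing only that band through the heavier branch is one reason such a split can reduce computation.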

LRHDR: Learning Representation-enhanced HDR Video Reconstruction

LRHDR reconstructs HDR video from LDR video frames with alternating exposures. It replaces the traditional "align-then-fuse" paradigm with a "map-to-unified-representation and vote-to-fuse" approach: the ACCR network aligns features from different exposures into an exposure-agnostic unified representation domain via pixel-wise affine modulation, while the APSWF reformulates fusion as pixel-wise sparse candidate selection. It achieves SOTA PSNR/SSIM in both two-exposure and three-exposure settings.

MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration

MMDIR integrates "asking the model, via text instructions, which degradations are present in a document image" into the restoration pipeline. A degraded document image is paired with a text instruction; after joint processing by a vision encoder and an LLM, the LLM first outputs a diagnostic text identifying the existing degradations. These semantic features then guide the vision decoder for targeted restoration. This allows unified handling of four types of mixed and uncertain degradations (blur, shadow, text watermark, and seal) without relying on degradation priors or training separate models for each type.

MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior

MR. Illuminate utilizes a completely frozen, zero-training, and zero-optimization pre-trained diffusion model (SD v1.5) for low-light enhancement. It performs DDIM inversion on the input, applies AdaIN to align the statistics of the inverted latents with the standard normal distribution expected by the model for global luminance/color correction (Modulate), and injects self-attention features recorded during inversion into the sampling stage to restore local structures and colors (Refine). Without any auxiliary losses, degradation assumptions, or parameter tuning, it outperforms SOTA methods on standard benchmarks and maintains color constancy under varying illumination for the same scene.
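
The Modulate step can be pictured as a statistics alignment. The sketch below is an assumption-laden illustration rather than the paper's code: it treats the inverted latent as a `(C, H, W)` array and normalizes each channel to zero mean and unit standard deviation, which is what AdaIN reduces to when the target statistics are those of a standard normal. The function name and the per-channel granularity are hypothetical.

```python
import numpy as np

def align_to_standard_normal(latent, eps=1e-5):
    """Shift and scale each channel of a (C, H, W) latent to mean 0 and std 1."""
    mean = latent.mean(axis=(1, 2), keepdims=True)
    std = latent.std(axis=(1, 2), keepdims=True)
    return (latent - mean) / (std + eps)

rng = np.random.default_rng(0)
# Stand-in for an inverted latent whose statistics drifted away from N(0, 1).
dark_latent = 0.3 * rng.standard_normal((4, 64, 64)) - 1.5
aligned = align_to_standard_normal(dark_latent)
```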

Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing

MambaCS unrolls the classic Proximal Gradient Descent (PGD) algorithm into a U-shaped deep network across multiple feature scales. It replaces traditional convolution/Transformer modules in unrolling networks with a customized Adaptive State-Space Block (A-SSB) and redesigns the High-Dimensional Gradient Fusion (HDGF) and Feature-Adaptive Proximal Operator (FAPO). It achieves SOTA PSNR/SSIM on multiple compressive sensing reconstruction datasets with comparable parameter counts.
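
For readers unfamiliar with the algorithm being unrolled, here is a minimal sketch of classic Proximal Gradient Descent on a toy compressive sensing problem. It assumes an \(\ell_1\) sparsity prior, so the proximal operator is soft-thresholding; MambaCS replaces that hand-crafted operator with learned modules, which this sketch does not attempt to reproduce. The function names, the problem sizes, and the value of `lam` are illustrative choices.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd(A, y, lam=0.05, n_iter=200):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient descent."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    history = []
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                          # gradient step on the data term
        x = soft_threshold(x - step * grad, step * lam)   # proximal step on the prior
        history.append(0.5 * np.sum((A @ x - y) ** 2) + lam * np.abs(x).sum())
    return x, history

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)  # 40 measurements of a length-100 signal
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = 1.0
y = A @ x_true
x_hat, history = pgd(A, y)
```

An unrolled network fixes the number of iterations and turns each one into a trainable stage, which is the design the summary describes.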

Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex

Multinex reformulates Retinex decomposition from a "reconstruction target" into an "additive residual prior." By feeding a set of analytically computed multi-view luminance/chrominance priors into two ultra-lightweight fusion networks, it outperforms SOTA lightweight models and approaches million-parameter models using only 45K (or even 0.7K) parameters across seven low-light benchmarks.

NEC-Diff: Noise-Robust Event–RAW Complementary Diffusion for Seeing Motion in Extreme Darkness

NEC-Diff is a diffusion-based event-RAW hybrid imaging framework that uses illumination priors from RAW images to guide event denoising, and high-dynamic-range edges from events to assist image denoising. By combining dual-modal SNR-guided reliable information extraction with cross-modal attention diffusion, it achieves high-quality dynamic scene reconstruction in extreme darkness (0.001-0.8 lux) with a PSNR of 24.51 dB on the REAL dataset.

Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising

Inspired by "Next-Scale Prediction" in visual autoregression, NSP allows a Blind-Spot Network (BSN) to take low-resolution, decorrelated sub-images (derived from a large downsampling factor) as input to predict high-resolution, detail-preserving targets (corresponding to a small downsampling factor). This decouples "noise decorrelation" and "detail preservation"—two traditionally conflicting objectives—across different scales, achieving self-supervised SOTA on real-world denoising benchmarks while providing noise-aware super-resolution for free.
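
A rough way to see the two scales involved is strided sub-sampling. The sketch below assumes the sub-images are produced by pixel-shuffle-style downsampling, which the summary does not state explicitly, and the helper name `pd_subimages` is hypothetical. A large factor yields many small sub-images whose pixels were far apart in the original (weaker noise correlation), while a small factor yields fewer, larger sub-images that keep more detail.

```python
import numpy as np

def pd_subimages(img, s):
    """Split an (H, W) image into s*s strided sub-images of size (H/s, W/s)."""
    h, w = img.shape[:2]
    if h % s or w % s:
        raise ValueError("image size must be divisible by the factor")
    return np.stack([img[i::s, j::s] for i in range(s) for j in range(s)])

image = np.arange(64, dtype=float).reshape(8, 8)
coarse = pd_subimages(image, 4)  # 16 sub-images of 2x2: neighbours were 4 pixels apart
fine = pd_subimages(image, 2)    # 4 sub-images of 4x4: neighbours were 2 pixels apart
```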

One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Based on Qwen-Image, the One-Step Diffusion Transformer (ODTSR) utilizes "Noise-Mixed Visual Streams" (NVS) to achieve simultaneous fidelity and prompt controllability, continuously adjustable via a fidelity weight \(f\). Combined with "Fidelity-Aware Adversarial Training" (FAA) to compress multi-step denoising into single-step inference, it achieves SOTA performance in both general Real-ISR and Chinese/English scene text SR.

Outlier-Robust Diffusion Solvers for Inverse Problems

Aiming at outliers commonly found in real-world measurements, this paper introduces two safeguards to inverse problem solvers based on pre-trained diffusion models—first refining measurements with explicit noise estimation, then replacing the squared \(\ell_2\) data fidelity term with Iteratively Reweighted Least Squares using Huber loss, solved via gradient descent (Robust-GD) and conjugate gradient (Robust-CG) respectively. It proves significantly more stable than recent methods like DPS and DAPS under outlier contamination for linear and non-linear tasks such as super-resolution, inpainting, and deblurring.
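
The effect of swapping the squared \(\ell_2\) term for a Huber loss solved by Iteratively Reweighted Least Squares can be seen on a small linear problem. This is a sketch under stated assumptions, not the paper's solver: it omits the diffusion prior and the noise-estimation step entirely, and it solves each reweighted subproblem with a direct linear solve instead of the gradient descent or conjugate gradient variants named above. The function name, `delta`, and the toy data are illustrative.

```python
import numpy as np

def huber_irls(A, y, delta=1.0, n_iter=50):
    """Fit x to minimize the sum of Huber losses of the residuals A x - y via IRLS."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]   # start from ordinary least squares
    for _ in range(n_iter):
        r = np.maximum(np.abs(A @ x - y), 1e-12)
        w = np.where(r <= delta, 1.0, delta / r)   # large residuals get small weights
        Aw = A * w[:, None]
        x = np.linalg.solve(A.T @ Aw, Aw.T @ y)    # weighted normal equations
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
y = A @ x_true + 0.01 * rng.standard_normal(200)
y[:20] += 50.0                                     # contaminate 10% of the measurements

x_ols = np.linalg.lstsq(A, y, rcond=None)[0]
x_rob = huber_irls(A, y)
err_ols = np.linalg.norm(x_ols - x_true)
err_rob = np.linalg.norm(x_rob - x_true)           # expected to be much smaller than err_ols
```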

PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors

The PhaSR framework is proposed to achieve generalized shadow removal—ranging from single-source direct shadows to multi-source ambient light scenes—through dual-level physical prior alignment. This includes global-level PAN, which performs parameter-free Retinex decomposition to suppress color bias, and local-level GSRA, which utilizes differential attention to align DepthAnything geometric priors with DINO-v2 semantic embeddings. PhaSR achieves SOTA performance on WSRD+ and Ambient6K with the lowest FLOPs.

Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration

To address the complex non-linear deformations of excavated ancient bamboo slips caused by dehydration stress, this paper utilizes wood rheology to establish a computable "forward physical deformation engine" for generating unpaired training data. It then trains a ControlPointUNet to progressively predict reverse displacement fields, "bending" the bamboo slips back to their original state step-by-step. The method significantly outperforms data-driven approaches such as CycleGAN, DewarpNet, and DDRM in terms of text fidelity (TRQ) and physical deformation consistency (DCI).

PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems

Consistency Models (CM) are reinterpreted as "proximal operators of a prior" and integrated into an ADMM-based Plug-and-Play (PnP) framework. By incorporating noise injection and momentum to compress the iterations to 2–4 function evaluations (NFE), this work solves linear and nonlinear inverse problems within a single framework and applies CM training to MRI reconstruction for the first time.

Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction

Addressing the overlooked degradation problem of "ghosting artifacts + color bias" when photographing through colored glass, this paper introduces polarization imaging theory into modeling for the first time. It proposes a physical imaging model, PSTM (tracing multi-path propagation of polarized light + wavelength-selective absorption), and designs a polarization-aware network, PANet, featuring Channel Ring Attention. On the self-built real-world dataset GlassPol, it achieves an improvement of approximately 3dB PSNR over existing SOTA methods while reconstructing transmission scenes with high color fidelity.

RADAR: VQ-VAE Decoder of VAR is a Good Student for Restoring Against Degradation by Acceleration

To address the issue of latent representation degradation and decreased image quality in Visual Autoregressive (VAR) models after acceleration, this paper proposes RADAR, a two-stage framework. First, the Semantic Cost-Aware Mask (SCA-Mask) converts attention pruning into an optimization problem of "retaining maximum semantic information under budget constraints." Second, Post-Acceleration Adaptation (PAA) utilizes a LoRA attached to the VQ-VAE decoder and uses the unaccelerated branch as a teacher for internal knowledge distillation to restore degraded latent representations into high-fidelity images. This achieves approximately 1.6–1.9× speedup on ImageNet-1K with almost no loss in FID (restoring VAR-d20 from a degraded 5.02 back to 2.68, compared to the original 2.61).

RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

This paper demonstrates that carefully designed device-specific degradation modeling (obtaining real blur and noise parameters via calibration) significantly improves the real-world performance of smartphone super-resolution. By unprocessing public rendered images into the RAW domain of different smartphones to generate high-low resolution training pairs, the trained SR models significantly outperform baselines trained with massive arbitrary degradation combinations on held-out real device data.

RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion

RawMetaDiff reframes the fragile explicit registration of "short/long exposure frames" as a "conditional generation" problem. It uses noisy short-exposure RAW as the diffusion initialization, references a potentially misaligned long-exposure RAW, and is guided by RAW metadata (ISO/CCM/exposure) for one-step latent diffusion. Utilizing MACT for global color transfer and MNCA for shadow detail injection, it achieves a 33% LPIPS improvement on synthetic data and a 15% DeQA gain on real data.

Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

This paper proposes Real-IISR, a unified autoregressive framework that addresses the unique challenges of real-world infrared image super-resolution (IISR) through a Thermal-Structural Guidance module, a Conditional Adaptive Codebook, and a Thermal Ordering Consistency loss. It also constructs the FLIR-IISR dataset (1457 pairs of real LR-HR infrared images).

Reflection Separation from a Single Image via Joint Latent Diffusion

Addressing the difficulty of simultaneously restoring transmission and reflection layers in extreme scenarios like strong glare or weak reflections, this paper fine-tunes a latent diffusion model to simultaneously generate both layers using a unified model with "Transmission / Reflection" prompts. Combined with cross-layer self-attention, disjoint sampling, and test-time latent synthesis optimization, it achieves SOTA quality for both transmission and reflection across multiple real-world benchmarks (e.g., Real20 PSNR 25.32, reflection layer LPIPS reduced from 0.52 to 0.37).

ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation

ReflexSplit proposes an explicit layer fusion-separation framework. It adaptively aggregates multi-scale features through Cross-scale Gated Fusion (CrGF) and employs differential dual-dimensional attention \(\mathbf{A}^t - \lambda_\ell \mathbf{A}^r\) within the Layer Fusion-Separation Block (LFSB) for cross-stream interference suppression. Combined with a curriculum training strategy utilizing depth-dependent initialization and epoch-wise warmup, it achieves SOTA performance on both synthetic and real-world reflection separation benchmarks.
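
The differential term \(\mathbf{A}^t - \lambda_\ell \mathbf{A}^r\) can be sketched as the difference of two softmax attention maps. Everything beyond that formula is an assumption here: the single-head layout, the shared value matrix, the scaling, and the function names are hypothetical and do not reproduce the LFSB's dual-dimensional design.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(q_t, k_t, q_r, k_r, v, lam):
    """Apply (A_t - lam * A_r) to v, where A_t and A_r are softmax attention maps."""
    scale = 1.0 / np.sqrt(q_t.shape[-1])
    a_t = softmax(q_t @ k_t.T * scale)   # attention map of the transmission stream
    a_r = softmax(q_r @ k_r.T * scale)   # attention map of the reflection stream
    diff = a_t - lam * a_r               # subtracting suppresses shared responses
    return diff @ v, diff

rng = np.random.default_rng(0)
n, d = 6, 8
q_t, k_t, q_r, k_r, v = (rng.standard_normal((n, d)) for _ in range(5))
lam = 0.5
out, diff = differential_attention(q_t, k_t, q_r, k_r, v, lam)
```

Each row of the difference map sums to \(1 - \lambda_\ell\), so \(\lambda_\ell\) directly controls how strongly the other stream's attention is subtracted.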

RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion

RegionFuse refines the fusion weights of Infrared-Visible Image Fusion (IVIF) from "global uniformity" to "region-adaptive according to local pixel distribution." By utilizing a region-level Mixture-of-Region Attention (MoRA) to dispatch regions with different pixel distributions to various masked attention experts, and enhancing effective regions while suppressing redundancy via a Region Feature Compression Module (RFCM), it achieves SOTA performance on four IVIF benchmarks and exhibits particular robustness against non-uniform illumination and exposure.

Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

TiGeSR decouples the inherent trade-off between "image quality" and "text readability" in scene text super-resolution using a "restore text first, enhance image later" two-stage paradigm. It first reconstructs precise glyph structures in text regions via a diffusion model, then injects these glyphs as conditions into ControlNet for full-image super-resolution. The authors also released UZ-ST, the first Chinese scene text dataset with a maximum zoom of \(\times 14.29\), achieving SOTA in both image quality and OCR accuracy on Real-CE and UZ-ST.

Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

DGAF-VSR revisits the role of "alignment + compensation" in diffusion-based video super-resolution (VSR). Based on two quantitative observations (that the feature domain exhibits stronger spatio-temporal correlation than the pixel domain, and that warping at high resolutions better preserves high-frequency details), the authors design the OGWM module for "up-warp-down" alignment in the feature domain and the FTCM module, which uses a full U-Net for dense temporal guidance. The method outperforms SOTA methods in perceptual quality, fidelity, and temporal consistency (DISTS \(-35.82\%\), PSNR \(+0.20\)dB, tLPIPS \(-30.37\%\)).

Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective

The authors attribute the failure of cross-dataset transfer in IQA to the mismatch of "perceptual preference structure" (i.e., differences in conditional distributions \(P(Y|X)\) across datasets). They propose Perceptual Preference Representation (PPR) as a feature-score correlation vector to quantify these preferences, Perceptual Preference Compatibility (PPC) using cosine similarity to measure dataset compatibility, and a greedy pruning strategy (PreSTA) to select source samples aligned with the target domain. Using only 20% of source data, this method outperforms full-data baselines.
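
The two quantities can be sketched in a few lines. This is an illustration under assumptions: the summary only says that PPR is a feature-score correlation vector and that PPC uses cosine similarity, so the choice of Pearson correlation per feature dimension and the function names `ppr` and `ppc` are hypothetical.

```python
import numpy as np

def ppr(features, scores):
    """Correlation between each feature dimension and the quality scores."""
    f = features - features.mean(axis=0)
    s = scores - scores.mean()
    denom = np.linalg.norm(f, axis=0) * np.linalg.norm(s) + 1e-12
    return (f.T @ s) / denom

def ppc(ppr_a, ppr_b):
    """Cosine similarity between two preference vectors."""
    return float(ppr_a @ ppr_b / (np.linalg.norm(ppr_a) * np.linalg.norm(ppr_b) + 1e-12))

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 8))
weights = rng.standard_normal(8)
scores_a = feats @ weights   # one dataset's scoring preference
scores_b = -scores_a         # a dataset with exactly opposite preferences

same = ppc(ppr(feats, scores_a), ppr(feats, scores_a))
opposite = ppc(ppr(feats, scores_a), ppr(feats, scores_b))
```

A compatibility near 1 suggests the source data rewards the same features as the target, while a value near -1 suggests transferring from it would be counterproductive.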

Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank

R2R decouples "degradation adaptation" from the backbone by offloading it to an external retrievable "degradation bank." During training, a degradation amalgamator distills clean priors from various degradations into the bank; during inference, degradation matching retrieves the most relevant priors to modulate features. This allows a single lightweight backbone to handle multiple degradations stably, matching SOTA PSNR while using only 9% of the computational cost.

Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration

Addressing the bottleneck where Mamba still requires pixel-wise scanning in Ultra-High-Definition (4K) image restoration, leading to memory explosion, C2SSM replaces "pixel-serial scanning" with "cluster-centric scanning." The method distills millions of pixels into a few semantic centroids via a neurally-parameterized mixture distribution, performs Mamba scanning only on these centroids, and diffuses global context back to all pixels based on similarity distributions. This paradigm achieves SOTA performance across five UHD restoration tasks with the lowest FLOPs (0.407G).

SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control

To address the issue where existing underwater image enhancement methods only provide fixed outputs despite varying user preferences, SDUIE proposes a semi-supervised dual-branch diffusion framework. It enables continuous numerical adjustment via a fusion factor \(\alpha\) (SDUIE-Quant) and semantic adjustment via natural language prompts (SDUIE-Text), achieving SOTA performance while preserving underwater aesthetic tones.

DeSpike: Defocus Deblurring and Image Reconstruction for Spike Camera

DeSpike is the first end-to-end deblurring and reconstruction framework specifically designed for spike camera defocus blur. It first characterizes how defocus distorts spike firing using a thin-lens physical model, then restores clear images from blurred spike streams using multi-temporal scale IF neurons, learnable discrete PSF priors, and multi-spatial scale iterative refinement. It significantly outperforms existing deblurring methods on both synthetic and real defocused spike data.

Self-Diffusion Driven Blind Imaging

DeblurSDI extends "self-diffusion" (an inverse problem solver requiring no pretraining) from non-blind scenarios with known degradation operators to blind scenarios. Using two randomly initialized networks without pretraining, it simultaneously reconstructs the sharp image and the Point Spread Function (PSF) during a reverse diffusion process starting from pure noise. The noise scheduling naturally stabilizes the joint optimization, which is typically prone to collapse, and the method significantly outperforms existing blind deblurring methods on optical aberrations and motion blur.

Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration

UP-ZeroIR identifies that heterogeneous degradations such as noise, haze, and low-light can be characterized by a two-parameter Generalized Gaussian Distribution (GGD) in latent space. Consequently, degradation modeling is reformulated as a "distribution alignment" problem, integrated with a strategy for self-assessing quality and dynamically adjusting the sampling trajectory. This allows the pre-trained diffusion model to set new SOTA results in both single and mixed degradation restoration under zero-shot settings without retraining.

SelfHVD: Self-Supervised Handheld Video Deblurring

SelfHVD utilizes naturally existing sharp frames in handheld videos as supervisory signals. Through Self-Enriched Video Deblurring (SEVD) for constructing high-quality training pairs and Self-Constrained Spatial Consistency Maintenance (SCSCM) to prevent displacement shifts, it achieves handheld video deblurring without paired data.

ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration

ShiftLUT achieves the largest receptive field (65×65) among LUT-based methods via a Learnable Spatial Shift (LSS) module. Combined with an asymmetric dual-branch architecture and Error-bounded Adaptive Sampling (EAS), it outperforms all existing LUT methods with 104KB storage and 84ms inference latency.

ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts

ShreddingNet utilizes a two-stage pipeline—"coarse matching for source-based clustering followed by fine-grained pair-wise scoring"—to solve the restoration problem where fragments from multiple paintings are mixed. It reduces model complexity from \(O(n^2)\) to \(O(n)\), achieving a global assembly F1-score of 98.37% on two datasets, outperforming the previous SOTA by 5.72%.

Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor

Addressing the ill-posed nature of single-frame RGB deblurring and the issues of saturation and edge/motion entanglement in event cameras, this paper utilizes the high-frame-rate Spatial Difference (SD, encoding structural edges) and Temporal Difference (TD, encoding motion) captured synchronously by the Tianmouc Complementary Vision Sensor (CVS) within a single RGB exposure. The authors design STGDNet, a recursive multi-branch network that injects SD/TD into the RGB feature space sequentially. Complemented by a DMD data pipeline for generating real aligned training pairs, the method achieves SOTA performance on both synthetic CVS datasets and over 100 real extreme motion scenarios.

Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization

UALNet achieves spectral super-resolution from Sentinel-2 multispectral data (12 bands) to NASA AVIRIS hyperspectral images (186 bands) by embedding data-driven spectral priors (PriorNet) and adversarial learning terms into a deep unfolding framework. It outperforms Transformer-based methods while requiring only 15% of the computation and 1/20 of the parameters.

Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

This paper proposes the Statistical Characteristic-Guided Denoising Network (SCGN), which utilizes spatial window standard deviation weighting and frequency band-guided weighting to adaptively enhance signals and suppress noise in both spatial and frequency domains. Combined with an HRTEM-specific noise calibration method to generate realistic datasets of disordered structures, it achieves high-quality denoising of millisecond-scale High-Resolution Transmission Electron Microscopy images.

Task-Aware Image Signal Processor for Advanced Visual Perception

TA-ISP replaces the RAW→RGB step—traditionally either a heavy network or a few tuned parameters—with predicted sets of global, regional, and pixel-level modulation operators. At the cost of only 3K parameters and sub-27ms latency, it produces image representations optimal for downstream detection and segmentation, achieving superior accuracy while significantly reducing computation and latency across multiple RAW benchmarks.

The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations

This paper discovers through systematic experimental analysis that pretraining INRs with unstructured noise (Uniform/Gaussian distributions) achieves a surprising ~80dB PSNR in image fitting, significantly outperforming all data-driven initialization methods. Meanwhile, noise with a spectral structure of \(1/|f^\alpha|\), matching natural images, achieves the best balance between signal fitting and denoising, matching SOTA data-driven initialization performance without requiring any real data.
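
Noise with a \(1/|f^\alpha|\) spectrum can be synthesized by shaping white noise in the Fourier domain. The sketch below is one common construction, not necessarily the paper's: it assumes the exponent applies to the amplitude spectrum, and the function name and image size are illustrative. Setting `alpha` to 0 leaves the noise unstructured, while larger values concentrate energy at low frequencies, as in natural images.

```python
import numpy as np

def power_law_noise(size, alpha, rng):
    """Return a (size, size) noise image whose amplitude spectrum falls off as 1/|f|^alpha."""
    white = rng.standard_normal((size, size))
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0                        # placeholder to avoid dividing by zero at DC
    spectrum = np.fft.fft2(white) / radius ** alpha
    spectrum[0, 0] = 0.0                      # drop the DC component
    img = np.fft.ifft2(spectrum).real
    return (img - img.mean()) / img.std()     # normalize to zero mean, unit variance

rng = np.random.default_rng(0)
white = power_law_noise(64, 0.0, rng)   # unstructured noise
smooth = power_law_noise(64, 2.0, rng)  # strongly low-frequency noise
```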

TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking

This paper proposes the TIACam framework, which achieves a camera-robust zero-watermarking scheme without modifying image pixels. By employing a learnable auto-augmentor to simulate camera distortions, text-anchored cross-modal adversarial training to learn invariant features, and a zero-watermarking head to bind messages in the feature space, the method achieves SOTA extraction accuracy in three real-world scenarios: screen re-shooting, print re-shooting, and screenshots.

Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

TADSR identifies that existing one-step diffusion SR methods fix the student's timestep at 999, wasting the diverse generative priors of Stable Diffusion (SD) across different timesteps. It introduces a time embedding to the VAE encoder, allowing the same image to encode different latents based on the timestep, and utilizes a mapping function to bind student and teacher timesteps. This enables consistent generative guidance in a single step and allows for seamless adjustment between fidelity and realism by simply tuning \(t_s\), achieving SOTA performance on non-reference metrics across multiple real and synthetic datasets.

Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition

TSANet leverages event cameras to "unfold" a single motion-blurred image into a high-frame-rate sharp video. The core mechanism involves "time-specializing" both event and image features to align them with an arbitrary query timestamp \(t\) before lightweight fusion, consistently outperforming previous SOTA methods on GoPro, HighREV, and EBD datasets.

Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution

Addressing the lack of effective pre-training strategies for Space-Time Video Super-Resolution (STVSR), this paper proposes a method that "copies a single image into multiple frames + applies independent random zeroing per frame" to forge a video without real time. This allows the target STVSR network to undergo pre-training by reconstructing clean high-spatio-temporal resolution outputs from degraded pseudo-temporal inputs. A difficulty-adaptive pixel loss is used to focus on hard-to-generate regions. This architecture-agnostic, image-only pre-training improves the PSNR of various STVSR networks by up to +5dB under few-shot fine-tuning.
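
The "copy + independent random zeroing" recipe is simple enough to sketch directly. The granularity of the zeroing is an assumption here (pixel-wise masks drawn independently for every frame), as are the function name and the `drop_prob` parameter; the paper's actual degradation may differ.

```python
import numpy as np

def pseudo_temporal_clip(image, n_frames, drop_prob, rng):
    """Repeat one image into a clip and zero out random pixels independently per frame."""
    target = np.repeat(image[None], n_frames, axis=0)   # clean clip with no real motion
    keep = rng.random(target.shape) >= drop_prob        # a fresh mask for every frame
    degraded = target * keep
    return degraded, target

rng = np.random.default_rng(0)
image = rng.random((16, 16)) + 0.1   # strictly positive so zeroed pixels are identifiable
degraded, target = pseudo_temporal_clip(image, n_frames=4, drop_prob=0.5, rng=rng)
```

Because each frame hides different pixels, a network can only reconstruct the clean clip by combining information across frames, which mimics the temporal aggregation needed for real video.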

TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising

This paper proposes the Triangular-Masked Blind-Spot Network (TM-BSN), which aligns the blind-spot shape precisely with the diamond-shaped spatial correlation patterns of real sRGB noise. It achieves self-supervised image denoising at the original resolution without downsampling and further enhances performance through knowledge distillation, reaching SOTA on SIDD and DND benchmarks.

Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment

UniPrior unifies the "illumination-invariant signal prior" with the "semantic prior from Visual Foundation Models (DINOv2/CLIP)." Without using any real low-light data for training, it enables models trained on daytime data to generalize robustly to unseen nocturnal/low-light scenes, setting new zero-shot SOTA results across classification, segmentation, and face detection tasks.

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

Addressing the failure of diffusion models like SD (with a native resolution of \(512^2\)) in \(\times 8\) high-magnification super-resolution (e.g., \(256^2 \to 2048^2\)), TUDSR decomposes "one-time high-magnification upsampling" into two stages of "upsampling-diffusion" that fall within the model's native capability. By using two serial LoRAs and a one-step GAN, it produces high-quality \(2048^2\) images on 4x RTX 4090 GPUs, achieving SOTA on perceptual metrics across multiple real-world datasets, particularly in \(\times 8\) tasks.

UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

The authors propose UCAN, a lightweight super-resolution network that unifies convolutional and attention mechanisms to efficiently expand the effective receptive field. By introducing Hedgehog attention, it addresses the rank collapse problem in linear attention. The model incorporates a large kernel distillation module and a semi-sharing parameter strategy, achieving a 31.63 dB PSNR on Manga109 (4×) with only 48.4G MACs.

UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

UDAPose achieves a 56.4% AP improvement on low-light hard sets through Stable Diffusion-based low-light image synthesis (preserving high-frequency low-light features) and a Dynamic Attention Control module (adaptively balancing visual cues with pose priors).

Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis

This work constructs UniCAC, the first large-scale universal computational aberration correction (CAC) benchmark. It proposes the Optical Degradation Evaluator (ODE) to quantify aberration difficulty and provides a comprehensive evaluation of 24 image restoration and CAC algorithms. The study reveals how prior utilization, network architecture, and training strategies influence CAC performance.

UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

UniLDiff constructs a unified image restoration framework using Stable Diffusion XL as a backbone. It employs "Degradation-Aware Feature Fusion (DAFF)" to dynamically inject low-quality features into the diffusion trajectory at each denoising step and a "Detail Expert Module (DAEM)" in the decoder via MoE to recover high-frequency details lost during VAE compression. It achieves SOTA perceptual quality in multi-task, composite degradation, and zero-shot real-world degradation scenarios.

UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization

UniRain is a unified image deraining framework that filters high-quality samples from million-scale public datasets through RAG-driven data distillation. Combined with an asymmetric MoE architecture and a multi-objective reweighted optimization strategy, it achieves consistently superior performance across four degradation types: rain streaks and raindrops (day/night).

VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba

VEMamba serves as the first application of Mamba to the isotropic reconstruction of Volume Electron Microscopy (VEM). By employing "Axial-Lateral Chunk-based Selective Scanning (ALCSSM) + Dynamic Weight Aggregation (DWAM)," it rearranges 3D voxel dependencies into 1D sequences for linear-complexity modeling. It incorporates degradation priors via realistic simulation and Momentum Contrast (MoCo). The model achieves SOTA performance on EPFL and CREMI datasets with the lowest parameter count and computational overhead.

VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution

The authors construct VoDaSuRe—the largest paired multi-resolution real CT dataset to date (\(\sim 194\) gigavoxels, 16 samples, 32 scans). It reveals a fact obscured by existing volumetric SR research: the "impressive performance" of current SOTA models stems primarily from training on synthetic downsampled data. Faced with physically acquired real low-resolution scans, these models output only spatially averaged blurry results and fail to reconstruct lost microstructures.

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

This work reformulates color constancy (light source estimation) from "direct RGB regression" into a closed-loop process of "white-balance first, then let VLM provide feedback for iterative correction." In each round, the current estimate is used to white-balance the image and convert it into a pseudo-sRGB format. A LoRA-finetuned VLM judges whether the image remains red/green/blue-tinted, driving the rotation of the light source direction toward the corresponding axis until convergence. Without requiring target camera calibration or retraining, it achieves SOTA performance on four cross-camera benchmarks, significantly reducing errors in the most difficult 25% of samples.

Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation

ZS-HPD trains a denoising network using training pairs generated from a single noisy image. It utilizes a "gradient-sorted grouping" downsampler to capture local priors and a "Gaussian-constrained global random sampler" to capture non-local self-similarity priors. Combined with a frequency-domain loss that weights high-frequency components, ZS-HPD outperforms existing methods like Pixel2Pixel in both performance and efficiency.

ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models

ZeroIDIR decomposes illumination degradation image restoration into two steps: "Adaptive Illumination Correction + Diffusion Reconstruction." It is trained exclusively on degraded images without any reference images or paired data. First, the Adaptive Gamma Correction Module (AGCM) shifts the exposure to a natural distribution. Then, this corrected result is treated as an intermediate noise state and fed into the Perturbed Consistency Diffusion Model (PCDM) for detail refinement and denoising. It achieves leading performance among unsupervised methods and shows generalization to unseen scenes that even surpasses supervised methods.