Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution¶
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/Chanson94/CODSR
Area: Diffusion Models / Image Super-Resolution
Keywords: One-step Diffusion, Real-world Image Super-Resolution, Generative Prior Activation, Feature Modulation, Textual Alignment
TL;DR¶
CODSR performs real-world image super-resolution via one-step diffusion: it first utilizes "local noise injection" based on gradient maps to activate generative priors in textured regions, then employs uncompressed LQ features to modulate U-Net intermediate layers for fidelity restoration, and finally constrains cross-attention with noun masks from Grounded-SAM2 for textual alignment. It achieves superior perceptual quality and competitive fidelity across four real-world datasets.
Background & Motivation¶
Background: In recent years, the mainstream approach for Real-world Image Super-Resolution (Real-ISR) has been leveraging the strong generative priors of pre-trained text-to-image diffusion models (e.g., Stable Diffusion) to "hallucinate details." However, full diffusion requires dozens or hundreds of denoising steps, which is computationally expensive. Consequently, several one-step diffusion methods (OSEDiff, PiSA-SR, TVT, HYPIR, etc.) have emerged, distilling multi-step processes into a single-step forward pass.
Limitations of Prior Work: The authors identify three critical unresolved issues in existing one-step methods:
1. Poor Fidelity: LQ images are first compressed and encoded into the latent space via the VAE; this compression inherently loses information, leading to structural deviations during reconstruction.
2. Non-partitioned Activation of Generative Priors: Existing methods feed the LQ latent directly into the denoising network, which deviates from the native "recovery from noise" mode of diffusion models. Moreover, treating all spatial regions equally leads to over-generated artifacts in flat regions and insufficient detail in textured/edge regions.
3. Textual Misalignment: For text prompts extracted using DAPE/RAM, the influence of each word in cross-attention often fails to align spatially with the semantic region that word describes, rendering textual guidance largely ineffective.
Key Challenge: There is a natural conflict between fidelity (staying true to the LQ structure) and reality (generating rich details). To create more details, more noise or generative priors must be released, which in turn disrupts the latent distribution and damages structural fidelity. Existing methods either sacrifice fidelity for perception or suppress generation to maintain structure.
Goal: To develop a one-step diffusion super-resolution method that can "controllably balance fidelity and reality" while addressing the three aforementioned weaknesses.
Core Idea: Rather than treating all regions alike, the method activates generative priors differentially by region (injecting stronger noise where texture makes the content uncertain and weaker noise in flat areas), restores fidelity by modulating intermediate features with uncompressed LQ information, and ensures textual alignment via mask constraints. These three components correspond one-to-one to the three identified limitations.
Method¶
Overall Architecture¶
CODSR is a one-step diffusion super-resolution network built on SD 2.1-base. The input is a degraded low-quality image \(x_L\), and the output is a high-quality reconstruction. The pipeline consists of one forward noising step followed by one reverse denoising step. \(x_L\) is first encoded into the latent \(z_L\) by the VAE; instead of denoising \(z_L\) directly, RGPA builds spatially adaptive noise from a Sobel gradient map and applies a single forward diffusion step to obtain the noisy latent \(z_t\). During denoising, LQFM modulates shallow U-Net features with uncompressed LQ features to recover fidelity information. Simultaneously, TMG uses noun masks from Grounded-SAM2 to constrain cross-attention so that each text token acts only on its corresponding semantic region. Each module targets one of the three issues, and RGPA and LQFM are coupled through the temporal coefficients \(w_t\) and \(\lambda_t\) so that they cooperate within the single step.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["LQ Image x_L"] --> B["VAE Encoder<br/>→ latent z_L"]
B --> C["Region-adaptive Generative Prior Activation (RGPA)<br/>Sobel Gradient → Adaptive Noise ε_a → One-step Forward to z_t"]
C --> D["U-Net One-step Denoising"]
A --> E["LQ-guided Feature Modulation (LQFM)<br/>pixel-unshuffle + Time-aware SFT modulation of shallow features"]
E --> D
A --> F["Textual Matching Guidance (TMG)<br/>Grounded-SAM2 Noun Mask constraints on cross-attention"]
F --> D
D --> G["VAE Decoder → HQ Reconstruction"]
Key Designs¶
1. Region-adaptive Generative Prior Activation (RGPA): Targeted noise injection by gradient map
Addressing "non-partitioned prior activation." Existing one-step methods feed \(z_L\) directly into the denoising network, deviating from the diffusion-native mode and resulting in artifacts in flat areas and insufficient texture. RGPA actively injects Gaussian noise into the latent before denoising, but the amount of noise is spatially adaptive. \(x_L\) is encoded to \(z_L\), then a one-step forward pass constructs the noisy latent:
$$z_t = \sqrt{\bar\alpha_t}\, z_L + \sqrt{1-\bar\alpha_t}\, \epsilon_a$$
The reverse pass executes a single step, using the U-Net predicted noise \(\epsilon_\theta(z_t,c,t)\) to recover the clean latent:
$$\hat z_H = z_L + w_t\big(\epsilon_a - \epsilon_\theta(z_t,c,t)\big),\quad w_t=\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}$$
The key lies in the adaptive noise \(\epsilon_a\): a Sobel gradient map \(g_{x_L}\) is calculated for \(x_L\) and passed through a mapping operator \(\mathcal{W}(\cdot)\) (involving \(16\times16\) patch averaging and a piecewise transformation) to obtain region-wise noise weights. Finally, \(\epsilon_a = \mathcal{W}(g_{x_L})\odot\epsilon\), where \(\epsilon\sim\mathcal N(0,I)\). High-frequency areas receive stronger noise to encourage generative exploration, while low-frequency flat areas receive weaker noise to preserve structure. The timestep \(t_s\in[t_{min},t_{max}]\) is randomly sampled during training for robustness; during inference, adjusting \(t_s\) allows for continuous sliding between fidelity and generation quality, providing "controllability."
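The RGPA steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-level threshold map standing in for \(\mathcal{W}(\cdot)\), the nearest-neighbour upsampling to an 8x-downsampled latent grid, the value of `alpha_bar`, and the random array used in place of the U-Net prediction are all hypothetical.

```python
import numpy as np

def sobel_gradient(gray):
    """Sobel gradient magnitude of a 2-D image, using edge padding."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

def noise_weights(gray, patch=16, latent_down=8, thresh=0.1, low=0.5, high=1.0):
    """Hypothetical stand-in for W(.): patch averaging plus a two-level map."""
    g = sobel_gradient(gray)
    g = g / (g.max() + 1e-8)                        # normalise to roughly [0, 1]
    h, w = g.shape
    pooled = g.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    weights = np.where(pooled > thresh, high, low)  # assumed piecewise transform
    rep = patch // latent_down                      # assumed upsampling to latent grid
    return weights.repeat(rep, axis=0).repeat(rep, axis=1)

def forward_noising(z_L, weights, alpha_bar, rng):
    """z_t = sqrt(alpha_bar) * z_L + sqrt(1 - alpha_bar) * eps_a."""
    eps = rng.standard_normal(z_L.shape)
    eps_a = weights[None] * eps                     # eps_a = W(g) * eps, per location
    z_t = np.sqrt(alpha_bar) * z_L + np.sqrt(1.0 - alpha_bar) * eps_a
    return z_t, eps_a

def one_step_reverse(z_L, eps_a, eps_pred, alpha_bar):
    """z_hat = z_L + w_t * (eps_a - eps_pred), w_t = sqrt(1-alpha_bar)/sqrt(alpha_bar)."""
    w_t = np.sqrt(1.0 - alpha_bar) / np.sqrt(alpha_bar)
    return z_L + w_t * (eps_a - eps_pred)

rng = np.random.default_rng(0)
gray = np.full((64, 64), 0.5)                       # flat left half
gray[:, 32:] = (np.arange(32, 64) // 2) % 2         # striped (textured) right half
weights = noise_weights(gray)                       # (8, 8) weight map
z_L = rng.standard_normal((4, 8, 8))                # stand-in for the VAE latent
alpha_bar = 0.9                                     # hypothetical noise level
z_t, eps_a = forward_noising(z_L, weights, alpha_bar, rng)
eps_pred = rng.standard_normal(z_L.shape)           # stand-in for the U-Net output
z_hat = one_step_reverse(z_L, eps_a, eps_pred, alpha_bar)
```

In this toy input the flat half receives the lower weight and the striped half the higher one. The reverse formula is the usual clean-latent estimate \((z_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta)/\sqrt{\bar\alpha_t}\) rewritten around \(z_L\), which is why a prediction equal to \(\epsilon_a\) returns \(z_L\) unchanged.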
2. LQ-guided Feature Modulation (LQFM): Restoring fidelity via uncompressed LQ information
Addressing "poor fidelity due to VAE compression." Loss of structural information during \(x_L \to z_L\) encoding is irreversible in the latent space. LQFM is a plug-and-play module that bypasses the latent and draws information directly from the raw LQ image. Pixel-unshuffle is applied to \(x_L\) to obtain features \(\tilde x_L\) without information loss. A time-aware Spatial Feature Transform (SFT) layer then modulates the output features \(f^m\) of the first U-Net convolutional layer:
$$\mathrm{SFT}(f^m\mid \tilde x_L) = (1+\lambda_t\gamma)\odot f^m + \lambda_t\beta$$
Modulation parameters \((\gamma, \beta) = \mathcal{M}(\tilde x_L)\) are generated by a two-layer MLP. The temporal coefficient \(\lambda_t = 1/w_t\) aligns modulation intensity with the RGPA noise schedule. By modulating intermediate features \(f^m\) instead of the latent \(z_L\) itself, the method avoids disrupting the latent distribution (Ablation Table 4).
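A small NumPy sketch of the two LQFM ingredients described above: a lossless pixel-unshuffle and the time-aware SFT formula. The unshuffle factor of 8, the channel sizes, the ReLU activation, and the random MLP weights are assumptions made for illustration; the summary does not specify them.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Rearrange (C, H, W) into (C*r*r, H/r, W/r) without discarding any pixel."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def pixel_shuffle(y, r):
    """Inverse of pixel_unshuffle, used here only to show the step is lossless."""
    c2, h, w = y.shape
    c = c2 // (r * r)
    y = y.reshape(c, r, r, h, w)
    return y.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

def mlp_gamma_beta(x_tilde, w1, b1, w2, b2):
    """Hypothetical two-layer per-pixel MLP M(.) producing (gamma, beta)."""
    feats = np.moveaxis(x_tilde, 0, -1)           # (h, w, C_in)
    hidden = np.maximum(feats @ w1 + b1, 0.0)     # ReLU is an assumption
    out = np.moveaxis(hidden @ w2 + b2, -1, 0)    # (2 * C_f, h, w)
    gamma, beta = np.split(out, 2, axis=0)
    return gamma, beta

def time_aware_sft(f_m, gamma, beta, lam_t):
    """SFT(f_m | x_tilde) = (1 + lam_t * gamma) * f_m + lam_t * beta."""
    return (1.0 + lam_t * gamma) * f_m + lam_t * beta

rng = np.random.default_rng(0)
x_L = rng.random((3, 32, 32))                     # toy LQ image
x_tilde = pixel_unshuffle(x_L, 8)                 # (192, 4, 4), same pixels rearranged
f_m = rng.standard_normal((16, 4, 4))             # stand-in for first-conv features
w1, b1 = 0.05 * rng.standard_normal((192, 32)), np.zeros(32)
w2, b2 = 0.05 * rng.standard_normal((32, 32)), np.zeros(32)
gamma, beta = mlp_gamma_beta(x_tilde, w1, b1, w2, b2)

alpha_bar = 0.9                                   # hypothetical noise level
w_t = np.sqrt(1.0 - alpha_bar) / np.sqrt(alpha_bar)
lam_t = 1.0 / w_t                                 # lambda_t = 1 / w_t
f_mod = time_aware_sft(f_m, gamma, beta, lam_t)
```

Because the unshuffle only rearranges pixels, `pixel_shuffle(x_tilde, 8)` returns the original image exactly, which is the sense in which the LQ features are "uncompressed." Setting `lam_t` to zero leaves `f_m` untouched, so the coefficient acts as a gain on the modulation.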
3. Textual Matching Guidance (TMG): Anchoring text to semantic regions via noun masks
Addressing "textual misalignment." TMG extracts a prompt with RAM and uses NLTK to strip abstract adjectives, keeping only the nouns \(\{n_1,\dots,n_N\}\). The nouns and \(x_L\) are fed into Grounded-SAM2 to obtain binary region masks \(\{M^1,\dots,M^N\}\), which explicitly define where each word should act during the reverse process. CoMat's positive area loss then constrains each cross-attention map \(A^i\) to align with its mask \(M^i\). Masks can be pre-computed and cached offline, so no additional segmentation overhead is incurred during inference.
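To make the mask constraint concrete, the sketch below scores how much of each noun's cross-attention mass falls inside its mask. The exact form of CoMat's positive area loss is not given in this summary, so the formula here (one minus the in-mask attention share, averaged over nouns) is an assumed, simplified stand-in rather than the loss used by CODSR.

```python
import numpy as np

def positive_area_loss(attn_maps, masks, eps=1e-8):
    """Assumed form: mean over nouns of 1 - (attention inside mask / total attention).

    attn_maps: (N, h, w) non-negative cross-attention maps A^i, one per noun.
    masks:     (N, h, w) binary region masks M^i for the same nouns.
    """
    attn = np.asarray(attn_maps, dtype=np.float64)
    m = np.asarray(masks, dtype=np.float64)
    inside = (attn * m).sum(axis=(1, 2))
    total = attn.sum(axis=(1, 2)) + eps
    return float(np.mean(1.0 - inside / total))

# Two hypothetical nouns on a 4x4 attention grid.
masks = np.zeros((2, 4, 4))
masks[0, :2, :] = 1.0                      # noun 1 occupies the top half
masks[1, 2:, :] = 1.0                      # noun 2 occupies the bottom half

aligned = masks.copy()                     # attention exactly on the masks
misaligned = 1.0 - masks                   # attention entirely off the masks
uniform = np.ones((2, 4, 4))               # attention spread over the whole image

loss_aligned = positive_area_loss(aligned, masks)
loss_misaligned = positive_area_loss(misaligned, masks)
loss_uniform = positive_area_loss(uniform, masks)
```

Under this assumed form, the loss is near zero when attention sits inside the mask, near one when it sits entirely outside, and about one half for uniform attention over masks that each cover half the grid.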
Loss & Training¶
Training follows a two-stage strategy. Stage 1 uses pixel-level content loss + LPIPS perceptual loss + GAN loss (from S3Diff) to establish realistic details. Stage 2 adopts the dual LoRA strategy from PiSA-SR: LoRAs from Stage 1 are frozen, and new LoRAs are added to U-Net cross-attention layers to enhance textual alignment, utilizing VSD loss (weight of 2) to distill semantic knowledge. The objective is:
$$\mathcal L = \mathcal L_{\text{OSEDiff}} + \eta_{pos}\mathcal L_{pos}$$
where \(\mathcal L_{pos}\) is the positive area loss from CoMat with \(\eta_{pos}=1\). LoRA ranks are 4 for the VAE encoder and 16 for the U-Net. Training uses four 4090 GPUs with batch size 16, the AdamW optimizer, and a learning rate of \(5\times10^{-5}\).
Key Experimental Results¶
Main Results¶
CODSR was compared against full-step and one-step diffusion methods on four real-world datasets: RealSR, DrealSR, RealPhoto60, and RealDeg. Representative metrics from DrealSR and RealSR are shown below (PSNR/SSIM ↑ higher better, LPIPS/DISTS/NIQE ↓ lower better, MUSIQ/MANIQA/CLIPIQA+ ↑ higher better):
| Dataset | Metric | OSEDiff | PiSA-SR | TVT | HYPIR | Ours CODSR |
|---|---|---|---|---|---|---|
| DrealSR | PSNR↑ | 27.92 | 28.32 | 28.27 | 26.04 | 28.19 |
| DrealSR | LPIPS↓ | 0.2968 | 0.2959 | 0.2900 | 0.3356 | 0.2919 |
| DrealSR | DISTS↓ | 0.2165 | 0.2169 | 0.2205 | 0.2333 | 0.2108 |
| DrealSR | NIQE↓ | 6.49 | 6.17 | 7.03 | 6.39 | 5.97 |
| DrealSR | MUSIQ↑ | 64.65 | 66.11 | 65.56 | 61.03 | 67.05 |
| DrealSR | CLIPIQA+↑ | 0.5181 | 0.5290 | 0.5226 | 0.4885 | 0.5589 |
| RealSR | LPIPS↓ | 0.2920 | 0.2672 | 0.2597 | 0.3046 | 0.2741 |
| RealSR | MUSIQ↑ | 69.08 | 70.15 | 69.89 | 66.42 | 70.54 |
| RealSR | MANIQA↑ | 0.6331 | 0.6552 | 0.6232 | 0.6510 | 0.6727 |
CODSR leads in all no-reference metrics across all datasets: NIQE, MUSIQ, and CLIPIQA+ improve by at least 0.15, 0.39, and 0.0133 respectively (NIQE is lower-is-better, so its gain is a reduction). Compared to full-step methods, MUSIQ on DrealSR and RealDeg improves by at least 1.96 and 5.15. While CODSR is not first in PSNR/SSIM, it achieves the best perceptual quality while maintaining competitive fidelity.
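The margins on the rows reproduced in the table above can be checked with a few lines of Python. The numbers are copied from that table only; rows and datasets not shown there (SSIM, RealPhoto60, RealDeg, full-step baselines) are not covered, and the helper name `margin` is just for illustration.

```python
# (dataset, metric, higher_is_better, [OSEDiff, PiSA-SR, TVT, HYPIR], CODSR)
rows = [
    ("DrealSR", "PSNR", True, [27.92, 28.32, 28.27, 26.04], 28.19),
    ("DrealSR", "LPIPS", False, [0.2968, 0.2959, 0.2900, 0.3356], 0.2919),
    ("DrealSR", "DISTS", False, [0.2165, 0.2169, 0.2205, 0.2333], 0.2108),
    ("DrealSR", "NIQE", False, [6.49, 6.17, 7.03, 6.39], 5.97),
    ("DrealSR", "MUSIQ", True, [64.65, 66.11, 65.56, 61.03], 67.05),
    ("DrealSR", "CLIPIQA+", True, [0.5181, 0.5290, 0.5226, 0.4885], 0.5589),
    ("RealSR", "LPIPS", False, [0.2920, 0.2672, 0.2597, 0.3046], 0.2741),
    ("RealSR", "MUSIQ", True, [69.08, 70.15, 69.89, 66.42], 70.54),
    ("RealSR", "MANIQA", True, [0.6331, 0.6552, 0.6232, 0.6510], 0.6727),
]

def margin(higher_is_better, competitors, ours):
    """Signed gap to the best listed competitor; positive means CODSR is ahead."""
    best = max(competitors) if higher_is_better else min(competitors)
    gap = ours - best if higher_is_better else best - ours
    return round(gap, 4)

margins = {(d, m): margin(hib, comp, ours) for d, m, hib, comp, ours in rows}
for (dataset, metric), gap in margins.items():
    print(f"{dataset:8s} {metric:9s} {gap:+.4f}")
```

On these rows CODSR is ahead on every no-reference metric and on DISTS, and behind the best listed one-step method on DrealSR PSNR (by 0.13), DrealSR LPIPS (by 0.0019), and RealSR LPIPS (by 0.0144), which matches the "best perception, competitive fidelity" reading.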
Ablation Study¶
Module-wise ablation (DrealSR, Table 2):
| Configuration | LPIPS↓ | MUSIQ↑ | CLIPIQA+↑ | Description |
|---|---|---|---|---|
| Base | 0.2906 | 65.89 | 0.5038 | No modules |
| + RGPA | 0.2914 | 66.26 | 0.5109 | Prior release, Perception ↑ |
| + RGPA & LQFM | 0.2902 | 66.27 | 0.5213 | Fidelity restored, CLIPIQA+ ↑ |
| Full (+ TMG) | 0.2919 | 67.05 | 0.5589 | TMG gives a significant jump |
Comparison of noise strategies (DrealSR, Table 3):
| Noise Strategy | PSNR↑ | LPIPS↓ | MUSIQ↑ | CLIPIQA+↑ | Description |
|---|---|---|---|---|---|
| Base | 28.39 | 0.2906 | 65.89 | 0.5038 | No noise |
| \(x_L+\epsilon_a\) (Image domain) | 28.29 | 0.2917 | 65.84 | 0.5020 | Ineffective activation |
| \(z_L+\epsilon\) (Standard Gaussian) | 28.04 | 0.2974 | 66.48 | 0.5133 | Perception ↑ but fidelity collapse |
| RGPA (Ours) | 28.24 | 0.2914 | 66.26 | 0.5109 | Optimal balance |
Key Findings¶
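- Complementary Modules: RGPA primarily boosts no-reference perception (MUSIQ 65.89 → 66.26), LQFM recovers LPIPS (0.2914 → 0.2902) while raising CLIPIQA+ (0.5109 → 0.5213), and TMG provides the largest jump in semantic quality (MUSIQ 66.27 → 67.05, CLIPIQA+ 0.5213 → 0.5589).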
- Latent-Domain Adaptive Noise is Essential: Image-domain noise is ineffective due to misalignment with the noise pathway. Standard Gaussian noise in the latent space destroys fidelity; only RGPA's adaptive latent noise preserves both.
- Intermediate vs. Latent Modulation: Modulating \(z_L\) in LQFM destroys its diagonal Gaussian distribution, significantly hurting no-reference metrics. SFT outperforms element-wise addition.
- Timestep \(t_s\) as a Control Knob: Larger \(t_s\) (more noise) leads to higher MUSIQ but lower PSNR, showing a clear trade-off. Default is \(t_s=100\).
- VSD Loss is Critical: Removing it drops MUSIQ from 67.05 to 65.56 and CLIPIQA+ from 0.5589 to 0.5009.
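To see why \(t_s\) behaves as a control knob, the sketch below evaluates \(w_t=\sqrt{1-\bar\alpha_t}/\sqrt{\bar\alpha_t}\) and \(\lambda_t = 1/w_t\) at a few timesteps. It assumes the scaled-linear beta schedule commonly shipped with Stable Diffusion checkpoints (1000 steps, betas from 0.00085 to 0.012); the summary does not state which schedule CODSR uses, and the timesteps other than the default \(t_s=100\) are arbitrary.

```python
import numpy as np

def alpha_bar_scaled_linear(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) under an assumed scaled-linear schedule."""
    betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps) ** 2
    return np.cumprod(1.0 - betas)

def coupling_coefficients(t_s, alpha_bar):
    """Return (w_t, lambda_t) with w_t = sqrt(1-ab)/sqrt(ab) and lambda_t = 1/w_t."""
    ab = alpha_bar[t_s]
    w_t = np.sqrt(1.0 - ab) / np.sqrt(ab)
    return w_t, 1.0 / w_t

alpha_bar = alpha_bar_scaled_linear()
coeffs = {t_s: coupling_coefficients(t_s, alpha_bar) for t_s in (50, 100, 200)}
for t_s, (w_t, lam_t) in coeffs.items():
    print(f"t_s={t_s:3d}  w_t={w_t:.4f}  lambda_t={lam_t:.4f}")
```

Since \(\bar\alpha_t\) decreases with \(t\), a larger \(t_s\) raises \(w_t\) (more injected noise and a larger generative correction) and lowers \(\lambda_t\) (weaker LQ-guided modulation), which is consistent with the reported MUSIQ-up, PSNR-down trade-off.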
Highlights & Insights¶
- Intuitive "Inject noise where uncertain" approach: Changing noise from uniform to gradient-weighted solves both flat-region artifacts and texture deficiency with a lightweight implementation.
- Clean fidelity restoration: Bypassing VAE compression by modulating intermediate features rather than the latent itself provides a valuable lesson for condition injection in latent diffusion models.
- Practical TMG design: Moving the cost of heavy segmentation modules (Grounded-SAM2) to the training phase via offline caching allows inference to remain a single forward pass.
- Temporal Coupling: The coefficients \(w_t\) and \(\lambda_t = 1/w_t\) tie the intensity of RGPA and LQFM to the same timestep, so the two modules stay balanced as the noise level changes.
Limitations & Future Work¶
- The pipeline depends on multiple external pre-trained components (Grounded-SAM2, RAM, DAPE, SD 2.1). Although masks are cached, the training cost and the dependency on these models remain.
- Fidelity metrics (PSNR/SSIM) are not SOTA, so CODSR is not necessarily the first choice for scenarios requiring strict pixel-level fidelity.
- TMG relies on noun-level masks, which might fail for abstract textures or scenes without clear objects (e.g., clear sky).
- Mapping operator \(\mathcal{W}(\cdot)\) is somewhat empirical.
Related Work & Insights¶
- vs. OSEDiff: CODSR is a systematic enhancement, adding RGPA, LQFM, and TMG to the OSEDiff framework.
- vs. PiSA-SR / TVT: While they rely on dual LoRA or VAE fine-tuning, CODSR feeds uncompressed pixel-unshuffle features into the U-Net, addressing the VAE compression loss more directly.
- vs. DAPE: Unlike prior methods that extract prompts without spatial alignment, CODSR explicitly constrains text-vision spatial consistency via Grounded-SAM2 masks.
Rating¶
- Novelty: ⭐⭐⭐⭐ Targeted designs for one-step diffusion SR. RGPA is particularly novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive testing across four datasets and eight metrics with thorough ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear mapping between pain points and designs, though some details are deferred to supplementary materials.
- Value: ⭐⭐⭐⭐ Significant practical value in fidelity-reality balancing; code is open source.
Related Papers¶
- [CVPR 2026] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
- [CVPR 2026] IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
- [CVPR 2026] FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
- [CVPR 2026] Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
- [CVPR 2026] Language-Guided One-Step Diffusion Model for Nighttime Flare Removal