Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach¶
Conference: CVPR 2025
arXiv: 2412.03017
Code: GitHub
Area: Image Restoration
Keywords: Image Super-Resolution, Dual LoRA, Adjustable Super-Resolution, Diffusion Models, Fidelity-Perception Trade-off
TL;DR¶
PiSA-SR decouples pixel-level regression and semantic-level enhancement into two separate weight spaces via a dual-LoRA module. It performs high-quality super-resolution in a single diffusion step and lets users trade fidelity against perceptual quality at inference time through two guidance scales.
Background & Motivation¶
Super-resolution based on diffusion priors faces three key challenges:
- Pixel Fidelity vs. Perceptual Quality: \(\ell_2\) loss ensures fidelity but causes over-smoothing, whereas GAN/perceptual losses enhance details but introduce artifacts.
- Objective Entanglement: Existing SD-based methods mix both objectives within a single diffusion process, making optimization difficult.
- Diverse User Preferences: Some users require content fidelity, while others favor rich semantic details, but existing models can only output a fixed style.
Core contribution: The SR problem is decomposed into two independently adjustable LoRA modules, allowing control over the SR style much like how CFG controls generation intensity.
Method¶
Overall Architecture¶
SD-based SR is formulated as residual learning, \(z_H = z_L - \lambda \epsilon_\theta(z_L)\), and two LoRAs are trained on SD 2.1:
1. Pixel-LoRA (\(\Delta\theta_{pix}\)): Removes degradation using \(\ell_2\) loss.
2. Semantic-LoRA (\(\Delta\theta_{sem}\)): Enhances semantic details using LPIPS + CSD loss.

During inference, the output intensities of the two LoRAs are controlled independently via \(\lambda_{pix}\) and \(\lambda_{sem}\).
Key Design 1: Residual Learning Formulation¶
- Function: Simplifies multi-step diffusion SR into single-step residual prediction, supporting output scaling during inference.
- Mechanism: Single-step diffusion directly generates the HQ latent \(z_H = z_L - \epsilon_\theta(z_L)\) from the LQ latent \(z_L\), where the "noise" predicted by the U-Net is actually the residual from LQ to HQ. During training, \(\lambda=1\) is fixed, while during inference, \(\lambda\) is adjusted to control the enhancement intensity.
- Design Motivation: Residual learning allows the model to focus on high-frequency information, avoiding the extraction of irrelevant information from the LQ latent and accelerating convergence. More importantly, introducing the scaling factor \(\lambda\) enables adjustability during inference.
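The scaled residual update described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `residual_sr_step` and `toy_unet` are hypothetical names, and the toy predictor simply returns the exact LQ-to-HQ residual so that the effect of \(\lambda\) is easy to see.

```python
import numpy as np

def residual_sr_step(z_l, predict_residual, lam=1.0):
    """Single-step residual update: z_H = z_L - lam * eps_theta(z_L)."""
    return z_l - lam * predict_residual(z_l)

# Toy stand-in for the U-Net (hypothetical): it "knows" the clean latent and
# returns the exact LQ-to-HQ residual, so lam = 1 recovers the target.
z_gt = np.array([1.0, 2.0, 3.0])
z_l = np.array([1.5, 1.0, 3.5])
toy_unet = lambda z: z - z_gt

z_h_full = residual_sr_step(z_l, toy_unet, lam=1.0)  # full enhancement
z_h_half = residual_sr_step(z_l, toy_unet, lam=0.5)  # halfway between LQ and HQ
z_h_none = residual_sr_step(z_l, toy_unet, lam=0.0)  # input left unchanged
```

With a real network the predicted residual is only approximate, so \(\lambda\) acts as a strength knob rather than an exact interpolation.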
Key Design 2: Decoupled Dual-LoRA Training¶
- Function: Completely separates pixel fidelity and semantic enhancement into two distinct parameter spaces.
- Mechanism: The pixel-LoRA is trained first (\(\ell_2\) loss, 4K steps). It is then frozen, and only the semantic-LoRA is trained (LPIPS + CSD loss, 8.5K steps) inside the combined PiSA-LoRA. During inference, the output is decomposed in the style of CFG: \(\epsilon_\theta = \lambda_{pix}\epsilon_{\theta_{pix}} + \lambda_{sem}(\epsilon_{\theta_{PiSA}} - \epsilon_{\theta_{pix}})\).
- Design Motivation: The sequential order of first removing degradation and then adding details ensures that semantic enhancement is not disturbed by noise or blur. The difference between the two LoRAs, \(\epsilon_{\theta_{PiSA}} - \epsilon_{\theta_{pix}}\), precisely isolates the pure semantic enhancement component, achieving orthogonal control.
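The inference-time combination can be sketched as follows. The function names and the two toy predictors are assumptions made for illustration; only the combination formula and the residual update come from the text above.

```python
import numpy as np

def combine_guidance(eps_pix, eps_pisa, lam_pix=1.0, lam_sem=1.0):
    """eps = lam_pix * eps_pix + lam_sem * (eps_pisa - eps_pix)."""
    return lam_pix * eps_pix + lam_sem * (eps_pisa - eps_pix)

def adjustable_sr(z_l, unet_pix, unet_pisa, lam_pix=1.0, lam_sem=1.0):
    """Two forward passes (pixel-LoRA and PiSA-LoRA), then one residual update."""
    eps_pix = unet_pix(z_l)    # pass with only the pixel-LoRA active
    eps_pisa = unet_pisa(z_l)  # pass with the combined PiSA-LoRA active
    return z_l - combine_guidance(eps_pix, eps_pisa, lam_pix, lam_sem)

# Hypothetical toy predictors standing in for the two U-Net configurations.
unet_pix = lambda z: 0.4 * z
unet_pisa = lambda z: 0.4 * z + 0.1

z_l = np.array([1.0, -2.0, 0.5])
z_default = adjustable_sr(z_l, unet_pix, unet_pisa)                  # both scales 1
z_pixel_only = adjustable_sr(z_l, unet_pix, unet_pisa, lam_sem=0.0)  # no semantic term
z_identity = adjustable_sr(z_l, unet_pix, unet_pisa, 0.0, 0.0)       # no enhancement
```

Algebraically, setting both scales to 1 reduces the combined prediction to \(\epsilon_{\theta_{PiSA}}\), and setting \(\lambda_{sem}=0\) leaves only the scaled pixel-level prediction.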
Key Design 3: Replacing VSD with CSD Loss¶
- Function: Efficiently utilizes semantic priors from pre-trained SD for semantic enhancement.
- Mechanism: CSD (Classifier Score Distillation) loss extracts semantic gradients via the difference between conditional and unconditional noise predictions of SD, avoiding the bi-level optimization required by VSD. The gradient is \(\nabla\ell_{CSD} = \mathbb{E}[w_t(f(z_t, \epsilon_{real}) - f(z_t, \epsilon_{real}^{\lambda_{cfg}}))]\).
- Design Motivation: While VSD is effective, its \(\lambda_{cfg}=0\) component actually weakens semantic details; the CFG component of CSD is the core contributor to semantic enhancement. CSD bypasses the bi-level optimization of VSD, significantly reducing GPU memory consumption and training instability.
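A rough numerical sketch of the CSD direction is given below. It rests on several assumptions that the summary does not spell out: \(f\) is taken to be the usual clean-latent estimate under \(z_t = \alpha_t z_0 + \sigma_t \epsilon\), \(\epsilon_{real}^{\lambda_{cfg}}\) is the standard classifier-free guidance combination, and \(\epsilon_{real}\) is taken to be the conditional prediction. All function names are hypothetical.

```python
import numpy as np

def x0_estimate(z_t, eps, alpha_t, sigma_t):
    """f(z_t, eps): clean-latent estimate, assuming z_t = alpha_t * z_0 + sigma_t * eps."""
    return (z_t - sigma_t * eps) / alpha_t

def cfg_prediction(eps_uncond, eps_cond, lam_cfg):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + lam_cfg * (eps_cond - eps_uncond)

def csd_direction(z_t, eps_uncond, eps_cond, lam_cfg, alpha_t, sigma_t, w_t=1.0):
    """w_t * (f(z_t, eps_real) - f(z_t, eps_real^{lam_cfg})).

    eps_real is taken to be the conditional prediction (an assumption of this sketch).
    """
    eps_cfg = cfg_prediction(eps_uncond, eps_cond, lam_cfg)
    return w_t * (x0_estimate(z_t, eps_cond, alpha_t, sigma_t)
                  - x0_estimate(z_t, eps_cfg, alpha_t, sigma_t))

rng = np.random.default_rng(0)
z_t, eps_u, eps_c = rng.normal(size=(3, 4))  # toy latent and noise predictions
alpha_t, sigma_t, lam_cfg = 0.8, 0.6, 7.5

g = csd_direction(z_t, eps_u, eps_c, lam_cfg, alpha_t, sigma_t)
# Under these assumptions the direction is a scaled conditional-unconditional gap.
expected = (sigma_t / alpha_t) * (lam_cfg - 1.0) * (eps_c - eps_u)
```

Under these assumptions the direction depends only on the gap between the conditional and unconditional predictions and vanishes when \(\lambda_{cfg}=1\). No second, trainable score network appears anywhere, which is the sense in which the bi-level optimization of VSD is avoided.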
Loss & Training¶
- Pixel-LoRA: \(\mathcal{L}_{pix} = \|z_H^{pix} - z_{GT}\|_2^2\)
- Semantic-LoRA: \(\mathcal{L}_{sem} = \mathcal{L}_{LPIPS} + \mathcal{L}_{CSD}\)
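The two-stage schedule and the pixel loss can be sketched as below. The step counts are the ones quoted in this note; that the two stages run back to back, and the names `stage_for_step` and `pixel_loss`, are assumptions made for illustration. The LPIPS and CSD terms need pretrained networks and are only named here, not implemented.

```python
import numpy as np

PIXEL_STEPS = 4_000     # pixel-LoRA stage length quoted in the text
SEMANTIC_STEPS = 8_500  # semantic-LoRA stage length quoted in the text

def stage_for_step(step):
    """Return which LoRA is trained and which losses are active at a given step.

    Assumes the semantic stage starts right after the pixel stage ends.
    """
    if not 0 <= step < PIXEL_STEPS + SEMANTIC_STEPS:
        raise ValueError("step outside the training schedule")
    if step < PIXEL_STEPS:
        return {"trainable": "pixel_lora", "frozen": [], "losses": ["l2"]}
    return {"trainable": "semantic_lora", "frozen": ["pixel_lora"],
            "losses": ["lpips", "csd"]}

def pixel_loss(z_h_pix, z_gt):
    """Squared l2 distance between the pixel-LoRA output and the ground-truth latent."""
    return float(np.sum((z_h_pix - z_gt) ** 2))

first = stage_for_step(0)
second = stage_for_step(PIXEL_STEPS)
loss = pixel_loss(np.array([1.0, 2.0]), np.array([0.0, 0.0]))
```

Freezing the pixel-LoRA in the second stage is what keeps the semantic-LoRA's contribution separable at inference time.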
Key Experimental Results¶
Main Results: Effect of Guidance Scales on RealSR¶
| \(\lambda_{pix}\) | \(\lambda_{sem}\) | PSNR↑ | LPIPS↓ | CLIPIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|
| 0.0 | 1.0 | 25.96 | 0.3426 | 0.4129 | 46.45 |
| 0.5 | 1.0 | 26.75 | 0.2646 | 0.5705 | 63.82 |
| 1.0 | 1.0 | 25.50 | 0.2672 | 0.6702 | 70.15 |
| 1.0 | 0.0 | ~28+ | ~0.35 | ~0.35 | ~40 |
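With \(\lambda_{sem}\) fixed at 1.0, raising \(\lambda_{pix}\) from 0.0 to 0.5 removes degradation and improves both PSNR (25.96 to 26.75) and LPIPS (0.3426 to 0.2646); at \(\lambda_{pix}=1.0\), PSNR falls to 25.50 while CLIPIQA and MUSIQ continue to rise. Values of \(\lambda_{pix}\) that are too large cause over-smoothing. Increasing \(\lambda_{sem}\) enriches details and improves perceptual metrics, but excessive values introduce artifacts.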
Comparison with SOTA Methods (Synthetic Data)¶
Under default settings, PiSA-SR outperforms or is competitive with methods such as StableSR, DiffBIR, SeeSR, and OSEDiff on metrics such as PSNR, LPIPS, CLIPIQA, and MUSIQ, while requiring only a single diffusion step.
Key Findings¶
- The difference between the outputs of the two LoRAs isolates semantic details, as the visualizations show.
- CSD outperforms VSD: more stable, lower memory footprint, and stronger semantic enhancement.
- Single-step inference is more stable and faster than multi-step methods.
- User studies validate the practical value of adjustable SR.
Highlights & Insights¶
- Migration of CFG Concept to SR: Applying the conditional/unconditional separation idea of CFG in generative models to the pixel/semantic separation in SR is a natural and innovative approach.
- Adjustability \(\neq\) Retraining: The two guidance scales can be adjusted directly during inference, eliminating the need to train separate models for different preferences.
- Generality of LoRA Decoupling: The strategy of using dual LoRAs to decouple distinct optimization objectives can be generalized to other restoration tasks.
Limitations & Future Work¶
- Adjustability requires running the U-Net twice during inference (once for pixel-LoRA and once for PiSA-LoRA).
- The optimal values of \(\lambda_{pix}\) and \(\lambda_{sem}\) vary across different images, meaning no universal optimal setting exists.
- Only supports \(\times 4\) upscaling; retraining is required for other scales.
- Based on SD 2.1; exploration of SDXL or newer foundation models was not conducted.
Related Work & Insights¶
- OSEDiff: A pioneer in single-step diffusion SR; PiSA-SR builds on it by introducing dual-LoRA decoupling.
- SeeSR: Diffusion SR guided by semantic tags; PiSA-SR achieves more direct semantic enhancement using CSD loss.
- LDL: A method utilizing local statistics constraints in GAN-based SR, which is related to the pixel-level LoRA objective of PiSA-SR.
Rating¶
⭐⭐⭐⭐ — The concept of pixel/semantic decoupling is simple yet effective, offering strong practical utility for adjustable SR and high efficiency with single-step inference. The extra overhead of dual U-Net inference and parameter sensitivity are minor drawbacks.