Disentangled Textual Priors for Diffusion-based Image Super-Resolution¶
Conference: CVPR 2026 · arXiv: 2603.07430 · Code: GitHub · Area: Image Super-Resolution · Keywords: Diffusion-based SR, text guidance, disentangled priors, frequency-awareness, semantic control
TL;DR¶
This paper proposes DTPSR, which disentangles textual priors along two orthogonal dimensions — spatial hierarchy (global/local) and frequency semantics (low-frequency/high-frequency) — and constructs a disentangled cross-attention injection pipeline along with a multi-branch CFG strategy, achieving superior perceptual quality in diffusion-based image super-resolution.
Background & Motivation¶
Diffusion models (e.g., Stable Diffusion) have demonstrated strong generative capabilities in image super-resolution, yet their performance critically depends on how semantic priors are constructed and injected. Existing methods exhibit three categories of limitations:
Insufficient semantic granularity: Local-tag methods (SeeSR) focus on fine details but lack global consistency; global-description methods (SUPIR, PASD) capture global structure but neglect fine-grained details.
Entangled frequency information: Existing methods embed structural information (low-frequency: shape, layout) and texture information (high-frequency: edges, material) within the same representation, leading to insufficient semantic controllability and interpretability.
Hallucination under severe degradation: Without disentangled semantic guidance, diffusion models are prone to hallucinations, such as misinterpreting a wall as an ocean texture.
Core insight: Disentangling textual priors along two orthogonal dimensions — spatial hierarchy and frequency semantics — enables the model to simultaneously capture scene-level structure and object-level detail.
Method¶
Overall Architecture¶
Given a low-resolution image \(x_{lr}\), the DTPSR pipeline proceeds as follows:

1. A VAE encoder maps \(x_{lr}\) to the latent space as \(z_0\).
2. Forward diffusion adds noise to obtain \(z_t\).
3. During reverse denoising, semantic injection is performed sequentially through four dedicated cross-attention modules.
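Step 2 above is the standard DDPM forward process, \(z_t = \sqrt{\bar{\alpha}_t}\,z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\). A minimal numpy sketch, assuming a conventional linear beta schedule (the paper does not specify its schedule; all names here are illustrative):

```python
import numpy as np

def add_noise(z0, t, alpha_bar):
    """DDPM forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = np.random.randn(*z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

# Toy schedule: cumulative product of (1 - beta_t) over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

z0 = np.random.randn(4, 8, 8)  # stand-in for a VAE latent
zt, eps = add_noise(z0, t=500, alpha_bar=alpha_bar)
```

At small \(t\) the latent is nearly clean; at large \(t\) it approaches pure Gaussian noise, which is what the reverse denoising steps then remove under textual guidance.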
Key Designs¶
- Global Text Cross-Attention (GTCA): The global description \(c_g\) is encoded by CLIP into \(e_g\) and injected into the latent variable via cross-attention to establish scene-level structural foundations. Design motivation: establishing global layout first (e.g., "an indoor scene with 3 objects") before progressive refinement.
- Low-Frequency Cross-Attention (LFCA): A set of low-frequency local descriptions \(\{c_{lf}^{(i)}\}\) (shape, size, spatial arrangement) is encoded and injected into \(z_t^g\) for object-level structural enhancement: \(E_{lf} = [\text{CLIP\_TextEnc}(c_{lf}^{(1)}), \dots, \text{CLIP\_TextEnc}(c_{lf}^{(n)})]\). Design motivation: low-frequency information governs structural fidelity; separating it from high-frequency texture prevents mutual interference.
- High-Frequency Cross-Attention (HFCA): High-frequency descriptions \(\{c_{hf}^{(j)}\}\) (texture, edges, surface details) are encoded and injected on top of the LFCA output: \(z_t^{hf} = \text{HFCA}(z_t^{lf}, E_{hf})\). Design motivation: high-frequency information governs visual realism; independent injection enables precise control over texture generation.
- Low-Resolution Feature Cross-Attention (LRCA): A frozen DAPE encoder extracts visual features \(f_{lr}\) from \(x_{lr}\), which are injected via cross-attention to anchor identity consistency with the input image and prevent semantic drift.
- Multi-branch Classifier-Free Guidance (Multi-branch CFG): Independent negative prompts \(c_g^{\text{neg}}, c_{lf}^{\text{neg}}, c_{hf}^{\text{neg}}\) are designed for the global, low-frequency, and high-frequency branches respectively, enabling disentangled semantic suppression: \(\tilde{\epsilon} = \hat{\epsilon} + \lambda_s(\hat{\epsilon} - \hat{\epsilon}_{\text{neg}})\). Design motivation: a single negative prompt cannot simultaneously address hallucinations from multiple semantic sources.
Loss & Training¶
- Training loss: Standard noise prediction MSE loss \(\mathcal{L} = \mathbb{E}[\|\epsilon - \epsilon_\theta(z_t, z_{lr}, t, c_g, c_{lf}, c_{hf})\|_2^2]\)
- Dataset DisText-SR: ~95K image-text pairs built on LSDIR + the first 10K images from FFHQ, with disentangled descriptions generated via Mask2Former segmentation + LLaVA captioning.
- Backbone: SD-2-base; DAPE encoder for LR embedding extraction.
- Training configuration: AdamW optimizer, lr \(5 \times 10^{-5}\), batch size 32, 110K iterations, 4× A800 GPUs.
- Inference: DDPM 50 steps, guidance scale \(\lambda_s = 7.0\).
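The training objective above is the plain noise-prediction MSE; a minimal sketch (the network \(\epsilon_\theta\) is replaced by a stand-in, since the conditioning machinery is not the point here):

```python
import numpy as np

def mse_loss(eps, eps_pred):
    """Noise-prediction objective: E[ ||eps - eps_theta(z_t, ...)||^2 ]."""
    return np.mean((eps - eps_pred) ** 2)

eps = np.random.randn(4, 8, 8)                     # ground-truth noise
eps_pred = eps + 0.1 * np.random.randn(4, 8, 8)    # stand-in for the U-Net output
loss = mse_loss(eps, eps_pred)
```

Only the noise-prediction head is supervised; the textual conditions \(c_g, c_{lf}, c_{hf}\) enter solely through the cross-attention modules inside \(\epsilon_\theta\).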
Key Experimental Results¶
Main Results¶
| Dataset | Metric | DTPSR | FaithDiff | SUPIR | Gain |
|---|---|---|---|---|---|
| DIV2K-Val | MUSIQ↑ | 71.24 | 69.18 | 62.59 | +2.06 |
| DIV2K-Val | MANIQA↑ | 0.5866 | 0.4309 | 0.5224 | +0.0642 |
| DIV2K-Val | CLIPIQA↑ | 0.7549 | 0.6463 | 0.7040 | +0.0509 |
| RealSR | MUSIQ↑ | 71.84 | 68.86 | 58.51 | +2.98 |
| RealSR | MANIQA↑ | 0.6021 | 0.4644 | 0.4429 | +0.0432 |
| DRealSR | CLIPIQA↑ | 0.7640 | 0.6335 | 0.6307 | +0.0729 |
Note: DTPSR achieves state-of-the-art performance on all no-reference perceptual metrics, though PSNR/SSIM are lower than GAN-based methods (perception–distortion tradeoff).
Ablation Study¶
| Configuration | MANIQA↑ | CLIPIQA↑ | MUSIQ↑ | Note |
|---|---|---|---|---|
| No prior | 0.5271 | 0.7064 | 67.48 | Baseline |
| Local only | 0.5851 | 0.7471 | 68.86 | Local prior contributes more |
| Global only | 0.5394 | 0.7211 | 67.80 | Global provides moderate gain |
| Global + Local | 0.6011 | 0.7640 | 69.24 | Complementary effect is optimal |
| Frequency entangled | 0.5947 | 0.7527 | 69.05 | Disentangled outperforms entangled |
| Frequency disentangled | 0.6011 | 0.7640 | 69.24 | Separate injection is more effective |
Key Findings¶
- Local priors contribute substantially more than global priors (MANIQA gain: 0.0580 vs. 0.0123).
- Frequency disentanglement consistently outperforms frequency mixing across all metrics.
- Multi-branch CFG significantly improves perceptual quality over single or no CFG (MUSIQ: 66.73→69.24).
- Even when text descriptions are randomly corrupted (replaced with "None"), DTPSR still outperforms competing methods, demonstrating robustness.
- With 10.5B parameters and 14.94s/image inference time, DTPSR is more efficient than SUPIR (17.8B) and FaithDiff (15.6B).
Highlights & Insights¶
- Elegant disentanglement design: Decomposing textual priors along spatial hierarchy × frequency semantics as two orthogonal dimensions is conceptually clear and empirically effective.
- DisText-SR dataset: The first large-scale SR dataset with combined global–local and low-frequency–high-frequency textual annotations, providing a foundation for controllable super-resolution research.
- Multi-branch CFG strategy: Achieves fine-grained hallucination suppression via frequency-aware negative prompts without additional training.
- Robustness experiments: The system remains functional even when upstream modules (segmentation, captioning) produce imperfect outputs.
Limitations & Future Work¶
- Full-reference metrics such as PSNR/SSIM fall short of GAN-based methods, reflecting the inherent perception–distortion tradeoff.
- The framework depends on the quality of upstream segmentation (Mask2Former) and captioning (LLaVA) models.
- Running segmentation and caption generation at inference time introduces additional end-to-end latency.
- Only the top-3 largest segmentation regions are processed, potentially missing small but semantically important regions.
- Future directions include adaptive prompt correction, tighter integration with upstream modules, and more efficient diffusion backbones.
Related Work & Insights¶
- SeeSR: Employs local semantic tags but focuses solely on details, lacking global consistency.
- SUPIR/PASD/FaithDiff: Utilize global descriptions but neglect frequency separation.
- StableSR/DiffBIR: Do not leverage textual semantics, thus failing to fully exploit diffusion priors.
- Insight: The disentangled textual prior paradigm is potentially generalizable to other conditional generation tasks such as image editing and inpainting.
Rating¶
- Novelty: ⭐⭐⭐⭐ The spatial–frequency dual disentanglement design for textual priors is novel; the multi-branch CFG strategy is creative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-metric evaluation with comprehensive ablations (global/local, frequency, CFG, robustness).
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, architectural diagrams are informative, and experiments are well-organized.
- Value: ⭐⭐⭐⭐ Establishes a new paradigm for text-guided diffusion super-resolution; the DisText-SR dataset offers practical utility.