XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

Conference: ECCV 2024
arXiv: 2403.05049
Code: https://github.com/qyp2000/XPSR
Area: Image Restoration
Keywords: Image Super-Resolution, Diffusion Models, Multimodal Large Language Models (MLLMs), Cross-modal Semantic Priors, ControlNet

TL;DR

XPSR uses high-level and low-level semantic descriptions generated by a Multimodal Large Language Model (LLaVA) as cross-modal priors for diffusion-based super-resolution. Semantic-Fusion Attention injects these priors into the diffusion model, and a Degradation-Free Constraint pushes ControlNet to extract semantic-preserving features. Together they yield restorations with higher fidelity and realism.

Background & Motivation

Background: Diffusion-based image super-resolution (ISR) leverages the generative priors of pre-trained text-to-image (T2I) models (such as Stable Diffusion) and injects low-resolution (LR) image information through mechanisms like ControlNet to restore high-resolution (HR) images. Representative methods include StableSR, DiffBIR, PASD, and SeeSR.
Limitations of Prior Work: (1) StableSR and DiffBIR set the text prompt to empty, relying completely on semantic extraction from LR images; however, the semantic information in LR images is severely lost after undergoing complex degradation. (2) PASD and SeeSR utilize tagging models to extract object categories as prompts, failing to capture complex information like spatial positions and scene layouts. (3) Existing prompts consistently overlook low-level information such as image quality, noise, and blur, which is crucial for ISR.
Key Challenge: The generation process of T2I diffusion models is fundamentally guided by text prompts, but in ISR scenarios, LR images are heavily degraded. Simple tag-level prompts cannot provide sufficiently rich semantic guidance, leading to incorrect content restoration or unrealistic artifacts.
Goal: (1) How can accurate and comprehensive semantic conditions be acquired? (2) How can cross-modal priors of different levels be fused effectively? (3) How can features that preserve semantics but are independent of degradation be extracted from LR images?
Key Insight: The authors observe that high-level semantic priors (object description, spatial layout) help restore semantically correct content, while low-level semantic priors (quality, sharpness, noise) assist in modeling the degradation process for clearer restoration. Multimodal Large Language Models (MLLMs, such as LLaVA) are uniquely capable of perceiving both types of information simultaneously.
Core Idea: Generating high- and low-level dual-semantic prompts with an MLLM, fusing them into the diffusion model via parallel cross-attention, and extracting degradation-free features using pixel-latent dual-space constraints.

Method

Overall Architecture

XPSR consists of two stages: (1) Semantic Prior Generation: utilizing LLaVA to generate high-level descriptions (content, scene) and low-level descriptions (quality, noise) for LR images, which are encoded into two types of embeddings via a CLIP text encoder. (2) Image Restoration: based on the SD + ControlNet architecture, fusing the dual-level semantic priors using the proposed SFA, and constraining ControlNet with DFC to extract semantic-preserving features. During inference, only ControlNet, UNet, and the MLLM are required.

Key Designs

MLLM Semantic Prompt Generation:
- Function: Extracting high-level and low-level semantic descriptions from the LR image using LLaVA.
- Mechanism: Designing two instructions: a high-level instruction: "Please provide a descriptive summary of the content of this image" to generate descriptions containing objects, spatial layouts, and scenes; and a low-level instruction: "Please describe the quality of this image and evaluate it based on factors such as clarity, color, noise, and lighting" to generate descriptions for quality, clarity, noise, and lighting.
- Design Motivation: High-level priors provide rich semantics to ensure correct content restoration, while low-level priors help model the degradation process to achieve clearer restoration. Visualized experiments show that both are indispensable.
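The prompt-generation step can be sketched as follows. The two instruction strings are the ones quoted above; the `ask_mllm` callable is a hypothetical stand-in for whatever LLaVA inference wrapper is used (it is not an API from the XPSR repository), and the dictionary keys are illustrative.

```python
from typing import Any, Callable, Dict

# Instruction strings as quoted in the mechanism description above.
HIGH_LEVEL_INSTRUCTION = (
    "Please provide a descriptive summary of the content of this image"
)
LOW_LEVEL_INSTRUCTION = (
    "Please describe the quality of this image and evaluate it based on "
    "factors such as clarity, color, noise, and lighting"
)


def generate_semantic_prompts(
    lr_image: Any,
    ask_mllm: Callable[[Any, str], str],
) -> Dict[str, str]:
    """Query the MLLM twice on the same LR image, once per instruction.

    `ask_mllm(image, instruction)` is an assumed interface: it should return
    the model's text answer for the given image and instruction.
    """
    return {
        "high": ask_mllm(lr_image, HIGH_LEVEL_INSTRUCTION),  # content, layout, scene
        "low": ask_mllm(lr_image, LOW_LEVEL_INSTRUCTION),    # quality, noise, lighting
    }
```

Both returned strings would then be passed through the CLIP text encoder to obtain the two embeddings used in the restoration stage.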
Semantic-Fusion Attention (SFA):
- Function: Effectively fusing high-level and low-level semantic priors into the diffusion model.
- Mechanism: A parallel dual-branch cross-attention replaces the sequential structure. The high-level and low-level priors each interact with the features through their own cross-attention, and a fusion attention then combines the two results, using the high-level result as $Q$ and the low-level result as $K/V$: $$\mathbf{x}_{k+1} = \mathcal{CA}_f(\mathcal{CA}_h(\mathbf{x}_k, c_h), \mathcal{CA}_l(\mathbf{x}_k, c_l))$$
- Design Motivation: In a sequential structure, information processed later tends to override information processed earlier, whereas the parallel structure lets the model select between the two priors adaptively and in a balanced way. The UNet uses only high-level attention (its input is noise, so it does not need low-level degradation understanding), while ControlNet uses SFA for full fusion.
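The data flow of SFA can be illustrated with a deliberately stripped-down sketch in plain Python. It assumes single-head scaled dot-product attention with no learned projections, no residual connections, and no normalization, so it only shows how the three attentions are wired together; the function names are illustrative and not taken from the XPSR code.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention; queries attend over the rows of k/v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def semantic_fusion_attention(x: Matrix, c_h: Matrix, c_l: Matrix) -> Matrix:
    """x_{k+1} = CA_f(CA_h(x_k, c_h), CA_l(x_k, c_l))."""
    high = cross_attention(x, c_h, c_h)  # CA_h: features attend to high-level prior
    low = cross_attention(x, c_l, c_l)   # CA_l: features attend to low-level prior
    # CA_f: high-level result is the query, low-level result is key and value.
    return cross_attention(high, low, low)
```

Because the two branches read the same input `x` independently, neither prior is processed "after" the other; the only place they meet is the final fusion attention.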
Degradation-Free Constraint (DFC):
- Function: Constraining ControlNet to extract features that preserve semantics but are independent of degradation.
- Mechanism: Applying $L_1$ constraints at two levels, pixel space and latent space. In pixel space, each layer of the ControlNet image encoder is mapped to an RGB image via convolution and aligned with the downsampled HR image; in latent space, various layers of the UNet encoder are mapped to the latent space and aligned with the downsampled HR latent representation. $$\mathcal{L}_{DFC} = \sum_{i=1}^{3} \|x_{hr,i} - \hat{x}_i\|_1 + \sum_{j=1}^{3} \|z_{hr,j} - \hat{z}_j\|_1$$ Here $\hat{x}_i$ and $\hat{z}_j$ are the RGB and latent predictions mapped from the $i$-th and $j$-th layers, and $x_{hr,i}$ and $z_{hr,j}$ are the HR image and HR latent downsampled to the matching resolution.
- Design Motivation: LR images contain a mixture of degradation and semantic information. DFC forces the features to retain only semantics and discard degradation-related components through HR alignment.
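As a rough numerical illustration of the DFC formula, the sketch below treats every prediction and target as a flattened list of floats and sums the $L_1$ distances over the three pixel-space and three latent-space scales. The list-based representation and function names are assumptions made for readability; a real implementation would operate on tensors.

```python
from typing import List, Sequence


def l1_distance(target: Sequence[float], pred: Sequence[float]) -> float:
    """L1 norm of (target - pred) for two equally sized flattened arrays."""
    if len(target) != len(pred):
        raise ValueError("target and prediction must have the same size")
    return sum(abs(t - p) for t, p in zip(target, pred))


def dfc_loss(
    hr_pixels: List[Sequence[float]],
    pred_pixels: List[Sequence[float]],
    hr_latents: List[Sequence[float]],
    pred_latents: List[Sequence[float]],
) -> float:
    """Sum of per-scale L1 terms in pixel space plus those in latent space."""
    pixel_term = sum(l1_distance(t, p) for t, p in zip(hr_pixels, pred_pixels))
    latent_term = sum(l1_distance(t, p) for t, p in zip(hr_latents, pred_latents))
    return pixel_term + latent_term
```

Each list argument is expected to hold one entry per scale (three in the formula), with the HR target already downsampled to the resolution of the corresponding prediction.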

Loss & Training

The total loss is $$\mathcal{L} = \mathcal{L}_D + \lambda \mathcal{L}_{DFC}$$ where $\mathcal{L}_D$ is the standard diffusion denoising loss. All parameters of SD are frozen, and only ControlNet and Conditional Attention are trained. During inference, classifier-free guidance is employed with the negative prompt set to "blurry, dotted, noise, unclear, low-res, over-smoothed".
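
The objective and the guidance step can be written out as a small sketch. The value of $\lambda$ and the guidance scale are not given in this summary, so both are left as parameters; `cfg_combine` uses the commonly used classifier-free guidance formula, with the negative-prompt prediction playing the role of the unconditional branch, and is an illustration, not the authors' code.

```python
from typing import List

# Negative prompt quoted in the text above.
NEGATIVE_PROMPT = "blurry, dotted, noise, unclear, low-res, over-smoothed"


def total_loss(l_d: float, l_dfc: float, lam: float) -> float:
    """L = L_D + lambda * L_DFC (lam is a hyperparameter not specified here)."""
    return l_d + lam * l_dfc


def cfg_combine(
    eps_pos: List[float], eps_neg: List[float], scale: float
) -> List[float]:
    """Classifier-free guidance on flattened noise predictions.

    eps_pos: prediction conditioned on the positive prompt.
    eps_neg: prediction conditioned on the negative prompt.
    scale:   guidance scale (assumed parameter); 1.0 returns eps_pos unchanged.
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]
```

With a scale above 1, the combined prediction is pushed away from what the negative prompt describes (blur, noise, over-smoothing) and toward the positive prompt.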

Key Experimental Results

Main Results

| Dataset | Method | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | LIQE↑ |
|---|---|---|---|---|---|
| DIV2K-Val | StableSR | 0.621 | 64.22 | 0.395 | 4.13 |
| DIV2K-Val | SeeSR | 0.655 | 66.43 | 0.420 | 4.25 |
| DIV2K-Val | XPSR | 0.689 | 68.71 | 0.441 | 4.38 |
| RealSR | StableSR | 0.588 | 63.80 | 0.381 | 3.98 |
| RealSR | XPSR | 0.651 | 67.13 | 0.422 | 4.21 |

Ablation Study

| Configuration | CLIPIQA↑ | MUSIQ↑ | Description |
|---|---|---|---|
| w/o High-level prompt | 0.645 | 65.8 | Inaccurate content semantic restoration |
| w/o Low-level prompt | 0.652 | 66.1 | Insufficient degradation modeling, blurry details |
| Sequential attention instead of SFA | 0.661 | 67.0 | Information overriding leads to suboptimal fusion |
| w/o DFC | 0.658 | 66.5 | Degradation information mixed into features |
| Full XPSR | 0.689 | 68.71 | Full model |
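
To make the ablation easier to read, the short sketch below computes how much CLIPIQA each ablated variant loses relative to the full model, using only the numbers in the table above. The dictionary and function names are illustrative.

```python
from typing import Dict, List, Tuple

# CLIPIQA values copied from the ablation table.
ABLATION_CLIPIQA: Dict[str, float] = {
    "w/o High-level prompt": 0.645,
    "w/o Low-level prompt": 0.652,
    "Sequential attention instead of SFA": 0.661,
    "w/o DFC": 0.658,
}
FULL_CLIPIQA = 0.689


def clipiqa_drops() -> List[Tuple[str, float]]:
    """Return (configuration, drop vs. full model), largest drop first."""
    drops = [
        (name, round(FULL_CLIPIQA - score, 3))
        for name, score in ABLATION_CLIPIQA.items()
    ]
    return sorted(drops, key=lambda item: item[1], reverse=True)


for name, drop in clipiqa_drops():
    print(f"{name}: -{drop:.3f} CLIPIQA")
```

On these numbers, removing the high-level prompt costs the most CLIPIQA and swapping SFA for sequential attention costs the least, although all four variants fall below the full model.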

Key Findings

High-level and low-level semantic priors are complementary and neither can replace the other: Removing the high-level prompt lowers CLIPIQA from 0.689 to 0.645, and removing the low-level prompt lowers it to 0.652, confirming the need for dual-level semantic conditions.
Parallel SFA outperforms sequential fusion: The sequential structure reaches only 0.661 CLIPIQA and 67.0 MUSIQ, versus 0.689 and 68.71 for the full model, which the authors attribute to information overriding.
DFC's pixel and latent dual-space constraints are both indispensable: Removing either space constraint individually degrades performance.
The accuracy of low-level prompts is crucial: Visualization shows that incorrect low-level descriptions (e.g., describing a blurry image as clear) severely degrade restoration quality.

Highlights & Insights

MLLM as a semantic condition generator for ISR: This is an elegant cross-domain application, where LLaVA's ability to perceive high-level content and low-level quality at the same time compensates for the lack of semantic conditions in ISR.
Parallel fusion design in SFA: Two parallel cross-attention branches followed by a fusion attention address the information overriding problem in multi-condition fusion. The design should transfer to other generative tasks that need to fuse multiple conditions.
Discovery of low-level semantic priors: This work explicitly points out the significant value of image quality/degradation descriptions for ISR, which was overlooked in prior works.

Limitations & Future Work

MLLM inference increases computational overhead, requiring an additional LLaVA call per image to generate descriptions.
LLaVA's perception of LR images may not always be accurate, especially under extreme degradations.
The training data is produced by synthetic degradation pipelines, which still leave a distribution gap to real-world degradations.
End-to-end joint optimization of the MLLM and SD remains to be explored.

vs SeeSR: SeeSR uses a tagging model to extract object tags as prompts, whereas XPSR uses an MLLM to obtain richer descriptions and low-level quality information, providing more comprehensive semantic conditions.
vs PASD: PASD also introduces semantic information but relies solely on object tags, whereas XPSR's MLLM scheme provides higher-level information such as spatial layout and scene understanding.
vs StableSR/DiffBIR: They do not use text conditions and rely entirely on LR image features, resulting in severe loss of semantic information under complex degradations.

Rating

Novelty: ⭐⭐⭐⭐ Introduces high-level and low-level semantic understanding of MLLMs into ISR for the first time.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensively evaluated on synthetic and real datasets with multiple metrics, along with a thorough ablation study.
Writing Quality: ⭐⭐⭐⭐ Clear structure and highly convincing visualizations.
Value: ⭐⭐⭐⭐ Establishes a paradigm of introducing MLLM semantic conditions into the field of ISR.