The Power of Context: How Multimodality Improves Image Super-Resolution¶

Conference: CVPR 2025
arXiv: 2503.14503
Code: Project Page
Area: Image Super-Resolution / Image Segmentation
Keywords: Multimodal Super-Resolution, Diffusion Models, Classifier-Free Guidance, Multimodal Fusion, Image Restoration

TL;DR¶

Proposes MMSR, a diffusion-based super-resolution method that conditions on depth, semantic segmentation, edges, and a textual description. A Multimodal Latent Connector fuses these modalities and a multimodal CFG scheme guides sampling, which together suppress hallucinations and improve SR quality.

Background & Motivation¶

Single Image Super-Resolution (SISR) is an ill-posed problem, making the restoration of high-frequency details from low-resolution inputs highly challenging.
In recent years, diffusion-model-based super-resolution methods have made significant progress, often leveraging text prompts to activate pre-trained generative priors.
However, relying solely on textual descriptions has an inherent limitation: text cannot accurately convey spatial relationships, so a texture named in the prompt tends to be applied globally rather than to the region it belongs to.
For instance, with "lion" as the prompt, the model may generate fur texture on the tongue: lions have fur, but the prompt does not say where the fur should and should not appear.
Spatial modalities such as depth maps and semantic segmentation can provide complementary spatial information, reducing uncertainty in the super-resolution process.
From an information-theoretic perspective, conditioning on an auxiliary modality \(m\) cannot increase the entropy of the conditional distribution: \(H(p(\mathbf{x}|\mathbf{x}_{LR})) \geq H(p(\mathbf{x}|\mathbf{x}_{LR}, m))\).
Existing multimodal methods (e.g., ControlNet, IP-Adapter) duplicate network components for each modality, incurring huge computational overheads.
A highly efficient and flexible architecture is required to fuse an arbitrary number of modalities without modifying the underlying diffusion network.
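
The entropy inequality stated above follows from a standard identity. As a short sketch, writing \(H(\mathbf{x} \mid \cdot)\) for the conditional entropy (the quantity denoted \(H(p(\mathbf{x}|\cdot))\) above), the gap between the two entropies is a conditional mutual information, which is never negative:

\[H(\mathbf{x} \mid \mathbf{x}_{LR}) - H(\mathbf{x} \mid \mathbf{x}_{LR}, m) = I(\mathbf{x}; m \mid \mathbf{x}_{LR}) \geq 0\]

Equality holds exactly when \(m\) is conditionally independent of \(\mathbf{x}\) given \(\mathbf{x}_{LR}\), that is, when the auxiliary modality carries no information about the high-resolution image beyond what the low-resolution input already provides.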

Method¶

Overall Architecture¶

MMSR is constructed based on a pre-trained text-to-image diffusion model (sharing the architecture of Stable Diffusion v2). During inference, four modalities are first extracted from the low-resolution image: textual descriptions generated by Gemini Flash, depth maps estimated by Depth Anything, semantic segmentation maps from Mask2Former, and edges extracted via Canny. All modalities are encoded into unified tokens, then compressed into fixed-length latent tokens via the Multimodal Latent Connector, which serve as cross-attention conditions for the diffusion model. The low-resolution image provides additional conditioning through concatenation (similar to InstructPix2Pix).

Key Designs¶

1. Token-wise Multimodal Encoding
   - Function: Encodes heterogeneous modalities into a unified token sequence without altering the diffusion network architecture.
   - Mechanism: Employs a pre-trained VQGAN image tokenizer to encode depth/segmentation/edges into a \(16 \times 16\) discrete token sequence (feature dimension 256, codebook size 1024). Discrete tokens preserve modal-specific details better than continuous tokens, avoiding reconstruction artifacts. The token sequence is concatenated with text embeddings for cross-attention.
   - Design Motivation: Discrete quantization outperforms continuous representations in preserving modality-specific information, while the unified token format allows flexible addition or removal of modalities. A learnable null token \(m_\emptyset\) is introduced to represent missing modalities (randomly substituted with a 10% probability during training) to enhance robustness.
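
The token bookkeeping in the encoding step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the string placeholders stand in for real VQGAN indices and text embeddings, `build_condition` and `NULL` are hypothetical names, and two details are assumptions (each modality is dropped independently, and a dropped modality is replaced by null tokens of the same length).

```python
import random

NULL = "<null>"            # placeholder for the learnable null token (hypothetical name)
TOKENS_PER_MAP = 16 * 16   # one 16x16 grid of discrete tokens per spatial modality
TEXT_TOKENS = 77           # text length implied by the 845-token total in this summary


def build_condition(modalities, drop_prob=0.10, rng=random):
    """Concatenate per-modality token lists into one sequence.

    Assumption: each modality is dropped independently with probability
    drop_prob and replaced by NULL tokens of the same length.
    """
    sequence = []
    for tokens in modalities.values():
        if rng.random() < drop_prob:
            tokens = [NULL] * len(tokens)
        sequence.extend(tokens)
    return sequence


example_modalities = {
    "depth": [f"depth_{i}" for i in range(TOKENS_PER_MAP)],
    "segmentation": [f"seg_{i}" for i in range(TOKENS_PER_MAP)],
    "edge": [f"edge_{i}" for i in range(TOKENS_PER_MAP)],
    "text": [f"text_{i}" for i in range(TEXT_TOKENS)],
}

sequence = build_condition(example_modalities, rng=random.Random(0))
print(len(sequence))  # 845 = 256 * 3 + 77, whether or not a modality was dropped
```

Because every modality ends up as a flat list of tokens, adding or removing one only changes the dictionary passed in, which is the flexibility the unified token format is meant to provide.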

2. Multimodal Latent Connector (MMLC)
   - Function: Efficiently compresses long multimodal token sequences, reducing the computational complexity of cross-attention in the diffusion model.
   - Mechanism: Uses a set of 128 learnable latent tokens to extract key information from the complete multimodal sequence (\(256 \times 3 + 77 = 845\) tokens) through cross-attention. The latent tokens are then further integrated via self-attention and output as fixed-length conditioning tokens.
   - Design Motivation: Direct cross-attention with the complete multimodal sequence has a complexity of \(\mathcal{O}(M^2)\), while MMLC reduces it to \(\mathcal{O}(MN)\) with \(N \ll M\), which is linear in the sequence length \(M\) for a fixed number of latent tokens \(N\). Ablation studies demonstrate that MMLC not only improves efficiency but also reduces hallucination artifacts.
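
The compression step can be illustrated with a small, dependency-free Python sketch. It is a simplification under stated assumptions rather than the paper's module: there are no learned query/key/value projections, keys double as values, the latents are random instead of learned, and the feature size `DIM` is a toy value. The names `attend` and `latent_connector` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention where keys also act as values (no projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(logits)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)])
    return out


def latent_connector(latents, tokens):
    """N latents read from M tokens, then exchange information among themselves."""
    gathered = attend(latents, tokens)    # cross-attention: N x M score matrix
    return attend(gathered, gathered)     # self-attention:  N x N score matrix


rng = random.Random(0)
DIM = 4  # toy feature size (assumption, chosen only to keep the example fast)
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256 * 3 + 77)]
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(128)]

condition = latent_connector(latents, tokens)
cross_scores = len(latents) * len(tokens)   # N * M
full_scores = len(tokens) * len(tokens)     # M * M
print(len(condition), cross_scores, full_scores)
```

With \(N = 128\) and \(M = 845\), the cross-attention step computes 108,160 scores, compared with 714,025 for attention over the full sequence, and the output length stays at 128 no matter how many modality tokens are fed in.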

3. Multimodal Classifier-Free Guidance (m-CFG)
   - Function: Suppresses hallucinations and spurious details under high guidance scales, improving the balance between perceptual quality and fidelity.
   - Mechanism: The negative guidance of traditional CFG utilizes only the empty text token, resulting in weaker negative constraints. m-CFG employs multimodal latent tokens in both the positive and negative generation pathways: \(\tilde{\epsilon}(\mathbf{z}_t, c, m) = (1+w)\epsilon(\mathbf{z}_t, c, \text{pos}, m) - w\epsilon(\mathbf{z}_t, c, \text{neg}, m)\).
   - Design Motivation: Multimodal information strengthens the constraint power of negative guidance, effectively suppressing color shifts and erroneous textures at high guidance scales (10-14), where traditional CFG suffers severe performance degradation.
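
The m-CFG combination rule above is a simple element-wise operation once the two noise predictions are available. The sketch below assumes both predictions have already been computed by a denoiser that received the multimodal tokens \(m\) in each pathway. Noise tensors are flattened to plain lists, and `guided_noise` is a hypothetical helper name.

```python
def guided_noise(eps_pos, eps_neg, w):
    """Return (1 + w) * eps_pos - w * eps_neg, element-wise.

    eps_pos: prediction from the positive pathway (positive prompt plus modalities m)
    eps_neg: prediction from the negative pathway (negative prompt plus the same m)
    w:       guidance scale
    """
    return [(1.0 + w) * p - w * n for p, n in zip(eps_pos, eps_neg)]


# Toy values, not taken from the paper.
eps_pos = [0.5, -0.2]
eps_neg = [0.3, 0.1]

print(guided_noise(eps_pos, eps_neg, w=0.0))  # w = 0 returns the positive prediction unchanged
print(guided_noise(eps_pos, eps_neg, w=2.0))  # pushes further away from the negative prediction
```

The only difference from standard CFG in this sketch is where `eps_neg` comes from: the formula is unchanged, but the negative pathway is also conditioned on \(m\) instead of on an empty text token alone.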

Loss & Training¶

The standard training loss of diffusion models is adopted, i.e., the mean squared error between the predicted and ground-truth noise:

\[\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, t, \epsilon}\left[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, \mathbf{x}_{LR}, m, t)\|^2\right]\]
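
One Monte Carlo sample of this objective can be sketched as follows. It is a toy illustration under several assumptions: latents are flat Python lists, \(\bar{\alpha}_t\) is passed in directly as `alpha_bar`, and the "model" is a stub that ignores \(\mathbf{x}_{LR}\), \(m\), and \(t\), whereas the real network takes them as inputs. The loss is averaged over elements, which differs from the squared norm in the formula only by a constant factor.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Forward process: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def noise_mse(eps, eps_pred):
    """Mean squared error between the true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def zero_predictor(z_t):
    """Stub standing in for an untrained network: always predicts zero noise."""
    return [0.0] * len(z_t)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]   # clean latent (toy values)
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]     # sampled Gaussian noise
z_t = add_noise(x0, eps, alpha_bar=0.7)            # noisy latent fed to the model

loss_untrained = noise_mse(eps, zero_predictor(z_t))
loss_perfect = noise_mse(eps, eps)                 # a perfect prediction gives zero loss
print(loss_untrained, loss_perfect)
```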

Training uses the LSDIR and DIV2K datasets, with low-resolution images generated via the RealESRGAN degradation strategy.

Key Experimental Results¶

Main Results¶

Method	LPIPS↓	DISTS↓	NIQE↓	FID↓	MUSIQ↑	CLIPIQA↑
R-ESRGAN	0.3868	0.2601	4.92	53.46	58.64	0.5424
StableSR	0.4055	0.2542	4.66	36.57	62.95	0.6486
SeeSR	0.3843	0.2257	4.93	31.93	68.33	0.6946
MMSR	0.3707	0.2071	4.25	29.35	70.06	0.7164

Results on the DIV2K-Val-3k 512×512 benchmark. MMSR achieves the best score on every listed metric.

Ablation Study¶

Component	MUSIQ↑	NIQE↓	DISTS↓	LPIPS↓
w/o MMLC	69.69	3.48	0.1781	0.3929
w. MMLC	72.31	3.42	0.1492	0.2810

Results on DIV2K-Val-100 at 1024p. Adding MMLC improves all four metrics.

Guidance Type	LPIPS@w=2	LPIPS@w=10	LPIPS@w=14
cfg	0.3239	0.4491	0.5064
\(m_\emptyset\)-cfg	0.2815	0.4803	0.5493
m-cfg	0.2810	0.3471	0.3772

m-CFG significantly suppresses LPIPS degradation at high guidance scales
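
To make the size of the degradation concrete, the snippet below computes how much LPIPS rises between \(w=2\) and \(w=14\) for each guidance type, using only the numbers in the table above. The dictionary keys are just labels for the three table rows.

```python
# LPIPS (lower is better) at guidance scales 2, 10 and 14, copied from the table.
lpips = {
    "cfg": {2: 0.3239, 10: 0.4491, 14: 0.5064},
    "m_null-cfg": {2: 0.2815, 10: 0.4803, 14: 0.5493},
    "m-cfg": {2: 0.2810, 10: 0.3471, 14: 0.3772},
}

# Increase in LPIPS when moving from w = 2 to w = 14.
degradation = {name: round(scores[14] - scores[2], 4) for name, scores in lpips.items()}

for name, delta in degradation.items():
    print(f"{name}: +{delta:.4f}")

most_stable = min(degradation, key=degradation.get)
print(most_stable)
```

The increase is 0.1825 for cfg, 0.2678 for \(m_\emptyset\)-cfg, and 0.0962 for m-cfg, so m-CFG degrades by roughly half as much as standard CFG over this range.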

Key Findings¶

Depth information primarily improves perceptual quality (MUSIQ), while segmentation and edge maps contribute more to preserving identity consistency (DISTS).
The default multimodal setting achieves the best balance between perceptual quality and identity preservation.
By adjusting the attention temperature \(\delta \in [0.4, 10]\) of each modality, fine-grained control can be achieved: reducing depth temperature enhances the depth-of-field effect, lowering segmentation temperature highlights specific object features, and decreasing edge temperature reinforces detail sharpness.

Highlights & Insights¶

Motivation from an Information-Theoretic Perspective: Uses the non-negativity of conditional mutual information to show that conditioning on additional modalities cannot increase uncertainty in super-resolution, which gives the multimodal design a clear theoretical grounding.
Unified Token Representation: Utilizes VQGAN to unify heterogeneous modalities into discrete tokens, avoiding the high cost of duplicating networks for each modality in traditional methods.
Modality-Level Controllability: For the first time in super-resolution tasks, independent and continuous adjustment of each modality's influence is achieved, opening up new avenues for user interaction.

Limitations & Future Work¶

Extracting the multimodal information introduces computational overhead (captioning with Gemini Flash runs at only 0.34 img/s), which becomes a bottleneck for inference speed.
When the low-resolution input is severely degraded, the extracted multimodal information itself might be inaccurate (such as distorted edges or erroneous segmentations).
Future work could explore faster vision-language models and more robust modality extraction modules.

Related Work & Insights¶

Compared with ControlNet and IP-Adapter, this method replaces network replication with unified token encoding, offering greater efficiency and flexibility.
Text-driven methods such as SeeSR and PASD only utilize a single textual modality, lacking spatial guidance capabilities.
The concept of multimodal CFG can be extended to other conditional generation tasks to strengthen negative guidance constraints.

Rating¶

⭐⭐⭐⭐ — The method design is elegant, derived from a clear and powerful information-theoretic motivation, and the multimodal fusion architecture is efficient and practical. The experiments comprehensively cover both synthetic and real-world scenarios with thorough ablation studies. Modality-level controllability is a valuable new feature, though computational overhead and dependence on the quality of predicted modalities remain constraints for practical applications.