ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts
Conference: NeurIPS 2025 arXiv: 2503.23356 Code: https://github.com/Linfeng-Tang/ControlFusion Area: Image Fusion / Multimodal Keywords: Infrared-visible fusion, degradation restoration, language-vision prompts, CLIP, controllable fusion
TL;DR
This paper proposes ControlFusion, a controllable infrared-visible image fusion framework based on language-vision degradation prompts. It employs a physics-driven degradation imaging model to simulate compound degradations, and uses a prompt-modulated network to perform dynamic restoration and fusion, achieving comprehensive state-of-the-art performance under both real-world and compound degradation scenarios.
Background & Motivation
Background: Infrared-visible image fusion (IVIF) integrates thermal information and texture details, with broad applications in security, military detection, and autonomous driving. Existing methods encompass CNN/AE/GAN/Transformer/diffusion model architectures.
Limitations of Prior Work:
- Degradation-robust methods rely on simple data construction strategies, resulting in a domain gap between synthetic and real images.
- Existing methods handle only single-type degradations and cannot address the compound degradations of real-world scenarios (e.g., simultaneous low-light, noise, and blur).
- Degradation severity is not modeled, causing sharp performance drops as degradation intensifies; personalized user requirements also cannot be accommodated.
Key Challenge: Real-world degradations vary widely in type and severity combinations, making fixed fusion networks inflexible.
Key Insight: Construct a physics-based degradation imaging model to reduce the synthetic-real domain gap; use language prompts for explicit modeling of degradation type and severity; and employ a visual adapter for automatic degradation awareness.
Core Idea: Language prompts specify degradation type/severity + visual adapter automatically perceives degradation → dynamic modulation of feature restoration and fusion.
Method
Overall Architecture
Two-stage training: Stage I aligns text embeddings with visual embeddings (training the visual adapter); Stage II trains the end-to-end restoration-fusion network. At inference, two modes are supported: user-provided text prompts to specify degradation, or automatic degradation perception via the visual adapter.
Key Designs
- Physics-Driven Degradation Imaging Model:
  - Function: Simulates degradation for infrared and visible images separately, constructing the DDL-12 training dataset (12 degradation types × 4 severity levels, approximately 48,000 training pairs).
  - Mechanism: \(D_m = \mathcal{P}_s(\mathcal{P}_w(\mathcal{P}_i(I_m)))\), employing three nested degradation layers: illuminance degradation (Retinex theory, \(\gamma \in [0.5, 3]\)), weather degradation (atmospheric scattering model, covering rain and fog), and sensor degradation (noise + motion blur + contrast reduction). A minimal code sketch follows this item.
  - Design Motivation: Physics-based degradation simulation is closer to real-world scenarios than random degradation. Infrared and visible modalities exhibit different degradation types (infrared primarily suffers from stripe noise and low contrast; visible images mainly suffer from low-light/overexposure/rain/fog), necessitating separate modeling.
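Below is a minimal NumPy sketch of the nested pipeline \(\mathcal{P}_s(\mathcal{P}_w(\mathcal{P}_i(\cdot)))\). The concrete operators (gamma curve, flat depth map, uniform motion blur) and all parameter values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def degrade_illumination(img, gamma=2.0):
    """P_i: Retinex-style illuminance degradation via gamma adjustment (gamma in [0.5, 3])."""
    return np.clip(img, 0.0, 1.0) ** gamma

def degrade_weather(img, beta=1.0, airlight=0.8, depth=None):
    """P_w: atmospheric scattering model  I = J * t + A * (1 - t),  t = exp(-beta * d)."""
    if depth is None:
        depth = np.ones(img.shape[:2])            # flat depth map, purely for illustration
    t = np.exp(-beta * depth)[..., None]          # transmission map, broadcast over channels
    return img * t + airlight * (1.0 - t)

def degrade_sensor(img, sigma=0.05, blur_len=5, contrast=0.7):
    """P_s: horizontal motion blur + Gaussian noise + contrast reduction."""
    blurred = uniform_filter1d(img, size=blur_len, axis=1)   # crude motion-blur surrogate
    noisy = blurred + np.random.normal(0.0, sigma, img.shape)
    mean = noisy.mean()
    return np.clip(mean + contrast * (noisy - mean), 0.0, 1.0)

def degrade(img):
    """Compound degradation D_m = P_s(P_w(P_i(I_m)))."""
    return degrade_sensor(degrade_weather(degrade_illumination(img)))

# Example: degraded = degrade(np.random.rand(256, 256, 3))
```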
- Spatial-Frequency Collaborative Visual Adapter (SFVA):
  - Function: Automatically extracts degradation descriptor embeddings from degraded images, replacing manual text input.
  - Mechanism: A frequency branch applies the FFT to extract frequency-domain degradation priors (\(F_{fre}^m(u,v) = \sum_{x,y} D_m(x,y)\, e^{-j2\pi(\frac{ux}{W} + \frac{vy}{H})}\)); a spatial branch employs a CNN to extract spatial features; the two branches are concatenated and linearly projected to obtain the visual embedding \(p_{vis}\) (see the sketch after this item).
  - Design Motivation: Different degradations exhibit distinct frequency-domain characteristics (e.g., noise concentrates in high frequencies; blur attenuates high frequencies), which the frequency branch effectively captures. MSE and cosine-similarity losses ensure semantic alignment between \(p_{vis}\) and \(p_{text}\).
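A minimal PyTorch sketch of the two-branch SFVA design, assuming single-convolution branches, global average pooling, and a 512-dimensional embedding space; only the FFT-prior/CNN split, concatenation, and linear projection follow the description above.

```python
import torch
import torch.nn as nn

class SFVA(nn.Module):
    """Spatial-frequency collaborative visual adapter (minimal sketch).
    Channel widths, pooling, and projection size are illustrative assumptions."""
    def __init__(self, in_ch=3, feat_dim=64, embed_dim=512):
        super().__init__()
        # Frequency branch: CNN over the FFT amplitude spectrum (degradation prior)
        self.freq_branch = nn.Sequential(
            nn.Conv2d(in_ch, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Spatial branch: CNN over the degraded image itself
        self.spatial_branch = nn.Sequential(
            nn.Conv2d(in_ch, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Linear projection into the shared language-vision embedding space
        self.proj = nn.Linear(2 * feat_dim, embed_dim)

    def forward(self, d_m):                        # d_m: degraded image, (B, C, H, W)
        amp = torch.abs(torch.fft.fft2(d_m))       # frequency-domain magnitude
        f_fre = self.freq_branch(amp).flatten(1)   # (B, feat_dim)
        f_spa = self.spatial_branch(d_m).flatten(1)
        return self.proj(torch.cat([f_fre, f_spa], dim=1))   # p_vis, (B, embed_dim)

# p_vis = SFVA()(torch.rand(1, 3, 128, 128))   # aligned with p_text during Stage I
```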
- Prompt-Modulated Module (PMM):
  - Function: Dynamically modulates fusion features according to degradation prompts.
  - Mechanism: An MLP generates a scaling parameter \(\gamma_p\) and a shift parameter \(\beta_p\) from prompt \(p\): \(\hat{F}_f = (1 + \gamma_p) \odot F_f + \beta_p\), implementing FiLM-style feature modulation (sketched after this item).
  - Design Motivation: Different degradations require distinct feature enhancement strategies; learnable affine transformations enable conditional restoration.
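A sketch of the FiLM-style modulation, assuming a two-layer MLP maps the prompt embedding to channel-wise \(\gamma_p\) and \(\beta_p\); the hidden width and prompt dimension are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class PromptModulatedModule(nn.Module):
    """FiLM-style prompt modulation: F_hat = (1 + gamma_p) * F_f + beta_p.
    MLP depth/width are assumptions; only the modulation rule comes from the paper."""
    def __init__(self, prompt_dim=512, feat_ch=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(prompt_dim, 2 * feat_ch), nn.ReLU(),
            nn.Linear(2 * feat_ch, 2 * feat_ch))

    def forward(self, f_f, prompt):               # f_f: (B, C, H, W), prompt: (B, prompt_dim)
        gamma_p, beta_p = self.mlp(prompt).chunk(2, dim=-1)   # each (B, C)
        gamma_p = gamma_p[..., None, None]                    # broadcast over H, W
        beta_p = beta_p[..., None, None]
        return (1 + gamma_p) * f_f + beta_p

# f_hat = PromptModulatedModule()(torch.rand(2, 64, 32, 32), torch.rand(2, 512))
```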
- Cross-Modal Cross-Attention Fusion Layer:
  - Function: Exchanges Query vectors between modalities to facilitate cross-modal feature interaction.
  - Mechanism: \(F_f^{ir} = \text{softmax}\left(\frac{Q_{vi} K_{ir}^{\top}}{\sqrt{d_k}}\right) V_{ir}\), using the visible-branch Query to retrieve infrared Key-Value pairs, and vice versa (see the sketch after this item).
  - Design Motivation: Cross-modal Query exchange promotes spatially aligned fusion of complementary information.
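A sketch of one direction of the cross-attention, assuming tokenized features of shape (B, N, d_k); multi-head splitting and output projections are omitted.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(q_vi, k_ir, v_ir):
    """One direction of the fusion layer: the visible-branch Query attends to
    infrared Key/Value tokens, F_f^ir = softmax(Q_vi K_ir^T / sqrt(d_k)) V_ir."""
    d_k = q_vi.shape[-1]
    attn = F.softmax(q_vi @ k_ir.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v_ir

# The symmetric direction simply swaps roles: cross_modal_attention(q_ir, k_vi, v_vi)
```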
Loss & Training
- Stage I: \(\mathcal{L}_I = \lambda_1 \|p_{vis} - p_{text}\|^2 + \lambda_2 (1 - \cos(p_{vis}, p_{text}))\) (see the sketch after this list)
- Stage II: Weighted combination of intensity loss + SSIM loss + maximum gradient loss + color consistency loss.
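A sketch of the Stage I alignment objective in PyTorch; \(\lambda_1\) and \(\lambda_2\) are placeholder weights rather than the paper's reported values. The Stage II fusion losses (intensity, SSIM, maximum gradient, color consistency) are omitted here.

```python
import torch
import torch.nn.functional as F

def stage1_alignment_loss(p_vis, p_text, lam1=1.0, lam2=1.0):
    """L_I = lam1 * ||p_vis - p_text||^2 + lam2 * (1 - cos(p_vis, p_text)).
    lam1/lam2 are placeholders, not the paper's reported weights."""
    mse = F.mse_loss(p_vis, p_text)
    cos = F.cosine_similarity(p_vis, p_text, dim=-1).mean()
    return lam1 * mse + lam2 * (1.0 - cos)

# loss = stage1_alignment_loss(torch.rand(8, 512), torch.rand(8, 512))
```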
Key Experimental Results
Main Results (Standard Fusion Benchmarks)
| Method | MSRS (VIF) | LLVIP (VIF) | RoadScene (VIF) | FMB (VIF) |
|---|---|---|---|---|
| Text-DiFuse | 0.850 | 0.883 | 0.683 | 0.793 |
| ControlFusion | 0.927 | 0.968 | 0.817 | 0.872 |
Performance under Degradation (CLIP-IQA / MUSIQ Metrics)
ControlFusion achieves the best or second-best results across all degradation types (blur, rain, low-light, overexposure, noise, stripe noise, low contrast) and compound degradations. The advantage is particularly pronounced under compound degradations (e.g., simultaneous low-light, noise, and rain).
Ablation Study
| Configuration | EN | SD | VIF | Qabf |
|---|---|---|---|---|
| Full model | Best | Best | Best | Best |
| w/o SFVA (text only) | Significant drop | - | - | - |
| w/o PMM | Notable drop | - | - | - |
| w/o physics degradation model | Poor real-world generalization | - | - | - |
Key Findings
- Visual embeddings generated by SFVA are highly aligned with manual text embeddings, enabling fully automated deployment.
- The physics-driven degradation model significantly reduces the synthetic-real domain gap.
- Performance remains stable across all 4 severity levels, without sharp degradation under severe conditions.
Highlights & Insights
- Dual-channel language-vision degradation description paradigm: Text prompts enable user controllability while the visual adapter enables automation, with both channels semantically aligned — this paradigm is transferable to any conditional image restoration task.
- FiLM-style modulation for degradation adaptation: Simple affine transformations achieve powerful conditional effects, avoiding the need to train dedicated models for each degradation type.
- Physics-driven degradation simulation: The combination of Retinex, atmospheric scattering, and sensor noise modeling is more reliable than purely data-driven approaches.
Limitations & Future Work
- Text prompt templates are relatively fixed, limiting flexibility.
- The discrete 4-level severity quantization may lack sufficient granularity.
- SFVA's degradation awareness depends on the alignment quality achieved in Stage I.
- Generalizability to other multimodal fusion tasks (e.g., MRI-CT, multispectral) has not been validated.
Related Work & Insights
- vs. Text-IF: Text-IF requires manually crafted text prompts for each scene; ControlFusion achieves automation via SFVA.
- vs. Text-DiFuse: Text-DiFuse is diffusion-based but does not address compound degradations; ControlFusion explicitly models combinations of multiple degradation types.
- vs. DA-CLIP: DA-CLIP achieves degradation awareness by fine-tuning CLIP, but targets natural images only; ControlFusion is designed specifically for multimodal imagery.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of dual-channel language-vision degradation control and a physics-based degradation model is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 datasets, 7 degradation types, compound degradations, and ablation studies — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-organized with complete formulations.
- Value: ⭐⭐⭐⭐ Practically significant for industrial deployment; handling real-world degradations is a critical pain point.