ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts
Conference: NeurIPS 2025 arXiv: 2503.23356 Code: https://github.com/Linfeng-Tang/ControlFusion Area: Image Fusion / Multimodal Keywords: Infrared-visible fusion, degradation restoration, language-vision prompts, CLIP, controllable fusion
TL;DR
This paper proposes ControlFusion, a controllable infrared-visible image fusion framework based on language-vision degradation prompts. It employs a physics-driven degradation imaging model to simulate compound degradations, and uses a prompt-modulated network to perform dynamic restoration and fusion, achieving comprehensive state-of-the-art performance under both real-world and compound degradation scenarios.
Background & Motivation
Background: Infrared-visible image fusion (IVIF) integrates thermal information and texture details, with broad applications in security, military detection, and autonomous driving. Existing methods encompass CNN/AE/GAN/Transformer/diffusion model architectures.
Limitations of Prior Work:
- Degradation-robust methods rely on simple data construction strategies, resulting in a domain gap between synthetic and real images.
- Existing methods handle only single-type degradations and cannot address the compound degradations of real-world scenarios (e.g., simultaneous low-light, noise, and blur).
- Degradation severity is not modeled, causing sharp performance drops as degradation intensifies; personalized user requirements also cannot be accommodated.
Key Challenge: Real-world degradations vary widely in type and severity combinations, making fixed fusion networks inflexible.
Key Insight: Construct a physics-based degradation imaging model to reduce the synthetic-real domain gap; use language prompts for explicit modeling of degradation type and severity; and employ a visual adapter for automatic degradation awareness.
Core Idea: Language prompts specify degradation type/severity + visual adapter automatically perceives degradation → dynamic modulation of feature restoration and fusion.
Method
Overall Architecture
Two-stage training: Stage I aligns text embeddings with visual embeddings (training the visual adapter); Stage II trains the end-to-end restoration-fusion network. At inference, two modes are supported: user-provided text prompts to specify degradation, or automatic degradation perception via the visual adapter.
Key Designs
- Physics-Driven Degradation Imaging Model:
  - Function: Simulates degradation for infrared and visible images separately, constructing the DDL-12 training dataset (12 degradation types × 4 severity levels, approximately 48,000 training pairs).
  - Mechanism: \(D_m = \mathcal{P}_s(\mathcal{P}_w(\mathcal{P}_i(I_m)))\), employing three nested degradation layers: illuminance degradation (Retinex theory, \(\gamma \in [0.5, 3]\)), weather degradation (atmospheric scattering model, covering rain and fog), and sensor degradation (noise + motion blur + contrast reduction). A minimal code sketch follows this item.
  - Design Motivation: Physics-based degradation simulation is closer to real-world scenarios than random degradation. Infrared and visible modalities exhibit different degradation types (infrared primarily suffers from stripe noise and low contrast; visible images mainly suffer from low-light/overexposure/rain/fog), necessitating separate modeling.
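Below is a minimal NumPy sketch of the nested pipeline \(\mathcal{P}_s(\mathcal{P}_w(\mathcal{P}_i(\cdot)))\). The concrete operators (gamma curve, flat depth map, uniform motion blur) and all parameter values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def degrade_illumination(img, gamma=2.0):
    """P_i: Retinex-style illuminance degradation via gamma adjustment (gamma in [0.5, 3])."""
    return np.clip(img, 0.0, 1.0) ** gamma

def degrade_weather(img, beta=1.0, airlight=0.8, depth=None):
    """P_w: atmospheric scattering model  I = J * t + A * (1 - t),  t = exp(-beta * d)."""
    if depth is None:
        depth = np.ones(img.shape[:2])            # flat depth map, purely for illustration
    t = np.exp(-beta * depth)[..., None]          # transmission map, broadcast over channels
    return img * t + airlight * (1.0 - t)

def degrade_sensor(img, sigma=0.05, blur_len=5, contrast=0.7):
    """P_s: horizontal motion blur + Gaussian noise + contrast reduction."""
    blurred = uniform_filter1d(img, size=blur_len, axis=1)   # crude motion-blur surrogate
    noisy = blurred + np.random.normal(0.0, sigma, img.shape)
    mean = noisy.mean()
    return np.clip(mean + contrast * (noisy - mean), 0.0, 1.0)

def degrade(img):
    """Compound degradation D_m = P_s(P_w(P_i(I_m)))."""
    return degrade_sensor(degrade_weather(degrade_illumination(img)))

# Example: degraded = degrade(np.random.rand(256, 256, 3))
```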
- Spatial-Frequency Collaborative Visual Adapter (SFVA):
  - Function: Automatically extracts degradation descriptor embeddings from degraded images, replacing manual text input.
  - Mechanism: A frequency branch applies the FFT to extract frequency-domain degradation priors (\(F_{fre}^m(u,v) = \sum_{x,y} D_m(x,y)\, e^{-j2\pi(\frac{ux}{W} + \frac{vy}{H})}\)); a spatial branch employs a CNN to extract spatial features; the two branches are concatenated and linearly projected to obtain the visual embedding \(p_{vis}\) (see the sketch after this item).
  - Design Motivation: Different degradations exhibit distinct frequency-domain characteristics (e.g., noise concentrates in high frequencies; blur attenuates high frequencies), which the frequency branch effectively captures. MSE and cosine-similarity losses ensure semantic alignment between \(p_{vis}\) and \(p_{text}\).
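A minimal PyTorch sketch of the two-branch SFVA design, assuming single-convolution branches, global average pooling, and a 512-dimensional embedding space; only the FFT-prior/CNN split, concatenation, and linear projection follow the description above.

```python
import torch
import torch.nn as nn

class SFVA(nn.Module):
    """Spatial-frequency collaborative visual adapter (minimal sketch).
    Channel widths, pooling, and projection size are illustrative assumptions."""
    def __init__(self, in_ch=3, feat_dim=64, embed_dim=512):
        super().__init__()
        # Frequency branch: CNN over the FFT amplitude spectrum (degradation prior)
        self.freq_branch = nn.Sequential(
            nn.Conv2d(in_ch, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Spatial branch: CNN over the degraded image itself
        self.spatial_branch = nn.Sequential(
            nn.Conv2d(in_ch, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Linear projection into the shared language-vision embedding space
        self.proj = nn.Linear(2 * feat_dim, embed_dim)

    def forward(self, d_m):                        # d_m: degraded image, (B, C, H, W)
        amp = torch.abs(torch.fft.fft2(d_m))       # frequency-domain magnitude
        f_fre = self.freq_branch(amp).flatten(1)   # (B, feat_dim)
        f_spa = self.spatial_branch(d_m).flatten(1)
        return self.proj(torch.cat([f_fre, f_spa], dim=1))   # p_vis, (B, embed_dim)

# p_vis = SFVA()(torch.rand(1, 3, 128, 128))   # aligned with p_text during Stage I
```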
- Prompt-Modulated Module (PMM):
  - Function: Dynamically modulates fusion features according to degradation prompts.
  - Mechanism: An MLP generates a scaling parameter \(\gamma_p\) and a shift parameter \(\beta_p\) from prompt \(p\): \(\hat{F}_f = (1 + \gamma_p) \odot F_f + \beta_p\), implementing FiLM-style feature modulation (sketched after this item).
  - Design Motivation: Different degradations require distinct feature enhancement strategies; learnable affine transformations enable conditional restoration.
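A sketch of the FiLM-style modulation, assuming a two-layer MLP maps the prompt embedding to channel-wise \(\gamma_p\) and \(\beta_p\); the hidden width and prompt dimension are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class PromptModulatedModule(nn.Module):
    """FiLM-style prompt modulation: F_hat = (1 + gamma_p) * F_f + beta_p.
    MLP depth/width are assumptions; only the modulation rule comes from the paper."""
    def __init__(self, prompt_dim=512, feat_ch=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(prompt_dim, 2 * feat_ch), nn.ReLU(),
            nn.Linear(2 * feat_ch, 2 * feat_ch))

    def forward(self, f_f, prompt):               # f_f: (B, C, H, W), prompt: (B, prompt_dim)
        gamma_p, beta_p = self.mlp(prompt).chunk(2, dim=-1)   # each (B, C)
        gamma_p = gamma_p[..., None, None]                    # broadcast over H, W
        beta_p = beta_p[..., None, None]
        return (1 + gamma_p) * f_f + beta_p

# f_hat = PromptModulatedModule()(torch.rand(2, 64, 32, 32), torch.rand(2, 512))
```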
- Cross-Modal Cross-Attention Fusion Layer:
  - Function: Exchanges Query vectors between modalities to facilitate cross-modal feature interaction.
  - Mechanism: \(F_f^{ir} = \text{softmax}\left(\frac{Q_{vi} K_{ir}^{\top}}{\sqrt{d_k}}\right) V_{ir}\), using the visible-branch Query to retrieve infrared Key-Value pairs, and vice versa (see the sketch after this item).
  - Design Motivation: Cross-modal Query exchange promotes spatially aligned fusion of complementary information.
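A sketch of one direction of the cross-attention, assuming tokenized features of shape (B, N, d_k); multi-head splitting and output projections are omitted.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(q_vi, k_ir, v_ir):
    """One direction of the fusion layer: the visible-branch Query attends to
    infrared Key/Value tokens, F_f^ir = softmax(Q_vi K_ir^T / sqrt(d_k)) V_ir."""
    d_k = q_vi.shape[-1]
    attn = F.softmax(q_vi @ k_ir.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return attn @ v_ir

# The symmetric direction simply swaps roles: cross_modal_attention(q_ir, k_vi, v_vi)
```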
Loss & Training
- Stage I: \(\mathcal{L}_I = \lambda_1 \|p_{vis} - p_{text}\|^2 + \lambda_2 (1 - \cos(p_{vis}, p_{text}))\) (see the sketch after this list)
- Stage II: Weighted combination of intensity loss + SSIM loss + maximum gradient loss + color consistency loss.
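A sketch of the Stage I alignment objective in PyTorch; \(\lambda_1\) and \(\lambda_2\) are placeholder weights rather than the paper's reported values. The Stage II fusion losses (intensity, SSIM, maximum gradient, color consistency) are omitted here.

```python
import torch
import torch.nn.functional as F

def stage1_alignment_loss(p_vis, p_text, lam1=1.0, lam2=1.0):
    """L_I = lam1 * ||p_vis - p_text||^2 + lam2 * (1 - cos(p_vis, p_text)).
    lam1/lam2 are placeholders, not the paper's reported weights."""
    mse = F.mse_loss(p_vis, p_text)
    cos = F.cosine_similarity(p_vis, p_text, dim=-1).mean()
    return lam1 * mse + lam2 * (1.0 - cos)

# loss = stage1_alignment_loss(torch.rand(8, 512), torch.rand(8, 512))
```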
Key Experimental Results
Main Results (Standard Fusion Benchmarks)
| Method | MSRS (VIF) | LLVIP (VIF) | RoadScene (VIF) | FMB (VIF) |
|---|---|---|---|---|
| Text-DiFuse | 0.850 | 0.883 | 0.683 | 0.793 |
| ControlFusion | 0.927 | 0.968 | 0.817 | 0.872 |
Performance under Degradation (CLIP-IQA / MUSIQ Metrics)
ControlFusion achieves the best or second-best results across all degradation types (blur, rain, low-light, overexposure, noise, stripe noise, low contrast) and compound degradations. The advantage is particularly pronounced under compound degradations (e.g., simultaneous low-light, noise, and rain).
Ablation Study
| Configuration | EN | SD | VIF | Qabf |
|---|---|---|---|---|
| Full model | Best | Best | Best | Best |
| w/o SFVA (text only) | Significant drop | - | - | - |
| w/o PMM | Notable drop | - | - | - |
| w/o physics degradation model | Poor real-world generalization | - | - | - |
Key Findings
- Visual embeddings generated by SFVA are highly aligned with manual text embeddings, enabling fully automated deployment.
- The physics-driven degradation model significantly reduces the synthetic-real domain gap.
- Performance remains stable across all 4 severity levels, without sharp degradation under severe conditions.
Highlights & Insights
- Dual-channel language-vision degradation description paradigm: Text prompts enable user controllability while the visual adapter enables automation, with both channels semantically aligned — this paradigm is transferable to any conditional image restoration task.
- FiLM-style modulation for degradation adaptation: Simple affine transformations achieve powerful conditional effects, avoiding the need to train dedicated models for each degradation type.
- Physics-driven degradation simulation: The combination of Retinex, atmospheric scattering, and sensor noise modeling is more reliable than purely data-driven approaches.
Limitations & Future Work
- Text prompt templates are relatively fixed, limiting flexibility.
- The discrete 4-level severity quantization may lack sufficient granularity.
- SFVA's degradation awareness depends on the alignment quality achieved in Stage I.
- Generalizability to other multimodal fusion tasks (e.g., MRI-CT, multispectral) has not been validated.
Related Work & Insights
- vs. Text-IF: Text-IF requires manually crafted text prompts for each scene; ControlFusion achieves automation via SFVA.
- vs. Text-DiFuse: Text-DiFuse is diffusion-based but does not address compound degradations; ControlFusion explicitly models combinations of multiple degradation types.
- vs. DA-CLIP: DA-CLIP achieves degradation awareness by fine-tuning CLIP, but targets natural images only; ControlFusion is designed specifically for multimodal imagery.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of dual-channel language-vision degradation control and a physics-based degradation model is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 datasets, 7 degradation types, compound degradations, and ablation studies — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-organized with complete formulations.
- Value: ⭐⭐⭐⭐ Practically significant for industrial deployment; handling real-world degradations is a critical pain point.