# MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

- Conference: AAAI 2026
- arXiv: 2511.12525
- Code: https://github.com/doudou845133/MdaIF
- Area: Image Fusion / Adverse Weather Degradation
- Keywords: Infrared-Visible Fusion, Degradation-Aware, Mixture of Experts, Vision-Language Model, Channel Attention
## TL;DR
This paper proposes MdaIF, a framework that leverages a vision-language model (VLM) to extract degradation-aware semantic priors for guiding mixture-of-experts (MoE) routing and channel attention modulation, enabling one-stop infrared-visible image fusion across multiple degradation scenarios without requiring degradation-type annotations.
## Background & Motivation
Background: Infrared-visible image fusion (IVF) aims to integrate infrared thermal radiation information with visible-light texture details. Existing methods have evolved from CNN/GAN-based approaches to Transformers and diffusion models, but most assume high-quality visible images.
Limitations of Prior Work:
- Under adverse weather conditions (haze, rain, snow), visible images suffer severe degradation, making direct fusion ineffective.
- Cascaded pipelines (restoration followed by fusion) introduce feature misalignment and error accumulation.
- Existing degradation-aware fusion methods (Text-IF, MMAIF) rely on fixed degradation-type annotations as prompts and employ a single static network for all degradation conditions.
Key Challenge: Different degradation types (micron-scale water droplets in haze, millimeter-scale raindrops in rain, ice crystals in snow) correspond to fundamentally distinct atmospheric scattering models, which a fixed network architecture cannot effectively capture. For example, transmission maps effective for dehazing fail in deraining scenarios.
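For context, dehazing methods typically build on the standard atmospheric scattering (Koschmieder) model:

\[
I(x) = J(x)\,t(x) + A\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)},
\]

where \(I\) is the observed hazy image, \(J\) the clear scene radiance, \(A\) the global atmospheric light, and \(t\) the transmission map governed by the scattering coefficient \(\beta\) and scene depth \(d\). Rain streaks and snowflakes act as discrete occluders rather than a homogeneous scattering medium, which is why a transmission-map formulation does not transfer to deraining or desnowing.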
Goal: To adaptively handle infrared-degraded visible image fusion across multiple adverse weather conditions without relying on ground-truth degradation type labels.
Key Insight: Leveraging VLM scene-understanding capabilities to automatically identify degradation types and extract semantic priors, which then guide MoE-based expert selection for handling different degradations.
Core Idea: VLM provides degradation-aware semantic priors → semantic priors guide channel attention modulation via prototype decomposition → modulated features and semantic priors jointly guide MoE routing to select degradation-specific experts for fusion.
## Method

### Overall Architecture
MdaIF consists of four core modules:
- Encoder: Independently encodes infrared and degraded visible images.
- Degradation-aware Semantic Prior Extractor (DSPE): Extracts semantic priors from degraded visible images using the BLIP-2 VLM.
- Degradation-aware Channel Attention Module (DCAM): Performs degradation prototype decomposition and channel modulation using semantic priors.
- Degradation-aware Mixture of Experts (DMoE): Semantic-prior-guided expert routing combined with multi-expert fusion.
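A minimal PyTorch sketch of how these four modules might compose in a single forward pass; the module interfaces and variable names are assumptions for illustration, not the authors' implementation:

```python
import torch.nn as nn

class MdaIFSketch(nn.Module):
    """Illustrative composition of the four MdaIF modules (interfaces assumed)."""
    def __init__(self, enc_ir, enc_vi, dspe, dcam, dmoe, decoder):
        super().__init__()
        self.enc_ir, self.enc_vi = enc_ir, enc_vi          # independent encoders
        self.dspe, self.dcam, self.dmoe = dspe, dcam, dmoe
        self.decoder = decoder

    def forward(self, ir, vi_degraded):
        f_ir = self.enc_ir(ir)                   # infrared features
        f_vi = self.enc_vi(vi_degraded)          # degraded visible features
        s_prior = self.dspe(vi_degraded)         # VLM-derived semantic priors
        f_dcam = self.dcam(f_vi, f_ir, s_prior)  # prototype-based channel modulation
        f_fused = self.dmoe(f_dcam, s_prior)     # semantic-prior-guided expert fusion
        return self.decoder(f_fused)             # one-stop restored and fused image
```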
### Key Designs
- Degradation-aware Semantic Prior Extractor (DSPE) (sketch below):
  - Employs the pretrained BLIP-2 OPT 2.7B in VQA mode, taking the degraded visible image and an open-ended question prompt as input.
  - Extracts last hidden-layer features \(S_{org} \in \mathbb{R}^{S \times C_{org}}\) as raw semantic priors.
  - Compresses dimensionality via MLP + LayerNorm: \(S_{embed} = \mathcal{N}_{layer}(\Phi_m^I(S_{org}))\).
  - Re-weights token importance via self-attention to obtain refined semantic priors \(S_{prior}\).
  - The semantic priors comprise two components: \(S_{weather}\) (weather degradation knowledge) and \(S_{scene}\) (scene features).
  - Key distinction: rather than using the VLM solely as a degradation classifier, the method fully exploits its deep semantic scene understanding.
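A minimal sketch of the DSPE refinement path, assuming the raw BLIP-2 last-hidden-layer features \(S_{org}\) are precomputed (the 2560-dim input width matches OPT-2.7B's hidden size; the embedding width and head count are assumptions):

```python
import torch.nn as nn

class DSPESketch(nn.Module):
    """Compress raw VLM hidden states and re-weight tokens via self-attention."""
    def __init__(self, c_org=2560, c_embed=256, n_heads=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c_org, c_embed), nn.GELU(),
                                 nn.Linear(c_embed, c_embed))
        self.norm = nn.LayerNorm(c_embed)
        self.attn = nn.MultiheadAttention(c_embed, n_heads, batch_first=True)

    def forward(self, s_org):                  # s_org: (B, S, c_org) from BLIP-2
        s_embed = self.norm(self.mlp(s_org))   # S_embed = LayerNorm(MLP(S_org))
        # Self-attention re-weights token importance across the sequence.
        s_prior, _ = self.attn(s_embed, s_embed, s_embed)
        return s_prior                         # refined semantic priors S_prior
```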
- Degradation-aware Channel Attention Module (DCAM) (sketch below):
  - Concatenates encoded infrared and visible features along the channel dimension: \(F_{in} = \text{Cat}(F_{vi}, F_{ir})\).
  - Degradation prototype decomposition: maps the semantic priors through an MLP and a Sigmoid to produce activation scores \(s_K \in \mathbb{R}^K\) over \(K\) degradation prototypes.
  - Each degradation prototype \(k_i \in \mathbb{R}^C\) encodes channel-wise response intensity; the prototype matrix \(W_{proto} \in \mathbb{R}^{K \times C}\) is initialized with orthonormal rows.
  - Channel weight computation: \(w_c = \sigma\bigl(\sum_{i=1}^K s_{K_i} \cdot k_i\bigr) = \sigma(s_K W_{proto})\).
  - Final modulation with a residual connection: \(F_{dcam} = \mathcal{N}_{layer}(F_{in}) \odot \sigma(s_K W_{proto}) + F_{in}\).
  - Design motivation: different degradation types activate distinct prototype combinations, and each prototype favors different channel patterns, enabling degradation-adaptive feature enhancement.
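A minimal sketch of the prototype decomposition and channel modulation, following the formulas above; pooling \(S_{prior}\) into a single vector and all widths are assumptions:

```python
import torch
import torch.nn as nn

class DCAMSketch(nn.Module):
    """Degradation prototype decomposition + channel attention modulation."""
    def __init__(self, c, k=3, c_prior=256):    # c = channels after concatenation
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(c_prior, k), nn.Sigmoid())
        w = torch.empty(k, c)
        nn.init.orthogonal_(w)                  # orthonormal prototype rows
        self.w_proto = nn.Parameter(w)          # learnable K x C prototype matrix
        self.norm = nn.LayerNorm(c)

    def forward(self, f_vi, f_ir, s_prior):
        f_in = torch.cat([f_vi, f_ir], dim=1)      # F_in = Cat(F_vi, F_ir): (B, C, H, W)
        s_k = self.score_mlp(s_prior.mean(dim=1))  # activation scores s_K: (B, K)
        w_c = torch.sigmoid(s_k @ self.w_proto)    # channel weights sigma(s_K W_proto)
        f = self.norm(f_in.permute(0, 2, 3, 1))    # LayerNorm over the channel dim
        f = f * w_c[:, None, None, :]              # per-channel modulation
        return f.permute(0, 3, 1, 2) + f_in        # residual connection
```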
- Degradation-aware Mixture of Experts (DMoE) (sketch below):
  - Multiple expert networks, each specialized for a different degradation condition.
  - Routing strategy: the modulated features \(F_{dcam}\) interact with the semantic priors \(S_{prior}\) to establish task-specific routing.
  - \(S_{weather}\) enhances degradation texture features in the visible image; \(S_{scene}\) enhances target information in both the infrared and visible modalities.
  - The routing avoids expert load imbalance (one expert handling multiple tasks while others remain idle).
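A minimal soft-routing sketch for the DMoE; the paper's routing interaction is richer than this, and the pooled-feature router, expert form, and expert count are assumptions:

```python
import torch
import torch.nn as nn

class DMoESketch(nn.Module):
    """Semantic-prior-guided soft routing over degradation-specific experts."""
    def __init__(self, c, c_prior=256, n_experts=3):
        super().__init__()
        # Placeholder experts; real experts would be deeper fusion blocks.
        self.experts = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=1) for _ in range(n_experts))
        # Router conditions on both the modulated features and the semantic prior.
        self.router = nn.Linear(c + c_prior, n_experts)

    def forward(self, f_dcam, s_prior):          # (B, C, H, W), (B, S, c_prior)
        pooled = f_dcam.mean(dim=(2, 3))         # global summary of F_dcam: (B, C)
        ctx = torch.cat([pooled, s_prior.mean(dim=1)], dim=-1)
        gates = torch.softmax(self.router(ctx), dim=-1)      # (B, n_experts)
        return sum(g[:, None, None, None] * e(f_dcam)
                   for g, e in zip(gates.unbind(dim=1), self.experts))
```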
### Loss & Training
- Joint optimization of degradation restoration and multimodal fusion (one-stop scheme, not cascaded).
- VLM (BLIP-2) parameters are frozen; only the encoder, DCAM, MoE, and decoder are trained.
- Degradation prototype matrix is initialized with orthonormal normalization and set as a learnable parameter.
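A minimal sketch of this parameter split, assuming module handles for the frozen VLM and the trainable parts (the optimizer choice and learning rate are assumptions):

```python
import torch
from torch import nn

def build_optimizer(vlm: nn.Module, trainable_modules: list[nn.Module],
                    lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze BLIP-2; optimize only the encoder, DCAM, DMoE, and decoder."""
    for p in vlm.parameters():
        p.requires_grad = False      # VLM stays frozen throughout training
    params = [p for m in trainable_modules for p in m.parameters()]
    return torch.optim.AdamW(params, lr=lr)
```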
## Key Experimental Results

### Main Results

One-stop versus cascaded restoration-then-fusion pipelines on the MSRS dataset (Strategy I: a separate restoration model per degradation; Strategy II: a unified restoration model for all degradations). MdaIF outperforms all cascaded combinations under every degradation condition (Haze/Rain/Snow):
| Method | Haze PSNR↑ | Haze SSIM↑ | Rain PSNR↑ | Rain SSIM↑ | Snow PSNR↑ | Snow SSIM↑ |
|---|---|---|---|---|---|---|
| DehazeFormer+SegMiF | 17.051 | 1.046 | — | — | — | — |
| DRSformer+SegMiF | — | — | 17.308 | 0.859 | — | — |
| SnowFormer+SegMiF | — | — | — | — | 16.007 | 0.616 |
| SAGE (strongest cascade) | 17.260 | 1.231 | 17.964 | 0.993 | 17.267 | 0.897 |
| MdaIF (Ours) | 18.325 | 1.302 | 18.079 | 1.260 | 17.528 | 1.245 |
MdaIF achieves an average PSNR improvement of approximately 0.6–1.0 dB with substantial SSIM gains.
### Ablation Study
| Degradation Prototype Analysis | Observation |
|---|---|
| Haze scenario | Prototype 1 has the highest activation (~40%); Prototypes 2/3 are lower |
| Rain scenario | Prototype 2 has the highest activation, notably different from haze |
| Snow scenario | Prototype 3 has the highest activation; latent correlations exist among the three degradations |
- Each prototype learns distinct channel preference patterns (verified via radar chart visualizations), enhancing mixed representational capacity.
### Key Findings
- The one-stop scheme significantly outperforms cascaded approaches, demonstrating that jointly optimizing degradation restoration and fusion effectively avoids error accumulation.
- VLM-extracted semantic priors do more than classify the degradation: they provide deep scene-level understanding that enriches feature interaction.
- The degradation prototype decomposition mechanism leads the model to exhibit differentiated yet interrelated activation patterns across weather conditions.
## Highlights & Insights

- The elimination of dependence on degradation-type annotations is the key distinction from Text-IF and MMAIF: replacing manual annotation with VLM understanding improves practical applicability.
- Degradation prototype decomposition transforms the ambiguous notion of "degradation type" into interpretable channel response patterns, offering greater flexibility than direct one-hot routing via degradation labels.
- The MoE design is better suited than a single network for handling heterogeneous degradations, allowing each expert to focus on feature patterns associated with specific scattering models.
## Limitations & Future Work
- Only three weather degradations (haze, rain, snow) are considered; other common degradations such as low-light and overexposure are not addressed.
- BLIP-2 OPT 2.7B is computationally heavy, impacting inference speed and deployment feasibility.
- The number of degradation prototypes \(K\) is a hyperparameter; adaptive determination is not discussed.
- Validation is performed only on synthetic degradation datasets; mixed real-world degradations (e.g., simultaneous haze and rain) are not examined.
- The VQA prompt design for the VLM may affect prior quality, but the influence of different prompt choices is not analyzed in depth.
## Related Work & Insights
- Text-IF (Yi et al. 2024): CLIP-based prompt-guided fusion, but limited to low-light/overexposure conditions and dependent on ground-truth degradation labels.
- MMAIF (Cao et al. 2025): Diffusion model + Flan-T5 LLM, also reliant on fixed prompts.
- SegFormer (Xie et al. 2021): Backbone for the encoder architecture in this work.
- Inspiration: The role of VLMs as "degradation-aware sensors" could be extended to broader low-level vision tasks (a unified model for denoising, super-resolution, and deblurring).
## Rating
- Novelty: ⭐⭐⭐⭐ The VLM → degradation prototype → MoE routing pipeline is novel and eliminates reliance on degradation annotations.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset evaluation, comparison with multiple cascaded strategies, and degradation prototype visualization.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and method motivation is well articulated.
- Value: ⭐⭐⭐⭐ One-stop degradation-aware fusion is a practically essential direction.