
MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Conference: ICCV 2025 | arXiv: 2503.14944 | Code: GitHub | Area: Diffusion Models · Image Fusion | Keywords: Image Fusion, Diffusion Transformer, Multi-task, Multi-degradation, Language Guidance, Flow Matching, MoE

TL;DR

MMAIF is a unified multi-task, multi-degradation, language-guided image fusion framework that operates in latent space, combining a realistic degradation pipeline with a modernized DiT architecture. It offers both a regression variant and a Flow Matching variant, and it surpasses existing restoration+fusion pipelines across diverse degraded fusion tasks.

Background & Motivation

Image fusion aims to integrate multi-modal or multi-parameter image sequences into a single output (e.g., visible and infrared fusion VIF, multi-exposure fusion MEF, multi-focus fusion MFF). Existing methods face four key limitations:

Task-specific models: Separate networks are trained for each fusion task; a VIF model cannot be directly applied to MEF.

Neglect of real-world degradation: Models trained on clean images fail under noise, blur, rain, snow, and other degradations.

Expensive pixel-space computation: The quadratic complexity of Transformers is prohibitive in pixel space.

Lack of user interaction: No mechanism exists to guide restoration and fusion via language instructions.

The conventional solution of cascading image restoration networks before fusion increases inference complexity and may cause fusion failures on restored images.

Method

1. Realistic Degradation Pipeline

Task-specific degradation strategies are designed for VIF, MEF, and MFF:

  • General degradations: Gaussian blur, motion blur, downsampling, Gaussian noise, rain, haze, snow
  • VIF-specific: low exposure, low contrast, infrared dark stripes
  • MEF-specific: low contrast
  • MFF-specific: low/high exposure

Each image pair is degraded with a randomly sampled combination of \(n\) degradations (\(1 \le n \le 3\)) to simulate compound degradation scenarios. DepthAnything is used to estimate depth, after which an atmospheric scattering model applies more realistic haze effects.
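To make this concrete, here is a minimal sketch of the compound-degradation sampling and the atmospheric scattering haze model. The helper names and placeholder degradations are illustrative, not the paper's actual pipeline; the depth map would come from DepthAnything.

```python
import random
import numpy as np

# Hypothetical pool of degradation functions; each maps an HxWx3 float image
# in [0, 1] to a degraded image of the same shape. Implementations here are
# placeholders, not the paper's code.
GENERAL_DEGRADATIONS = {
    "gaussian_blur": lambda img: img,
    "motion_blur": lambda img: img,
    "downsample": lambda img: img,
    "gaussian_noise": lambda img: np.clip(img + 0.05 * np.random.randn(*img.shape), 0, 1),
    "rain": lambda img: img,
    "snow": lambda img: img,
}

def apply_random_degradations(img: np.ndarray) -> np.ndarray:
    """Sample 1-3 degradations and compose them (compound degradation)."""
    n = random.randint(1, 3)  # inclusive on both ends
    for name in random.sample(list(GENERAL_DEGRADATIONS), n):
        img = GENERAL_DEGRADATIONS[name](img)
    return img

def add_haze(img: np.ndarray, depth: np.ndarray,
             beta: float = 1.0, airlight: float = 0.9) -> np.ndarray:
    """Atmospheric scattering model I = J*t + A*(1 - t), with transmission
    t = exp(-beta * d) computed from an estimated depth map."""
    t = np.exp(-beta * depth)[..., None]  # HxWx1 transmission map
    return img * t + airlight * (1.0 - t)
```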

Ground-truth images are generated by pretrained SwinFusion and DeFuse; ChatGPT is used to generate 10–20 interaction prompts per degradation type.

2. Image Tokenizer Selection

Three VAEs (downsampling factor \(f=8\), latent channels \(z=16\)) are compared:

| Tokenizer   | PSNR  | SSIM   |
|-------------|-------|--------|
| Flux KL-VAE | 33.41 | 0.9227 |
| Asy. KL-VAE | 33.10 | 0.9201 |
| Cosmos VAE  | 34.02 | 0.9367 |

The Cosmos VAE is selected for its superior reconstruction performance.
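As a quick shape check of what \(f=8, z=16\) implies for latent-space processing (illustrative numbers, not from the paper):

```python
import torch

# A VAE with downsampling factor f=8 and z=16 latent channels maps a
# 512x512 RGB image to a 64x64x16 latent, i.e. 64x fewer tokens for the
# DiT to attend over (and quadratic attention cost shrinks accordingly).
f, z = 8, 16
H = W = 512
latent = torch.randn(1, z, H // f, W // f)          # shape (1, 16, 64, 64)
token_reduction = (H * W) // ((H // f) * (W // f))  # 64
print(latent.shape, token_reduction)
```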

3. Modernized DiT Architecture

Several improvements are introduced over the original DiT:

  • MoE GLU: The FFN is replaced with a mixture of 4 routed experts plus 1 shared expert, with token routing and a load-balancing loss, providing greater capacity at lower FLOPs than a dense model (a minimal sketch follows this list).
  • 2D RoPE: Replaces absolute positional encoding, enabling better resolution generalization and length extrapolation.
  • Per-block absolute positional encoding: Learnable positional encodings added before each block to eliminate artifacts during variable-resolution inference.
  • Attention value residual: \(V^l = (1-\eta) \cdot W^V X + \eta V^{l-1}\), mitigating vanishing gradients in deep networks.
  • LoRA AdaLN conditioning: The conditioning MLP is factorized into two smaller MLPs to reduce parameter count.
  • NAFNet bias convolution: Convolutional blocks are inserted before attention to provide inductive bias for improved handling of blur degradations.
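Below is a minimal sketch of the MoE GLU block from the first bullet, assuming top-1 token routing and a switch-style load-balancing loss; the 4+1 expert count matches the paper, but the routing details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUExpert(nn.Module):
    """SwiGLU-style feed-forward expert."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoEGLU(nn.Module):
    """4 routed experts + 1 always-on shared expert, top-1 routing per token."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(GLUExpert(dim, hidden) for _ in range(num_experts))
        self.shared = GLUExpert(dim, hidden)

    def forward(self, x):                        # x: (B, N, D) tokens
        probs = self.router(x).softmax(dim=-1)   # (B, N, E) routing weights
        top_p, top_i = probs.max(dim=-1)         # top-1 expert per token
        out = self.shared(x)                     # shared expert sees all tokens
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                out[mask] = out[mask] + top_p[mask, None] * expert(x[mask])
        # Switch-style load-balancing loss: pushes the routed token fraction
        # and the mean router probability toward uniform across experts.
        frac = F.one_hot(top_i, len(self.experts)).float().mean(dim=(0, 1))
        aux = len(self.experts) * (frac * probs.mean(dim=(0, 1))).sum()
        return out, aux
```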

4. Regression and Flow Matching Variants

Regression variant (timestep embedding removed):

\[\mathcal{L}_{reg} = \|f_\theta(Z_0^m, Z_1^m, P) - Z_{GT}\|_2^2\]

Flow Matching variant, where \(X_0 \sim \mathcal{N}(0, I)\) is the initial noise latent and \(Z_t = (1-t)\,X_0 + t\,Z_{GT}\):

\[\mathcal{L}_{flow} = \mathbb{E}\,\|v_\theta(Z_t, t, Z_0^m, Z_1^m, P) - (Z_{GT} - X_0)\|_2^2\]

Auxiliary fusion loss, where \(\tilde{X}\) is the decoded fused image and \(X_0, \dots, X_{m-1}\) are the source images: \(\mathcal{L}_{aux} = \sum_{i=0}^{m-1} \left( \|\tilde{X} - X_i\|_1 + \|\nabla\tilde{X} - \nabla X_i\|_1 \right)\)
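A compact sketch of both training objectives and Euler sampling for the flow variant; `model` stands in for the DiT, and the exact conditioning interface (how \(Z_0^m\), \(Z_1^m\), and the prompt embedding \(P\) enter the network) is an assumption.

```python
import torch
import torch.nn.functional as F

def regression_loss(model, z0_m, z1_m, prompt_emb, z_gt):
    """L_reg = ||f_theta(Z_0^m, Z_1^m, P) - Z_GT||_2^2 (no timestep input)."""
    return F.mse_loss(model(z0_m, z1_m, prompt_emb), z_gt)

def flow_matching_loss(model, z0_m, z1_m, prompt_emb, z_gt):
    """Regress the constant velocity Z_GT - X_0 along the linear path
    Z_t = (1 - t) * X_0 + t * Z_GT, with noise X_0 ~ N(0, I)."""
    x0 = torch.randn_like(z_gt)                        # noise latent X_0
    t = torch.rand(z_gt.shape[0], device=z_gt.device)  # one t per sample
    tb = t.view(-1, 1, 1, 1)
    z_t = (1 - tb) * x0 + tb * z_gt
    v_pred = model(z_t, t, z0_m, z1_m, prompt_emb)
    return F.mse_loss(v_pred, z_gt - x0)

@torch.no_grad()
def sample_flow(model, z0_m, z1_m, prompt_emb, shape, steps=10, device="cpu"):
    """Euler integration of dZ/dt = v_theta from t=0 (noise) to t=1 (fusion)."""
    z = torch.randn(shape, device=device)
    for i in range(steps):
        t = torch.full((shape[0],), i / steps, device=device)
        z = z + (1.0 / steps) * model(z, t, z0_m, z1_m, prompt_emb)
    return z  # decode with the VAE to obtain the fused image
```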

Key Experimental Results

Ablation Study

| Component        | PSNR (VIF) | SSIM (VIF) |
|------------------|------------|------------|
| Base DiT         | 31.02      | 0.892      |
| + MoE            | 31.45      | 0.901      |
| + RoPE           | 31.62      | 0.908      |
| + Value Residual | 31.78      | 0.912      |
| + NAFNet Conv    | 32.15      | 0.921      |

Each component yields consistent performance gains, with NAFNet convolution contributing most to blur degradation handling.

Comparison with Restoration+Fusion Pipelines

| Method               | Pipeline                    | PSNR  | SSIM  | Inference Time |
|----------------------|-----------------------------|-------|-------|----------------|
| Restormer+SwinFusion | Two-stage                   | 29.87 | 0.874 | Slow           |
| TextIF               | Single-stage (pixel space)  | 30.45 | 0.889 | Medium         |
| MMAIF-Reg            | Single-stage (latent space) | 32.15 | 0.921 | Fast           |

MMAIF substantially outperforms existing methods while simplifying the inference pipeline.

Highlights & Insights

  1. Three-in-one framework: Simultaneously addresses multi-task (VIF/MEF/MFF), multi-degradation, and language-guided fusion.
  2. Dual regression and Flow Matching variants: The regression variant enables fast inference, while Flow Matching performs better on degradations with weak priors (snow, rain).
  3. Each modernized DiT component is theoretically motivated and validated via ablation.
  4. Latent-space operation substantially reduces the computational overhead of Transformer-based processing.

Limitations & Future Work

  • Ground-truth images are generated by pretrained networks rather than real annotations, potentially introducing bias.
  • Only pairwise image fusion is supported; extending to fusion of more than two inputs remains future work.
  • MoE increases model complexity and can destabilize training.
  • Some baselines in comparison experiments may not be evaluated under fully identical conditions.

Related Work

  • Image fusion: U2Fusion, SwinFusion, PSLPT, etc.
  • Degraded image restoration and fusion: TextIF, DRMF, Text-DiFuse
  • Diffusion Transformers: DiT, Flux, SD3, and related architectures

Rating

| Dimension                 | Score (1–5) |
|---------------------------|-------------|
| Novelty                   | 4           |
| Technical Depth           | 4           |
| Experimental Thoroughness | 4           |
| Writing Quality           | 4           |
| Overall                   | 4.0         |