MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance¶
Conference: ICCV 2025 arXiv: 2503.14944 Code: GitHub Area: Diffusion Models · Image Fusion Keywords: Image Fusion, Diffusion Transformer, Multi-task, Multi-degradation, Language Guidance, Flow Matching, MoE
TL;DR¶
MMAIF proposes a unified multi-task, multi-degradation, language-guided image fusion framework that operates in latent space via a realistic degradation pipeline and a modernized DiT architecture. It offers both a regression and a Flow Matching variant, surpassing existing restoration+fusion pipelines across diverse degraded fusion tasks.
Background & Motivation¶
Image fusion aims to integrate multi-modal or differently captured image inputs into a single informative output, e.g., visible–infrared fusion (VIF), multi-exposure fusion (MEF), and multi-focus fusion (MFF). Existing methods face four key limitations:
Task-specific models: Separate networks are trained for each fusion task; a VIF model cannot be directly applied to MEF.
Neglect of real-world degradation: Models trained on clean images fail under noise, blur, rain, snow, and other degradations.
Expensive pixel-space computation: The quadratic complexity of Transformers is prohibitive in pixel space.
Lack of user interaction: No mechanism exists to guide restoration and fusion via language instructions.
The conventional solution of cascading image restoration networks before fusion increases inference complexity and may cause fusion failures on restored images.
Method¶
1. Realistic Degradation Pipeline¶
Task-specific degradation strategies are designed for VIF, MEF, and MFF:
- General degradations: Gaussian blur, motion blur, downsampling, Gaussian noise, rain, haze, snow
- VIF-specific: low exposure, low contrast, infrared dark stripes
- MEF-specific: low contrast
- MFF-specific: low/high exposure
Each image pair is assigned a random combination of \(n\) degradations (\(1 \le n \le 3\)) to simulate compound degradation scenarios. DepthAnything estimates scene depth, after which an atmospheric scattering model synthesizes more realistic haze effects.
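The sampling step above can be sketched as follows; the operation names and pool layout are illustrative stand-ins, not the paper's actual code:

```python
import random

# Hypothetical degradation identifiers mirroring the lists in the paper;
# the actual pipeline applies real image operations for each name.
GENERAL = ["gaussian_blur", "motion_blur", "downsample",
           "gaussian_noise", "rain", "haze", "snow"]
TASK_SPECIFIC = {
    "VIF": ["low_exposure", "low_contrast", "ir_dark_stripes"],
    "MEF": ["low_contrast"],
    "MFF": ["low_exposure", "high_exposure"],
}

def sample_degradations(task, rng=random):
    """Pick 1-3 distinct degradations for one image pair,
    mixing general and task-specific operations."""
    pool = GENERAL + TASK_SPECIFIC[task]
    n = rng.randint(1, 3)  # compound degradation: 1 to 3 ops per pair
    return rng.sample(pool, n)
```

Sampling without replacement keeps combinations realistic (e.g., haze plus noise) while avoiding the same degradation being stacked twice.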
Ground-truth images are generated by pretrained SwinFusion and DeFuse; ChatGPT is used to generate 10–20 interaction prompts per degradation type.
2. Image Tokenizer Selection¶
Three VAEs (\(f=8, z=16\)) are compared:
| Tokenizer | PSNR | SSIM |
|---|---|---|
| Flux KL-VAE | 33.41 | 0.9227 |
| Asy. KL-VAE | 33.10 | 0.9201 |
| Cosmos VAE | 34.02 | 0.9367 |
The Cosmos VAE is selected for its superior reconstruction performance.
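A shape-level sketch of what the \(f=8, z=16\) tokenizer interface implies for latent-space processing; the toy encoder/decoder below is a stand-in for illustration, not the Cosmos VAE itself:

```python
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    """Stand-in VAE interface: spatial downsampling f=8, z=16 latent channels."""
    def __init__(self, f=8, z=16):
        super().__init__()
        # Single strided conv / transposed conv as placeholders for the
        # real encoder and decoder stacks.
        self.encoder = nn.Conv2d(3, z, kernel_size=f, stride=f)
        self.decoder = nn.ConvTranspose2d(z, 3, kernel_size=f, stride=f)

    def encode(self, x):
        return self.encoder(x)   # (B, 3, H, W) -> (B, z, H/f, W/f)

    def decode(self, z):
        return self.decoder(z)   # back to (B, 3, H, W)

tok = ToyTokenizer()
x = torch.randn(1, 3, 256, 256)
lat = tok.encode(x)  # (1, 16, 32, 32): 64x fewer spatial positions
```

The 8x downsampling in each spatial dimension cuts the token count by 64x, which is what makes quadratic attention affordable in latent space.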
3. Modernized DiT Architecture¶
Several improvements are introduced over the original DiT:
- MoE GLU: The FFN is replaced with a mixture of 4 experts plus 1 shared expert with token routing and load-balancing loss, providing greater capacity at lower FLOPs compared to dense models.
- 2D RoPE: Replaces absolute positional encoding, enabling better resolution generalization and length extrapolation.
- Per-block absolute positional encoding: Learnable positional encodings added before each block to eliminate artifacts during variable-resolution inference.
- Attention value residual: \(V^l = (1-\eta) \cdot W^V X + \eta V^{l-1}\), mitigating vanishing gradients in deep networks.
- LoRA AdaLN conditioning: The conditioning MLP is factorized into two smaller MLPs to reduce parameter count.
- NAFNet bias convolution: Convolutional blocks are inserted before attention to provide inductive bias for improved handling of blur degradations.
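Of the components above, the attention value residual is the simplest to state in code. A minimal sketch of \(V^l = (1-\eta) \cdot W^V X + \eta V^{l-1}\), with \(\eta\) as a fixed scalar here (it could equally be learned per block):

```python
import torch

def value_residual(x, w_v, v_prev, eta=0.5):
    """Attention value residual: V^l = (1 - eta) * (X @ W^V) + eta * V^{l-1}.

    x: (B, N, D) tokens; w_v: (D, D) value projection;
    v_prev: value tensor from the previous block, or None for the first block.
    """
    v = x @ w_v
    if v_prev is None:           # first block has no earlier values to reuse
        return v
    return (1.0 - eta) * v + eta * v_prev

B, N, D = 2, 64, 32
x = torch.randn(B, N, D)
w_v = torch.randn(D, D)
v1 = value_residual(x, w_v, None)
v2 = value_residual(x, w_v, v1)  # second block mixes in first-block values
```

Carrying part of the first-layer values through every block gives each attention layer a direct path back to shallow features, which is the claimed mitigation for vanishing gradients in deep stacks.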
4. Regression and Flow Matching Variants¶
Regression variant: the network directly predicts the clean fused latent in a single forward pass; the timestep embedding is removed since no diffusion timestep is involved.
Flow Matching variant: the network predicts a velocity field in latent space, and the fused latent is obtained by integrating the learned ODE over a small number of sampling steps.
Auxiliary fusion loss, where \(\tilde{X}\) is the fused output and \(X_0, \dots, X_{m-1}\) are the source images: \(\mathcal{L}_{aux} = \sum_{i=0}^{m-1} \|\tilde{X} - X_i\|_1 + \|\nabla\tilde{X} - \nabla X_i\|_1\)
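The auxiliary loss can be sketched directly from the formula, using simple finite differences as a stand-in for the gradient operator \(\nabla\) (the paper does not specify its exact discretization):

```python
import torch
import torch.nn.functional as F

def grad(img):
    """Finite-difference image gradients (illustrative stand-in for nabla)."""
    dx = img[..., :, 1:] - img[..., :, :-1]   # horizontal differences
    dy = img[..., 1:, :] - img[..., :-1, :]   # vertical differences
    return dx, dy

def aux_fusion_loss(fused, sources):
    """L_aux = sum_i ||X~ - X_i||_1 + ||grad X~ - grad X_i||_1
    over the m source images."""
    loss = fused.new_zeros(())
    fdx, fdy = grad(fused)
    for x in sources:
        dx, dy = grad(x)
        loss = loss + F.l1_loss(fused, x) \
                    + F.l1_loss(fdx, dx) + F.l1_loss(fdy, dy)
    return loss
```

The intensity term keeps the fused image close to every source, while the gradient term pushes it to retain edges and texture from each modality.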
Key Experimental Results¶
Ablation Study¶
| Component | PSNR (VIF) | SSIM (VIF) |
|---|---|---|
| Base DiT | 31.02 | 0.892 |
| + MoE | 31.45 | 0.901 |
| + RoPE | 31.62 | 0.908 |
| + Value Residual | 31.78 | 0.912 |
| + NAFNet Conv | 32.15 | 0.921 |
Each component yields consistent performance gains, with NAFNet convolution contributing most to blur degradation handling.
Comparison with Restoration+Fusion Pipelines¶
| Method | Pipeline | PSNR | SSIM | Inference Time |
|---|---|---|---|---|
| Restormer+SwinFusion | Two-stage | 29.87 | 0.874 | Slow |
| TextIF | Single-stage (pixel space) | 30.45 | 0.889 | Medium |
| MMAIF-Reg | Single-stage (latent space) | 32.15 | 0.921 | Fast |
MMAIF substantially outperforms existing methods while simplifying the inference pipeline.
Highlights & Insights¶
- Three-in-one framework: Simultaneously addresses multi-task (VIF/MEF/MFF), multi-degradation, and language-guided fusion.
- Dual regression and Flow Matching variants: The regression variant enables fast inference, while Flow Matching performs better on degradations with weak priors (snow, rain).
- Each modernized DiT component is theoretically motivated and validated via ablation.
- Latent-space operation substantially reduces the computational overhead of Transformer-based processing.
Limitations & Future Work¶
- Ground-truth images are generated by pretrained networks rather than real annotations, potentially introducing bias.
- Only pairwise image fusion is supported; extending to fusion of more than two inputs remains future work.
- MoE increases model complexity and training instability.
- Some baselines in comparison experiments may not be evaluated under fully identical conditions.
Related Work & Insights¶
- Image fusion: U2Fusion, SwinFusion, PSLPT, etc.
- Degraded image restoration and fusion: TextIF, DRMF, Text-DiFuse
- Diffusion Transformers: DiT, Flux, SD3, and related architectures
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Depth | 4 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Overall | 4.0 |