
MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Conference: ICCV 2025 | arXiv: 2503.14944 | Code: GitHub | Area: Diffusion Models · Image Fusion | Keywords: Image Fusion, Diffusion Transformer, Multi-task, Multi-degradation, Language Guidance, Flow Matching, MoE

TL;DR

MMAIF is a unified multi-task, multi-degradation, language-guided image fusion framework that operates in latent space, combining a realistic degradation pipeline with a modernized DiT architecture. It offers both a regression variant and a Flow Matching variant, and it surpasses existing restoration+fusion pipelines across diverse degraded fusion tasks.

Background & Motivation

Image fusion aims to integrate multi-modal or multi-parameter image sequences into a single output (e.g., visible and infrared fusion VIF, multi-exposure fusion MEF, multi-focus fusion MFF). Existing methods face four key limitations:

Task-specific models: Separate networks are trained for each fusion task; a VIF model cannot be directly applied to MEF.

Neglect of real-world degradation: Models trained on clean images fail under noise, blur, rain, snow, and other degradations.

Expensive pixel-space computation: The quadratic complexity of Transformers is prohibitive in pixel space.

Lack of user interaction: No mechanism exists to guide restoration and fusion via language instructions.

The conventional solution of cascading image restoration networks before fusion increases inference complexity and may cause fusion failures on restored images.

Method

1. Realistic Degradation Pipeline

Task-specific degradation strategies are designed for VIF, MEF, and MFF:

  • General degradations: Gaussian blur, motion blur, downsampling, Gaussian noise, rain, haze, snow
  • VIF-specific: low exposure, low contrast, infrared dark stripes
  • MEF-specific: low contrast
  • MFF-specific: low/high exposure

Each image pair is degraded with a randomly sampled combination of \(n\) degradations (\(1 \le n \le 3\)) to simulate compound degradation scenarios. DepthAnything is used to estimate depth, after which an atmospheric scattering model applies more realistic haze effects.
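To make this concrete, here is a minimal sketch of the compound-degradation sampling and the atmospheric scattering haze model. The helper names and placeholder degradations are illustrative, not the paper's actual pipeline; the depth map would come from DepthAnything.

```python
import random
import numpy as np

# Hypothetical pool of degradation functions; each maps an HxWx3 float image
# in [0, 1] to a degraded image of the same shape. Implementations here are
# placeholders, not the paper's code.
GENERAL_DEGRADATIONS = {
    "gaussian_blur": lambda img: img,
    "motion_blur": lambda img: img,
    "downsample": lambda img: img,
    "gaussian_noise": lambda img: np.clip(img + 0.05 * np.random.randn(*img.shape), 0, 1),
    "rain": lambda img: img,
    "snow": lambda img: img,
}

def apply_random_degradations(img: np.ndarray) -> np.ndarray:
    """Sample 1-3 degradations and compose them (compound degradation)."""
    n = random.randint(1, 3)  # inclusive on both ends
    for name in random.sample(list(GENERAL_DEGRADATIONS), n):
        img = GENERAL_DEGRADATIONS[name](img)
    return img

def add_haze(img: np.ndarray, depth: np.ndarray,
             beta: float = 1.0, airlight: float = 0.9) -> np.ndarray:
    """Atmospheric scattering model I = J*t + A*(1 - t), with transmission
    t = exp(-beta * d) computed from an estimated depth map."""
    t = np.exp(-beta * depth)[..., None]  # HxWx1 transmission map
    return img * t + airlight * (1.0 - t)
```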

Ground-truth images are generated by pretrained SwinFusion and DeFuse; ChatGPT is used to generate 10–20 interaction prompts per degradation type.

2. Image Tokenizer Selection

Three VAEs (downsampling factor \(f=8\), latent channels \(z=16\)) are compared:

| Tokenizer   | PSNR  | SSIM   |
|-------------|-------|--------|
| Flux KL-VAE | 33.41 | 0.9227 |
| Asy. KL-VAE | 33.10 | 0.9201 |
| Cosmos VAE  | 34.02 | 0.9367 |

The Cosmos VAE is selected for its superior reconstruction performance.
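As a quick shape check of what \(f=8, z=16\) implies for latent-space processing (illustrative numbers, not from the paper):

```python
import torch

# A VAE with downsampling factor f=8 and z=16 latent channels maps a
# 512x512 RGB image to a 64x64x16 latent, i.e. 64x fewer tokens for the
# DiT to attend over (and quadratic attention cost shrinks accordingly).
f, z = 8, 16
H = W = 512
latent = torch.randn(1, z, H // f, W // f)          # shape (1, 16, 64, 64)
token_reduction = (H * W) // ((H // f) * (W // f))  # 64
print(latent.shape, token_reduction)
```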

3. Modernized DiT Architecture

Several improvements are introduced over the original DiT:

  • MoE GLU: The FFN is replaced with a mixture of 4 routed experts plus 1 shared expert, with token routing and a load-balancing loss, providing greater capacity at lower FLOPs than a dense model (a minimal sketch follows this list).
  • 2D RoPE: Replaces absolute positional encoding, enabling better resolution generalization and length extrapolation.
  • Per-block absolute positional encoding: Learnable positional encodings added before each block to eliminate artifacts during variable-resolution inference.
  • Attention value residual: \(V^l = (1-\eta) \cdot W^V X + \eta V^{l-1}\), mitigating vanishing gradients in deep networks.
  • LoRA AdaLN conditioning: The conditioning MLP is factorized into two smaller MLPs to reduce parameter count.
  • NAFNet bias convolution: Convolutional blocks are inserted before attention to provide inductive bias for improved handling of blur degradations.
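Below is a minimal sketch of the MoE GLU block from the first bullet, assuming top-1 token routing and a switch-style load-balancing loss; the 4+1 expert count matches the paper, but the routing details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUExpert(nn.Module):
    """SwiGLU-style feed-forward expert."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoEGLU(nn.Module):
    """4 routed experts + 1 always-on shared expert, top-1 routing per token."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(GLUExpert(dim, hidden) for _ in range(num_experts))
        self.shared = GLUExpert(dim, hidden)

    def forward(self, x):                        # x: (B, N, D) tokens
        probs = self.router(x).softmax(dim=-1)   # (B, N, E) routing weights
        top_p, top_i = probs.max(dim=-1)         # top-1 expert per token
        out = self.shared(x)                     # shared expert sees all tokens
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                out[mask] = out[mask] + top_p[mask, None] * expert(x[mask])
        # Switch-style load-balancing loss: pushes the routed token fraction
        # and the mean router probability toward uniform across experts.
        frac = F.one_hot(top_i, len(self.experts)).float().mean(dim=(0, 1))
        aux = len(self.experts) * (frac * probs.mean(dim=(0, 1))).sum()
        return out, aux
```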

4. Regression and Flow Matching Variants

Regression variant (timestep embedding removed):

\[\mathcal{L}_{reg} = \|f_\theta(Z_0^m, Z_1^m, P) - Z_{GT}\|_2^2\]

Flow Matching variant, where \(X_0 \sim \mathcal{N}(0, I)\) is the initial noise latent and \(Z_t = (1-t)\,X_0 + t\,Z_{GT}\):

\[\mathcal{L}_{flow} = \mathbb{E}\,\|v_\theta(Z_t, t, Z_0^m, Z_1^m, P) - (Z_{GT} - X_0)\|_2^2\]

Auxiliary fusion loss, where \(\tilde{X}\) is the decoded fused image and \(X_0, \dots, X_{m-1}\) are the source images: \(\mathcal{L}_{aux} = \sum_{i=0}^{m-1} \left( \|\tilde{X} - X_i\|_1 + \|\nabla\tilde{X} - \nabla X_i\|_1 \right)\)
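A compact sketch of both training objectives and Euler sampling for the flow variant; `model` stands in for the DiT, and the exact conditioning interface (how \(Z_0^m\), \(Z_1^m\), and the prompt embedding \(P\) enter the network) is an assumption.

```python
import torch
import torch.nn.functional as F

def regression_loss(model, z0_m, z1_m, prompt_emb, z_gt):
    """L_reg = ||f_theta(Z_0^m, Z_1^m, P) - Z_GT||_2^2 (no timestep input)."""
    return F.mse_loss(model(z0_m, z1_m, prompt_emb), z_gt)

def flow_matching_loss(model, z0_m, z1_m, prompt_emb, z_gt):
    """Regress the constant velocity Z_GT - X_0 along the linear path
    Z_t = (1 - t) * X_0 + t * Z_GT, with noise X_0 ~ N(0, I)."""
    x0 = torch.randn_like(z_gt)                        # noise latent X_0
    t = torch.rand(z_gt.shape[0], device=z_gt.device)  # one t per sample
    tb = t.view(-1, 1, 1, 1)
    z_t = (1 - tb) * x0 + tb * z_gt
    v_pred = model(z_t, t, z0_m, z1_m, prompt_emb)
    return F.mse_loss(v_pred, z_gt - x0)

@torch.no_grad()
def sample_flow(model, z0_m, z1_m, prompt_emb, shape, steps=10, device="cpu"):
    """Euler integration of dZ/dt = v_theta from t=0 (noise) to t=1 (fusion)."""
    z = torch.randn(shape, device=device)
    for i in range(steps):
        t = torch.full((shape[0],), i / steps, device=device)
        z = z + (1.0 / steps) * model(z, t, z0_m, z1_m, prompt_emb)
    return z  # decode with the VAE to obtain the fused image
```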

Key Experimental Results

Ablation Study

| Component        | PSNR (VIF) | SSIM (VIF) |
|------------------|------------|------------|
| Base DiT         | 31.02      | 0.892      |
| + MoE            | 31.45      | 0.901      |
| + RoPE           | 31.62      | 0.908      |
| + Value Residual | 31.78      | 0.912      |
| + NAFNet Conv    | 32.15      | 0.921      |

Each component yields consistent performance gains, with NAFNet convolution contributing most to blur degradation handling.

Comparison with Restoration+Fusion Pipelines

| Method               | Pipeline                    | PSNR  | SSIM  | Inference Time |
|----------------------|-----------------------------|-------|-------|----------------|
| Restormer+SwinFusion | Two-stage                   | 29.87 | 0.874 | Slow           |
| TextIF               | Single-stage (pixel space)  | 30.45 | 0.889 | Medium         |
| MMAIF-Reg            | Single-stage (latent space) | 32.15 | 0.921 | Fast           |

MMAIF substantially outperforms existing methods while simplifying the inference pipeline.

Highlights & Insights

  1. Three-in-one framework: Simultaneously addresses multi-task (VIF/MEF/MFF), multi-degradation, and language-guided fusion.
  2. Dual regression and Flow Matching variants: The regression variant enables fast inference, while Flow Matching performs better on degradations with weak priors (snow, rain).
  3. Each modernized DiT component is theoretically motivated and validated via ablation.
  4. Latent-space operation substantially reduces the computational overhead of Transformer-based processing.

Limitations & Future Work

  • Ground-truth images are generated by pretrained networks rather than real annotations, potentially introducing bias.
  • Only pairwise image fusion is supported; extending to fusion of more than two inputs remains future work.
  • MoE increases model complexity and can destabilize training.
  • Some baselines in comparison experiments may not be evaluated under fully identical conditions.

Related Work

  • Image fusion: U2Fusion, SwinFusion, PSLPT, etc.
  • Degraded image restoration and fusion: TextIF, DRMF, Text-DiFuse
  • Diffusion Transformers: DiT, Flux, SD3, and related architectures

Rating

| Dimension                 | Score (1–5) |
|---------------------------|-------------|
| Novelty                   | 4           |
| Technical Depth           | 4           |
| Experimental Thoroughness | 4           |
| Writing Quality           | 4           |
| Overall                   | 4.0         |