
Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion

Conference: AAAI2026 arXiv: 2511.12432 Code: To be confirmed Area: Image Fusion / Multi-Modality Keywords: Multi-modality image fusion, unified model, channel perturbation, CLIP text guidance, pretrained knowledge

TL;DR

This paper proposes UP-Fusion, a unified multi-modality image fusion framework built from three modules — a Semantic-aware Channel Pruning Module (SCPM), Geometric Affine Modulation (GAM), and a CLIP Text-guided Channel Perturbation Module (TCPM) — that uses a single set of weights, trained solely on infrared-visible data, to handle both infrared-visible image fusion (IVIF) and medical image fusion, achieving state-of-the-art performance on both.

Background & Motivation

Background: Multi-modality image fusion encompasses two major categories: infrared-visible image fusion (IVIF) and medical image fusion (MEIF). Existing methods typically design task-specific models for each category, lacking a unified framework.

Limitations of Prior Work: (1) Task-specific models fail to generalize across modalities — IVIF models perform poorly on medical fusion and vice versa; (2) unified methods either sacrifice fusion quality or require multi-task training data; (3) direct injection of modality features tends to cause modality overfitting.

Key Challenge: How can a single model with a single set of weights simultaneously handle multiple modality combinations while maintaining quality across different fusion tasks?

Goal: To build a unified framework that, trained on only a single task (IVIF), generalizes to other modality fusion tasks.

Key Insight: Reducing modality dependence through channel perturbation (rather than direct feature injection), and leveraging pretrained knowledge (ConvNeXt + CLIP) to provide cross-task generalization capability.

Core Idea: Channel perturbation combined with pretrained knowledge enables "modality-agnostic" unified fusion.

Method

Overall Architecture

UP-Fusion uses a Transformer encoder-decoder architecture (4 encoder layers / 4 decoder layers). After the encoder extracts multi-modality features, SCPM performs semantic-aware channel pruning, GAM applies geometric affine modulation, and TCPM conducts text-guided channel perturbation. Training is performed solely on LLVIP (infrared-visible). A structural skeleton of this flow is sketched below.
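
The following skeleton illustrates only the wiring described above; plain convolutional blocks stand in for the Transformer layers, the three modules are marked by comments (each is sketched individually under Key Designs), and all widths and names are illustrative assumptions rather than the authors' code.

```python
# Structural skeleton of the UP-Fusion pipeline (illustrative, not the authors' code).
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Module:
    # Conv + GELU stand-in for one Transformer encoder/decoder layer.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.GELU())

class UPFusionSkeleton(nn.Module):
    def __init__(self, width: int = 64):
        super().__init__()
        # Shared 4-layer encoder applied to each modality, 4-layer decoder.
        self.encoder = nn.Sequential(conv_block(1, width),
                                     *[conv_block(width, width) for _ in range(3)])
        self.decoder = nn.Sequential(*[conv_block(width, width) for _ in range(3)],
                                     nn.Conv2d(width, 1, 3, padding=1))

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        f_vis, f_ir = self.encoder(vis), self.encoder(ir)
        fused = f_vis + f_ir          # placeholder fusion of the two branches
        # 1) SCPM: semantic-aware channel pruning guided by ConvNeXt features
        # 2) GAM:  geometric affine modulation per modality
        # 3) TCPM: CLIP text-guided channel perturbation
        return self.decoder(fused)

# e.g. UPFusionSkeleton()(torch.rand(1, 1, 192, 192), torch.rand(1, 1, 192, 192))
```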

Key Designs

  1. Semantic-aware Channel Pruning Module (SCPM):

    • Function: Guides channel selection using pretrained semantic knowledge.
    • Mechanism: An SE-block computes channel importance \(\omega_C\); a pretrained ConvNeXt extracts semantic features mapped to \(\omega_S\); the fused weight is \(\omega_F = \omega_C + \alpha \cdot \sigma(\omega_S)\) (with learnable \(\alpha\)). Top-k selection retains 70% of the channels, and a 1×1 convolution expands the result back to the original dimension (see the SCPM sketch after this list).
    • Design Motivation: ConvNeXt's semantic priors facilitate correct channel selection across different modalities, serving as a key enabler of cross-task generalization.
  2. Geometric Affine Modulation Module (GAM):

    • Function: Adapts fused features to the geometric characteristics of each modality.
    • Mechanism: Global average pooling is applied to the raw modality features; two 1×1 convolution layers generate the scale \(\gamma\) and shift \(\beta\); the affine transform is applied as \(F_O^M = Fuse^M \cdot (1 + \gamma) + \beta\) (see the GAM sketch after this list).
    • Design Motivation: Affine transformation rather than direct feature injection is employed to avoid modality overfitting.
  3. Text-guided Channel Perturbation Module (TCPM):

    • Function: Guides channel rearrangement using CLIP text features.
    • Mechanism: Multi-modality features are concatenated → channel attention selects the top 50% of channels → a 1×1 convolution expands to 2× channels → CLIP encodes the task text → a linear mapping produces guidance weights → channel rearrangement → self-attention (perturbed features as Q, original features as K/V); see the TCPM sketch after this list.
    • Design Motivation: Channel perturbation is less prone to overfitting specific modalities than direct conditioning.
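
A minimal PyTorch sketch of SCPM under the description above. The ConvNeXt semantic descriptor is assumed to be precomputed and pooled to a vector (`semantic_dim` is a placeholder), the zero initialization of \(\alpha\) is an assumption, and the class and argument names are illustrative rather than the authors' code.

```python
# Semantic-aware Channel Pruning Module (SCPM) -- hedged reconstruction.
import torch
import torch.nn as nn

class SCPM(nn.Module):
    def __init__(self, channels: int, semantic_dim: int = 768, keep_ratio: float = 0.7):
        super().__init__()
        self.k = max(1, int(channels * keep_ratio))            # keep 70% of channels
        # SE-style branch: channel importance omega_C from the features themselves
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid(),
        )
        # Semantic branch: pooled ConvNeXt features mapped to omega_S
        self.sem_proj = nn.Linear(semantic_dim, channels)
        self.alpha = nn.Parameter(torch.zeros(1))              # learnable fusion weight alpha
        self.expand = nn.Conv2d(self.k, channels, kernel_size=1)  # 1x1 conv back to full width

    def forward(self, x: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; sem_feat: (B, semantic_dim) ConvNeXt descriptor
        w_c = self.se(x)                                       # (B, C)
        w_s = torch.sigmoid(self.sem_proj(sem_feat))           # sigma(omega_S), (B, C)
        w_f = w_c + self.alpha * w_s                           # omega_F = omega_C + alpha * sigma(omega_S)
        idx = w_f.topk(self.k, dim=1).indices                  # per-sample top-k channel indices
        idx = idx[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        pruned = torch.gather(x * w_f[:, :, None, None], 1, idx)  # keep selected, reweighted channels
        return self.expand(pruned)                             # restore the original channel count
```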
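
A minimal sketch of GAM, read as a FiLM-style modulation: the scale and shift are predicted from globally pooled raw modality features. Using one 1×1 convolution per parameter is an interpretation of "two 1×1 convolution layers"; all names are illustrative.

```python
# Geometric Affine Modulation (GAM) -- hedged reconstruction.
import torch
import torch.nn as nn

class GAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.to_gamma = nn.Conv2d(channels, channels, kernel_size=1)  # predicts scale gamma
        self.to_beta = nn.Conv2d(channels, channels, kernel_size=1)   # predicts shift beta

    def forward(self, fused: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # fused:    (B, C, H, W) shared fused features (Fuse^M)
        # modality: (B, C, H, W) raw features of one modality
        stats = modality.mean(dim=(2, 3), keepdim=True)   # global average pooling -> (B, C, 1, 1)
        gamma = self.to_gamma(stats)
        beta = self.to_beta(stats)
        return fused * (1.0 + gamma) + beta               # F_O^M = Fuse^M * (1 + gamma) + beta
```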
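
The TCPM sketch below follows the arrow chain above, but the paper does not fully specify the rearrangement rule or the attention layout; here the text-derived guidance weights re-rank channels and spatial positions serve as attention tokens, which is one plausible reading. CLIP text features are assumed to be precomputed (e.g. a 512-d vector), and all names are illustrative.

```python
# Text-guided Channel Perturbation Module (TCPM) -- hedged reconstruction.
import torch
import torch.nn as nn

class TCPM(nn.Module):
    def __init__(self, channels: int, text_dim: int = 512):
        super().__init__()
        c2 = channels * 2                                    # two modalities concatenated
        self.attn = nn.Sequential(                           # channel attention over concat features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c2, c2), nn.Sigmoid(),
        )
        self.expand = nn.Conv2d(c2 // 2, c2, kernel_size=1)  # top-50% channels -> 2x width
        self.text_proj = nn.Linear(text_dim, c2)             # CLIP text -> guidance weights
        self.mha = nn.MultiheadAttention(embed_dim=c2, num_heads=2, batch_first=True)

    def forward(self, f_a: torch.Tensor, f_b: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_a, f_b], dim=1)                     # (B, 2C, H, W)
        b, c2, h, w = x.shape
        w_ch = self.attn(x)                                  # (B, 2C) channel attention
        idx = w_ch.topk(c2 // 2, dim=1).indices              # keep top 50% of channels
        sel = torch.gather(x, 1, idx[:, :, None, None].expand(-1, -1, h, w))
        x_exp = self.expand(sel)                             # 1x1 conv back to 2C channels
        guide = self.text_proj(text)                         # (B, 2C) text-derived guidance weights
        order = guide.argsort(dim=1, descending=True)        # channel rearrangement by guidance rank
        perturbed = torch.gather(x_exp, 1, order[:, :, None, None].expand(-1, -1, h, w))
        # attention over spatial tokens: perturbed features as Q, original features as K/V
        q = perturbed.flatten(2).transpose(1, 2)             # (B, H*W, 2C)
        kv = x.flatten(2).transpose(1, 2)
        out, _ = self.mha(q, kv, kv)
        return out.transpose(1, 2).view(b, c2, h, w)
```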

Loss & Training

\(L_T = L_{grad} + L_{l_1}\). Training is conducted solely on LLVIP for 100 epochs with resolution 192×192, Adam optimizer, and LR decayed from 0.0001 to 0.00001 via cosine annealing.
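
A hedged sketch of the training loss \(L_T = L_{grad} + L_{l_1}\). Implementing the gradient term with Sobel filters and computing both terms against the element-wise maximum of the two source images is a common choice in fusion papers, but it is an assumption here rather than the paper's exact formulation.

```python
# Sketch of L_T = L_grad + L_l1 (reference targets are assumptions).
import torch
import torch.nn.functional as F

def sobel_grad(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude of a (B, C, H, W) tensor."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    c = img.size(1)
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return gx.abs() + gy.abs()

def fusion_loss(fused: torch.Tensor, src_a: torch.Tensor, src_b: torch.Tensor) -> torch.Tensor:
    # L_l1: intensity fidelity against the element-wise max of the sources (assumption)
    l_int = F.l1_loss(fused, torch.max(src_a, src_b))
    # L_grad: gradient fidelity against the stronger source gradient at each pixel
    l_grad = F.l1_loss(sobel_grad(fused), torch.max(sobel_grad(src_a), sobel_grad(src_b)))
    return l_grad + l_int   # L_T = L_grad + L_l1
```

The optimizer setup described above would correspond to torch.optim.Adam with a cosine-annealing schedule decaying the learning rate from 1e-4 to 1e-5 over the 100 epochs.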

Key Experimental Results

Main Results (Infrared-Visible Fusion)

| Method | MSRS \(Q_P\) | MSRS VIF | LLVIP VIF | M3FD VIF |
| --- | --- | --- | --- | --- |
| SAGE | 0.5210 | 0.4359 | 0.3590 | 0.4110 |
| TDFusion | 0.5529 | 0.4257 | 0.3577 | 0.4041 |
| UP-Fusion | 0.5671 | 0.4587 | 0.3817 | 0.4582 |

Medical Fusion (Surpassing Task-Specific Methods)

| Method | Harvard \(Q_P\) | Harvard VIF |
| --- | --- | --- |
| ALMFnet (dedicated) | 0.5434 | 0.3003 |
| UP-Fusion | 0.5665 | 0.3190 |

Ablation Study

| Variant | \(Q_P\) | VIF | SSIM |
| --- | --- | --- | --- |
| w/o SCPM | 0.5343 | 0.3046 | 0.2645 |
| w/o TCPM | 0.5221 | 0.3016 | 0.2824 |
| UP-Fusion | 0.5665 | 0.3190 | 0.3639 |

Key Findings

  • Training on IVIF alone is sufficient to surpass dedicated MEIF methods.
  • TCPM is the most critical module — removing it causes a \(Q_P\) drop of 0.044.
  • Downstream tasks also achieve top performance: segmentation mIoU 78.28, detection mAP@0.5 0.841.

Highlights & Insights

  • "Train on one task, generalize to many" is the primary contribution: channel perturbation combined with pretrained knowledge achieves genuine modality-agnosticism.
  • Channel perturbation as a substitute for direct conditioning: rather than directly injecting modality features, the method indirectly influences representations through channel rearrangement, elegantly avoiding overfitting.

Limitations & Future Work

  • CLIP text guidance requires a text description for each fusion task, limiting the degree of automation.
  • Validation is restricted to infrared-visible and medical fusion; performance on remote sensing fusion tasks such as SAR or multispectral imagery remains unknown.
  • The 70% retention rate in SCPM and the 50% rate in TCPM are fixed hyperparameters.
  • vs. EMMA: EMMA achieves unified fusion but requires multi-task training data, whereas UP-Fusion requires only single-task training.
  • vs. TDFusion: TDFusion employs text-driven fusion but remains modality-specific; UP-Fusion's channel perturbation is more modality-agnostic.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of channel perturbation, CLIP text guidance, and pretrained knowledge enables multi-task generalization from single-task training.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ IVIF (3 datasets) + MEIF (2 datasets) + downstream tasks + detailed ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Module design is clearly presented.
  • Value: ⭐⭐⭐⭐ The unified fusion framework has direct practical value for industrial applications.