
Balancing Task-Invariant Interaction and Task-Specific Adaptation for Unified Image Fusion

Conference: ICCV 2025 · arXiv: 2504.05164 · Code: github.com/huxingyuabc/TITA · Area: Image Fusion · Keywords: Unified Image Fusion, Multi-task Learning, Pixel Attention, Adaptive Fusion, Gradient Conflict

TL;DR

TITA proposes a unified image fusion framework that requires no task identifier at inference. It employs an Interaction-enhanced Pixel Attention (IPA) module to extract task-invariant complementary information, an Operation-based Adaptive Fusion (OAF) module to adapt dynamically to task-specific requirements, and the FAMO strategy to mitigate multi-task gradient conflicts.

Background & Motivation

Image fusion aims to integrate complementary information from multi-source images to enhance image quality, covering diverse scenarios such as infrared-visible fusion (IVF), multi-exposure fusion (MEF), and multi-focus fusion (MFF). Existing methods face three core challenges:

Unified methods neglect task specificity: Existing unified fusion methods (e.g., U2Fusion, PMGI) treat different fusion tasks as a single problem using shared architectures and unified objective functions. Although they achieve task-invariant knowledge sharing, they ignore the distinct physical characteristics of each task (e.g., IVF emphasizes thermal infrared saliency, MEF balances luminance, MFF extracts in-focus regions), thereby limiting overall performance.

General methods rely on task identifiers: General fusion methods (e.g., SwinFusion, TC-MoA) introduce task-specific adaptation but require explicit task identifiers at inference to select corresponding model branches or loss functions, which limits generalization to unseen tasks.

Multi-task gradient conflicts: The optimization directions of different fusion tasks may be mutually contradictory, and naively averaging gradients can degrade performance on certain tasks.

This paper addresses all three challenges at once: it leverages the commonality across fusion tasks (complementary information extraction) while adapting to each task's specific characteristics, without relying on task identifiers.

Method

Overall Architecture

TITA is built upon the SwinFusion architecture and consists of two stages:

  1. Task-invariant interaction stage: \(L\) stacked Interaction-enhanced SwinFusion (ISF) modules, each containing one IPA module.
  2. Task-specific adaptation stage: the Operation-based Adaptive Fusion (OAF) module.

Key Designs

  1. Interaction-enhanced Pixel Attention (IPA) Module:

    • An improvement over the Pixel Attention (PA) mechanism, which dynamically modulates the weights of self-attention and cross-attention via a relation discriminator \(\phi_{\theta_s}(\cdot)\).
    • Two key modifications in IPA:
      • Removal of direct noise injection on V: The noise injection on Value in PA may cause irreversible information loss.
      • Reinforced cross-attention preference: When the relation discriminator score is high (i.e., the two sources are more uncorrelated), the cross-attention weight is explicitly increased. The key construction becomes: \(K_1 = [(N_1 + X_{i,1} - X_{i,1}\phi_{\theta_s})W_K, (N_2 + X_{i,2} + X_{i,2}\phi_{\theta_s})W_K]\)
    • Design motivation: A higher uncorrelatedness score implies more complementary information to be extracted via cross-attention.
    • All intra-domain fusion blocks in SwinFusion are replaced with inter-domain fusion (ISF), further increasing the number of cross-attention operations.
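The reinforced cross-attention preference can be illustrated with a minimal NumPy sketch (not the authors' implementation; the projection matrices, the noise terms \(N_1, N_2\), and a scalar discriminator score \(\phi\) are simplifying assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ipa_attention(x1, x2, phi, w_q, w_k, w_v, n1=0.0, n2=0.0):
    """Simplified IPA step: query from source 1 attends over keys from
    both sources. A higher discriminator score phi (more uncorrelated
    sources) down-weights the self key and up-weights the cross key,
    shifting attention toward cross-source interaction. No noise is
    added to V, mirroring IPA's removal of noise injection on Value."""
    q = x1 @ w_q
    k_self = (n1 + x1 - x1 * phi) @ w_k   # self key, suppressed by phi
    k_cross = (n2 + x2 + x2 * phi) @ w_k  # cross key, reinforced by phi
    k = np.concatenate([k_self, k_cross], axis=0)
    v = np.concatenate([x1 @ w_v, x2 @ w_v], axis=0)  # V kept noise-free
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v
```

With phi = 0 this reduces to ordinary joint self/cross attention; with phi approaching 1 the self keys vanish and attention concentrates on the other source.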
  2. Operation-based Adaptive Fusion (OAF) Module:

    • Three parallel operation branches:
      • HPF branch: Spatially varying high-pass filtering to capture high-frequency textures and edges.
      • ADD branch: Residual addition for overall information enhancement.
      • MUL branch: Element-wise multiplication to facilitate nonlinear feature interaction.
    • The operands (e.g., convolution kernels) of each branch are predicted from input features by a hypernetwork (2-layer MLP).
    • The dynamic weights \(W\) for the three branches are predicted by a separate weight prediction network from dual-source features \((X_1, X_2)\).
    • Output: \(X_f = \sum_{o \in \{h,a,m\}} \left( W_{o,1} \cdot \hat{X}_{o,1} + W_{o,2} \cdot \hat{X}_{o,2} \right)\), where \(W_{o,1}, W_{o,2}\) are the dynamic weights of branch \(o\) for the two sources.
    • Design motivation: Different fusion tasks have different demands for high-frequency preservation, overall enhancement, and nonlinear combination. Dynamic weights enable automatic adaptation to task characteristics without explicit task identifiers.
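A single-channel sketch of the OAF combination (hypothetical simplification: the high-pass kernel is fixed rather than hypernetwork-predicted, and the branch weights are passed in rather than predicted from the dual-source features):

```python
import numpy as np

def hpf(x, k):
    """'Same' 2-D correlation with zero padding (spatial high-pass filter)."""
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (p[i:i + 3, j:j + 3] * k).sum()
    return out

def oaf_fuse(x1, x2, w1, w2, kernel):
    """Weighted sum of HPF / ADD / MUL branch outputs for both sources.

    w1, w2: 3-vectors of per-branch weights for source 1 and source 2;
    in TITA these come from a weight prediction network, and the kernel
    from a 2-layer-MLP hypernetwork."""
    branches1 = [hpf(x1, kernel), x1 + x2, x1 * x2]  # HPF, ADD, MUL on source 1
    branches2 = [hpf(x2, kernel), x2 + x1, x2 * x1]  # HPF, ADD, MUL on source 2
    return sum(a * b1 + b * b2
               for a, b1, b, b2 in zip(w1, branches1, w2, branches2))
```

For instance, putting all weight on the ADD branch of source 1 reduces the output to plain addition of the two inputs; a task-adaptive weight network interpolates between such behaviors per pixel.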
  3. FAMO Multi-objective Optimization Strategy:

    • Experiments reveal severe gradient conflicts (large angular and magnitude discrepancies) when directly averaging multi-task gradients.
    • FAMO is adopted: learnable logits \(\xi_t\) generate softmax weights to dynamically adjust the loss weight of each task.
    • FAMO equalizes the loss reduction rates across tasks to achieve fair optimization.
    • Design motivation: The introduction of the TA module exacerbates gradient conflicts; FAMO effectively alleviates this issue (ablation studies confirm that TA benefits most from MO).
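The weighting idea can be shown with a toy FAMO-style sketch (NumPy; the logit update below is a simplified proxy for FAMO's actual amortized update, included only to show how slow-improving tasks gain weight):

```python
import numpy as np

class FAMOWeights:
    """Toy FAMO-style task weighting.

    Learnable logits xi produce softmax weights over tasks. After each
    step, xi is nudged so tasks whose log-losses fell more slowly than
    average receive more weight, equalizing loss reduction rates. The
    update rule here is a simplified stand-in, not the FAMO paper's."""

    def __init__(self, n_tasks, lr=0.025):
        self.xi = np.zeros(n_tasks)  # learnable logits
        self.lr = lr
        self.prev = None

    def weights(self):
        e = np.exp(self.xi - self.xi.max())
        return e / e.sum()

    def update(self, losses):
        losses = np.asarray(losses, dtype=float)
        if self.prev is not None:
            delta = np.log(self.prev) - np.log(losses)  # per-task improvement
            self.xi += self.lr * (delta.mean() - delta)  # boost slow tasks
        self.prev = losses
```

Running this on two tasks where one loss stalls shows the stalled task's weight drifting above 1/2, which is the fairness behavior the paper relies on.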

Loss & Training

Task-specific objectives are used (task identifiers are required during training but not at inference):

\[\ell = \lambda_1 \ell_{ssim} + \lambda_2 \ell_{text} + \lambda_3 \ell_{int}\]

where \(\ell_{text}\) (texture loss) applies maximum gradient constraints, and \(\ell_{int}\) (intensity loss) uses task-specific aggregation (max for IVF/MFF, mean for MEF). Training configuration: Adam optimizer (lr=2e-5), batch size 8, 20,000 iterations, with uniform sampling across tasks.
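A toy version of this objective (NumPy; the SSIM term is replaced by a crude L2 stand-in, gradients use forward differences rather than a Sobel operator, and the \(\lambda\) values are placeholders):

```python
import numpy as np

def grad_mag(x):
    """Gradient magnitude via forward differences (Sobel stand-in)."""
    gx = np.abs(np.diff(x, axis=1, append=x[:, -1:]))
    gy = np.abs(np.diff(x, axis=0, append=x[-1:, :]))
    return gx + gy

def fusion_loss(fused, x1, x2, task, lam1=1.0, lam2=1.0, lam3=1.0):
    """l = lam1*l_ssim + lam2*l_text + lam3*l_int (toy version).

    Intensity target: element-wise max for IVF/MFF, mean for MEF.
    Texture target: element-wise max of the source gradient magnitudes."""
    agg = np.maximum(x1, x2) if task in ("ivf", "mff") else (x1 + x2) / 2
    l_int = np.abs(fused - agg).mean()
    l_text = np.abs(grad_mag(fused)
                    - np.maximum(grad_mag(x1), grad_mag(x2))).mean()
    l_ssim = ((fused - agg) ** 2).mean()  # crude structural stand-in, not SSIM
    return lam1 * l_ssim + lam2 * l_text + lam3 * l_int
```

Note how only the intensity aggregation switches with the task, which is why task identifiers are needed during training but nowhere in the inference path.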

Key Experimental Results

Main Results — Quantitative Comparison on Three Fusion Tasks

IVF Task (LLVIP dataset):

| Method | Type | MI ↑ | FMI ↑ | Qabf ↑ | VIF ↑ |
|---|---|---|---|---|---|
| SwinFusion | General | 3.873 | 0.889 | 0.650 | 0.907 |
| TC-MoA | General | 3.606 | 0.886 | 0.600 | 0.925 |
| Text-IF | Dedicated | 3.322 | 0.892 | 0.684 | 0.932 |
| CCF | Unified | 2.789 | 0.881 | 0.499 | 0.719 |
| TITA | Unified | 4.176 | 0.896 | 0.679 | 0.926 |

MFF Task (Lytro+MFFW+MFI-WHU):

| Method | Type | MI ↑ | FMI ↑ | Qabf ↑ | SSIM ↑ |
|---|---|---|---|---|---|
| IFCNN | Dedicated (MFF) | 6.495 | 0.882 | 0.658 | 0.991 |
| SwinFusion | General | 6.261 | 0.881 | 0.687 | 0.991 |
| TITA | Unified | 6.546 | 0.885 | 0.697 | 0.993 |

Without using task identifiers, TITA surpasses dedicated and general methods on multiple metrics.

Ablation Study — Contribution of Three Components (IVF)

| TI | TA | MO | MI ↑ | FMI ↑ | Qabf ↑ | VIF ↑ |
|---|---|---|---|---|---|---|
|   |   |   | 3.612 | 0.889 | 0.646 | 0.845 |
| ✓ |   |   | 3.882 | 0.892 | 0.664 | 0.904 |
|   | ✓ |   | 3.685 | 0.891 | 0.662 | 0.853 |
|   |   | ✓ | 3.680 | 0.891 | 0.662 | 0.853 |
| ✓ | ✓ |   | 3.883 | 0.893 | 0.666 | 0.906 |
|   | ✓ | ✓ | 4.122 | 0.895 | 0.676 | 0.919 |
| ✓ | ✓ | ✓ | 4.176 | 0.896 | 0.680 | 0.926 |

All three components are mutually reinforcing; MO yields the largest gain for TA, confirming that TA introduces gradient conflicts that must be mitigated by MO.

Key Findings

  1. Generalization to unseen tasks: TITA performs well on unseen tasks such as medical image fusion (MIF) and pansharpening (PAN), while FusionDN and CCF completely collapse on unseen tasks.
  2. More cross-attention is better: IeSF (all inter-domain) > SF (original) > IrSF (all intra-domain), validating the importance of multi-source interaction.
  3. The MUL branch is the most critical in OAF: Removing the MUL branch causes the largest performance drop, as image fusion inherently involves extensive nonlinear operations.
  4. Dynamic weight visualization: OAF automatically assigns different operation weight distributions to different fusion tasks (e.g., HPF weights are smaller but indispensable in the MFF task).

Highlights & Insights

  • The paper accurately identifies the three-fold challenge of unified fusion frameworks (task invariance, task specificity, gradient conflict) and provides a systematic solution.
  • The causal design in IPA — "higher uncorrelatedness → larger cross-attention weight" — is intuitively sound: regions with stronger complementarity require more cross-modal interaction.
  • Only 1.39M parameters, making the framework lightweight and efficient.
  • The design requiring no task identifier at inference allows the framework to be directly applied to any new fusion task.

Limitations & Future Work

  • Only three operation branches (HPF, ADD, MUL) are included in OAF, which may be insufficient to cover all fusion task requirements.
  • Training data is imbalanced across tasks (IVF: 12,025 pairs vs. MFF: 800 pairs); although uniform sampling is applied, its effectiveness remains limited.
  • Compared to recent methods leveraging diffusion models or large language models (e.g., DDFM, Text-IF), there remains a gap in perceptual quality.
  • Relation to SwinFusion: TITA inherits its architecture but extends the general method into a unified framework that requires no task identifier.
  • Relation to TC-MoA: TC-MoA uses a mixture of experts to adapt to different tasks but requires task identifiers; TITA's OAF implicitly achieves similar functionality through dynamic weights.
  • FAMO, as a multi-task optimization strategy, is generalizable and can be extended to other multi-task vision systems.

Rating

  • Novelty: ⭐⭐⭐⭐ (Systematic solution to three-fold challenges; well-motivated component designs)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Three tasks + two unseen tasks + detailed ablations)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure; motivations are thoroughly articulated)
  • Value: ⭐⭐⭐⭐ (Advances the state of unified image fusion; convincing generalization validation)