Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

- Conference: CVPR 2026
- arXiv: 2603.08018
- Code: https://github.com/harukiv/DCMIF
- Area: Interpretability
- Keywords: Infrared-visible fusion, missing modality, convolutional dictionary learning, coefficient-domain inference, large language model prior

TL;DR

This paper proposes the first framework that performs cross-modal fusion under missing infrared conditions in the coefficient domain rather than the pixel domain. By learning a shared convolutional dictionary that establishes a unified IR-VIS atomic space, the method performs VIS→IR inference and adaptive fusion entirely in the coefficient domain. A frozen LLM provides weak semantic priors for thermal information completion. The approach achieves performance comparable to dual-modality fusion methods using only visible light images as input.

Background & Motivation

Infrared-visible (IR-VIS) image fusion is critical for robust perception in surveillance, robotics, and autonomous driving systems. Existing methods (CNN, CNN-Transformer, GAN, diffusion models) assume both modalities are available at both training and inference time. In practice, however, the infrared modality is frequently absent (e.g., only a visible camera is available at test time).

When infrared is missing, a straightforward approach is to generate pseudo-infrared images in pixel space before fusion. However, pixel-space generation suffers from severe drawbacks: poor controllability, weak interpretability, and susceptibility to hallucination artifacts and loss of structural detail.

Key Challenge: How can thermal information be stably recovered and interpretable fusion performed when infrared is absent? The paper's Key Insight is: rather than generating infrared in pixel space, both modalities are mapped into a unified dictionary-coefficient space, where inference and fusion are performed entirely in the coefficient domain, thereby anchoring data consistency and prior constraints at the atom-coefficient level.

Method

Overall Architecture

The complete pipeline forms a closed-loop Encode → Transfer → Fuse → Reconstruct process:

1. JSRL (Joint Shared Representation Learning): Learns a shared convolutional dictionary \(\mathbf{D}\) for both IR and VIS, mapping both modalities into a unified atomic space.
2. VGII (Visible-Guided Infrared Inference): Transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain, with a single-step closed-loop refinement guided by weak semantic priors from a frozen LLM.
3. AFRI (Adaptive Fusion via Representation Inference): Fuses VIS and inferred IR coefficients at the atom level via windowed attention and convolution hybrid blocks, reconstructing the final image using the shared dictionary.
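
The data flow of the closed loop can be sketched as a composition of three learned stages followed by dictionary synthesis. This is only a wiring sketch: `encode`, `infer_ir`, and `gate` are hypothetical callables standing in for the trained REN, RIN, and RFN networks, and the synthesis uses circular convolution with the kernel anchored at the origin as a simplifying assumption.

```python
import numpy as np

def synthesize(D, S):
    """Convolutional synthesis I = sum_b D_b * S_b (circular boundary, via FFT).

    D: (B, k, k) dictionary atoms, S: (B, H, W) coefficient maps.
    """
    H, W = S.shape[-2:]
    D_hat = np.fft.fft2(D, s=(H, W))  # zero-pad each atom to the image size
    return np.real(np.fft.ifft2((D_hat * np.fft.fft2(S)).sum(axis=0)))

def fuse_visible_only(i_vis, D, encode, infer_ir, gate):
    """Encode -> Transfer -> Fuse -> Reconstruct using only a visible image."""
    s_vis = encode(i_vis)                 # Encode: VIS image -> coefficients
    s_pir = infer_ir(s_vis)               # Transfer: VIS -> pseudo-IR coefficients
    w_vis, w_pir = gate(s_vis, s_pir)     # Fuse: atom-level gating weights
    s_f = w_vis * s_vis + w_pir * s_pir
    return synthesize(D, s_f)             # Reconstruct with the shared dictionary

# Toy stand-ins (not the paper's networks): one delta atom, trivial stages.
D_toy = np.zeros((1, 3, 3))
D_toy[0, 0, 0] = 1.0
i_vis_toy = np.random.default_rng(0).random((8, 8))
fused_toy = fuse_visible_only(
    i_vis_toy,
    D_toy,
    encode=lambda img: img[None],
    infer_ir=lambda s: 0.5 * s,
    gate=lambda a, b: (np.full_like(a, 0.5), np.full_like(b, 0.5)),
)
```

Every intermediate quantity after `encode` is a coefficient tensor; pixels only reappear at the final synthesis, which is the property the summary emphasizes.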

Key Designs

JSRL — Joint Shared Representation Learning:
- Function: Learns a cross-modal shared dictionary \(\mathbf{D} \in \mathbb{R}^{B \times k \times k}\) such that both VIS and IR can be represented as \(\mathbf{I} = \mathbf{D} * \mathbf{S}\).
- Mechanism: Jointly minimizes reconstruction error for both modalities, coefficient priors, and dictionary regularization: \(\min_{\mathbf{D},\mathbf{S}_{vis},\mathbf{S}_{ir}} \frac{1}{2}\|\mathbf{I}_{vis} - \mathbf{D}*\mathbf{S}_{vis}\|_F^2 + \frac{1}{2}\|\mathbf{I}_{ir} - \mathbf{D}*\mathbf{S}_{ir}\|_F^2 + \lambda_1\varphi_1(\mathbf{S}_{vis}) + \lambda_2\varphi_2(\mathbf{S}_{ir}) + \lambda_3\phi(\mathbf{D})\)
- Implemented via model-driven unfolding: alternating data consistency steps (frequency-domain Sherman-Morrison formula) and proximal update steps (learnable proxies CoeNet/DicNet).
- Architecture: \(N\) cascaded IV-DLBs (Infrared-Visible Dictionary Learning Blocks), each containing two coefficient solvers and one dictionary solver, with hyperparameters adaptively predicted by HypNet.
- Design Motivation: The shared dictionary establishes atom-level correspondences between the two modalities, providing an interpretable unified representation space for subsequent coefficient-domain inference.
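
To make the data-consistency step more concrete, the following is a minimal NumPy sketch of a frequency-domain coefficient update solved with the Sherman-Morrison identity. It assumes a single image, circular boundary conditions, and a quadratic coupling \(\frac{\rho}{2}\|\mathbf{S} - \mathbf{Z}\|_F^2\) to an auxiliary variable `Z` (for example, the output of a proximal network); the names `Z` and `rho` and this exact splitting are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def synthesize(D, S):
    """I = sum_b D_b * S_b with circular convolution, computed via FFT."""
    H, W = S.shape[-2:]
    D_hat = np.fft.fft2(D, s=(H, W))
    return np.real(np.fft.ifft2((D_hat * np.fft.fft2(S)).sum(axis=0)))

def coefficient_step(I, D, Z, rho):
    """Closed-form minimizer of 0.5*||I - D*S||^2 + 0.5*rho*||S - Z||^2.

    At every frequency the normal equations read (a a^H + rho*Id) s = r with
    a = conj(d_hat) and r = a * i_hat + rho * z_hat. The matrix is a rank-one
    update of a scaled identity, so Sherman-Morrison gives
        s = (r - a * (a^H r) / (rho + a^H a)) / rho
    without forming or inverting any B x B matrix.
    """
    H, W = I.shape
    D_hat = np.fft.fft2(D, s=(H, W))               # (B, H, W)
    a = np.conj(D_hat)
    r = a * np.fft.fft2(I)[None] + rho * np.fft.fft2(Z)
    a_H_r = (D_hat * r).sum(axis=0)                # a^H r at each frequency
    a_H_a = (np.abs(D_hat) ** 2).sum(axis=0)       # a^H a at each frequency
    S_hat = (r - a * a_H_r / (rho + a_H_a)) / rho
    return np.real(np.fft.ifft2(S_hat))

def objective(I, D, S, Z, rho):
    """Value of the quadratic subproblem, used here only as a sanity check."""
    return 0.5 * np.sum((I - synthesize(D, S)) ** 2) + 0.5 * rho * np.sum((S - Z) ** 2)

# Toy usage with random data: B = 4 atoms of size 5x5 on a 16x16 image.
rng = np.random.default_rng(0)
D = rng.standard_normal((4, 5, 5))
I = rng.standard_normal((16, 16))
Z = rng.standard_normal((4, 16, 16))
S = coefficient_step(I, D, Z, rho=0.5)
```

In an unfolded network, a step of this kind would alternate with a learned proximal update such as CoeNet, and the penalty weight would be one of the hyperparameters predicted by HypNet.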
VGII — Visible-Guided Infrared Inference:
- Function: Infers pseudo-IR coefficients \(\mathbf{S}_{p\_ir}\) from VIS coefficients \(\tilde{\mathbf{S}}_{vis}\).
- Mechanism:
  - A frozen REN (Representation Encoding Network, comprising pretrained HeadNet + CSB + CoeNet) encodes VIS into coefficients.
  - RIN (Representation Inference Network, encoder-decoder + multi-head attention) maps VIS coefficients to pseudo-IR coefficients.
  - LLM weak semantic prior refinement: An initial pseudo-infrared image \(\mathbf{I}_{p\_ir}^{(0)}\) is reconstructed; the image pair {VIS, pseudo-IR} and a task description are fed as a prompt to a frozen LLM; the extracted text features \(\mathbf{F}_{text}\) modulate the VIS coefficients via FiLM (Feature-wise Linear Modulation), \(\mathbf{S}_{fm} = \gamma \odot \tilde{\mathbf{S}}_{vis} + \beta\), where \(\gamma\) and \(\beta\) are the scale and shift derived from \(\mathbf{F}_{text}\); passing \(\mathbf{S}_{fm}\) through RIN again yields the refined coefficients \(\mathbf{S}_{p\_ir}^{(1)}\).
- Loss function: \(\ell_{inf} = \ell_{int} + \ell_{reg} + \ell_{grad}\)
  - Consistency loss \(\ell_{int}\): L1 distance between pseudo-IR and real IR in both image domain and coefficient domain.
  - Thermal regularization \(\ell_{reg}\): emphasizes thermal region alignment via normalized weight maps.
  - Gradient loss \(\ell_{grad}\): preserves edge consistency \(\|\nabla\mathbf{I}_{p\_ir} - \nabla\mathbf{I}_{vis}\|_1\).
- Design Motivation: The LLM does not generate pixels; it acts solely as a "semantic reviewer" providing channel-wise linear modulation, making it lightweight and controllable. Performing inference in the coefficient domain rather than pixel space inherits the interpretability of the dictionary.
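
The FiLM step can be illustrated with a short NumPy sketch. It assumes that the scale and shift are produced from the text feature vector by two linear projections and applied per dictionary atom (channel); the projection parameters `w_gamma`, `b_gamma`, `w_beta`, and `b_beta` and the feature size are hypothetical, since the summary does not specify how \(\gamma\) and \(\beta\) are computed.

```python
import numpy as np

def film_modulate(s_vis, f_text, w_gamma, b_gamma, w_beta, b_beta):
    """Channel-wise FiLM: S_fm = gamma * S_vis + beta.

    s_vis:  (B, H, W) visible coefficient maps, one channel per atom.
    f_text: (T,) text feature vector from the frozen LLM.
    w_*:    (B, T) projection matrices, b_*: (B,) biases (assumed linear heads).
    """
    gamma = w_gamma @ f_text + b_gamma   # (B,) per-atom scale
    beta = w_beta @ f_text + b_beta      # (B,) per-atom shift
    return gamma[:, None, None] * s_vis + beta[:, None, None]

# Toy usage: 4 atoms, 8-dimensional text feature.
rng = np.random.default_rng(0)
s_vis = rng.standard_normal((4, 6, 6))
f_text = rng.standard_normal(8)

# With zero projections, unit scale bias, and zero shift bias, FiLM is the identity.
s_identity = film_modulate(
    s_vis, f_text,
    w_gamma=np.zeros((4, 8)), b_gamma=np.ones(4),
    w_beta=np.zeros((4, 8)), b_beta=np.zeros(4),
)
```

Because the modulation is only a per-channel affine map, the text prior can rescale or bias individual atoms but cannot introduce new spatial structure, which is consistent with the "semantic reviewer" role described above.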
AFRI — Adaptive Fusion:
- Function: Fuses VIS coefficients and inferred IR coefficients at the atom level to reconstruct the final image.
- Mechanism: RFN (Representation Fusion Network) employs two cascaded Convolution-Attention Fusion blocks to learn implicit atom-level gating weights \((\mathbf{W}_{vis}, \mathbf{W}_{p\_ir})\): \(\mathbf{S}_f = \mathbf{W}_{vis} \odot \tilde{\mathbf{S}}_{vis} + \mathbf{W}_{p\_ir} \odot \mathbf{S}_{p\_ir}^{(1)}\)
- Fusion loss: \(\ell_f = \|\mathbf{I}_f - \max(\mathbf{I}_{p\_ir}, \mathbf{I}_{vis})\|_1 + \|\nabla\mathbf{I}_f - \max(\nabla\mathbf{I}_{p\_ir}, \nabla\mathbf{I}_{vis})\|_1\)
- Design Motivation: The element-wise max operation encourages the fused result to inherit peak thermal intensity from IR and sharp structural edges from VIS; gating in the coefficient domain allows structure-oriented atoms to favor VIS and thermal-semantic atoms to favor IR.
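
The gated fusion and the max-based loss can be sketched in NumPy as follows. Several choices here are assumptions made for illustration: the gating weights come from a two-way softmax so that they sum to one, gradients are approximated by absolute forward differences, and the L1 norms are averaged over pixels. The paper's RFN and gradient operator may differ.

```python
import numpy as np

def gate_weights(logits_vis, logits_pir):
    """Two-way softmax over the sources (assumed), computed stably."""
    m = np.maximum(logits_vis, logits_pir)
    e_vis = np.exp(logits_vis - m)
    e_pir = np.exp(logits_pir - m)
    total = e_vis + e_pir
    return e_vis / total, e_pir / total

def fuse_coefficients(s_vis, s_pir, w_vis, w_pir):
    """S_f = W_vis * S_vis + W_p_ir * S_p_ir, element-wise per atom and position."""
    return w_vis * s_vis + w_pir * s_pir

def grad_mag(img):
    """Sum of absolute forward differences along both axes (assumed operator)."""
    gy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
    gx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    return gx + gy

def fusion_loss(i_f, i_pir, i_vis):
    """Intensity and gradient terms, each targeting the element-wise maximum."""
    intensity = np.abs(i_f - np.maximum(i_pir, i_vis)).mean()
    gradient = np.abs(grad_mag(i_f) - np.maximum(grad_mag(i_pir), grad_mag(i_vis))).mean()
    return intensity + gradient

# Toy usage: random logits and coefficients for 4 atoms on a 6x6 grid.
rng = np.random.default_rng(0)
s_vis = rng.standard_normal((4, 6, 6))
s_pir = rng.standard_normal((4, 6, 6))
w_vis, w_pir = gate_weights(rng.standard_normal((4, 6, 6)), rng.standard_normal((4, 6, 6)))
s_f = fuse_coefficients(s_vis, s_pir, w_vis, w_pir)
```

With weights that sum to one, each fused coefficient is a convex combination of the two sources, so an atom can lean toward VIS or toward the inferred IR at each location.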

Loss & Training

The three modules are trained sequentially: JSRL → VGII → AFRI. JSRL is trained for 1,000 epochs on MSRS, and the learned dictionary is transferred directly to the other datasets; VGII and AFRI are each trained for 10 epochs. Training uses the Adam optimizer and 5×5 dictionary kernels, and runs on two RTX 4090 GPUs.

Key Experimental Results

Main Results

| Method | Input | MSRS AG↑ | MSRS EN↑ | FLIR AG↑ | FLIR EN↑ | KAIST AG↑ |
|---|---|---|---|---|---|---|
| CDDFuse | IR+VIS | 4.818 | 7.321 | 5.079 | 6.766 | 3.167 |
| EMMA | IR+VIS | 4.913 | 7.333 | 3.796 | 6.489 | 3.083 |
| DCEvo | IR+VIS | 4.858 | 7.298 | 4.585 | 6.763 | 3.229 |
| Ours | VIS only | 5.037 | 7.188 | 4.518 | 6.639 | 4.414 |

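Key Finding: Using only visible light as input, the proposed method achieves the highest AG (Average Gradient) on MSRS (5.037) and KAIST (4.414), ahead of all three dual-modality baselines listed. On FLIR its AG (4.518) exceeds EMMA but trails DCEvo and CDDFuse, and its EN stays within 0.15 of the best dual-modality score on both MSRS and FLIR.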

Downstream tasks: On M3FD object detection (YOLOv5), the method reaches mAP = 0.948 vs. 0.956 for SAGE (dual-modality), a gap of 0.008. On FMB semantic segmentation (SegFormer-b5), it reaches mIoU = 62.939 vs. 62.942 for LRRNet (dual-modality), essentially on par.

Ablation Study

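| Configuration | Dictionary | LLM | AG↑ | CE↓ | EI↑ | EN↑ | SF↑ |
|---|---|---|---|---|---|---|---|
| Model I (Baseline) | ✗ | ✗ | 3.320 | 1.452 | 45.531 | 6.058 | 9.238 |
| Model II | ✓ | ✗ | 4.363 | 1.046 | 48.351 | 6.578 | 11.936 |
| Model III | ✗ | ✓ | 4.256 | 0.619 | 48.154 | 6.423 | 11.175 |
| Ours | ✓ | ✓ | 4.518 | 0.596 | 48.784 | 6.639 | 12.554 |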

Key Findings

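- The shared dictionary contributes the largest gain on AG, EI, EN, and SF (Model I→II: AG 3.320 → 4.363, about +31%), validating the effectiveness of the coefficient-domain paradigm; on CE, the LLM prior alone (Model III) helps more.
- LLM modulation provides additional semantic enhancement (CE reduced from 1.046 to 0.596 when added on top of the dictionary), particularly in terms of brightness and contrast.
- The two components are complementary, and their combination yields the best result on all five metrics.
- Using only VIS input recovers roughly 90% or more of dual-modality performance on the reported metrics; the lowest ratio is FLIR AG (4.518 vs. 5.079 for CDDFuse, about 89%).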
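
The percentages quoted in these findings follow directly from the tables above; the short script below reproduces the arithmetic. The helper name `rel_change` is introduced here for illustration only.

```python
def rel_change(baseline, value):
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline

# AG and CE values copied from the ablation table.
ablation = {
    "Model I": {"AG": 3.320, "CE": 1.452},
    "Model II": {"AG": 4.363, "CE": 1.046},
    "Model III": {"AG": 4.256, "CE": 0.619},
    "Ours": {"AG": 4.518, "CE": 0.596},
}

# Gain from adding only the dictionary, and from adding only the LLM prior.
ag_gain_dict = rel_change(ablation["Model I"]["AG"], ablation["Model II"]["AG"])
ag_gain_llm = rel_change(ablation["Model I"]["AG"], ablation["Model III"]["AG"])

# CE change when the LLM prior is added on top of the dictionary.
ce_change = rel_change(ablation["Model II"]["CE"], ablation["Ours"]["CE"])

# VIS-only vs. best dual-modality AG on FLIR (CDDFuse), from the main table.
flir_ag_ratio = 4.518 / 5.079

print(f"AG gain, dictionary only: {ag_gain_dict:+.1%}")
print(f"AG gain, LLM only:        {ag_gain_llm:+.1%}")
print(f"CE change, II -> Ours:    {ce_change:+.1%}")
print(f"FLIR AG ratio vs CDDFuse: {flir_ag_ratio:.1%}")
```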

Highlights & Insights

- Paradigm Innovation: The first coefficient-domain inference-fusion scheme for the missing infrared problem, avoiding the instability of pixel-space generation.
- Clever Use of LLM: The LLM is not used to generate images but solely as a semantic-level FiLM modulator, making it extremely lightweight and effective.
- Training Simplicity: No adversarial training or diffusion sampling is required; VGII and AFRI each require only 10 epochs.
- Strong Interpretability: All computation is performed in a unified atomic space, with dictionary atoms providing intuitive physical meaning.
- Closed-Loop Design: Encoding → inference → fusion → reconstruction all take place within the same dictionary-coefficient space, ensuring representational consistency.

Limitations & Future Work

- The shared dictionary trained on MSRS is directly transferred; retraining may be required for scenes with large domain gaps (e.g., medical infrared).
- LLM processing introduces inference latency, which must be considered in real-time scenarios.
- The accuracy ceiling of coefficient-domain inference is bounded by dictionary capacity; very high-resolution images or fine-grained thermal details may suffer information loss.
- Only the missing-infrared scenario is validated; missing visible light or other multi-modal combinations are not explored.
- The method assumes that VIS images contain sufficient structural cues to infer thermal information; completely dark scenes may cause failures.

- Key distinction from dual-modality SOTAs such as CDDFuse and EMMA: the proposed method requires only a single-modality input.
- Model-driven unfolding (DKSVD, Learned-CSC) provides the theoretical foundation for dictionary learning.
- FiLM modulation (originating from conditional generation) is innovatively applied here for LLM semantic → coefficient-space modulation.
- Inspiration: The interpretability-oriented coefficient-domain paradigm is potentially generalizable to other missing-modality tasks (e.g., missing modality in MRI-CT fusion).

Rating

- Novelty: ⭐⭐⭐⭐⭐ The combination of coefficient-domain inference-fusion paradigm and LLM weak priors is entirely original in this field.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three fusion datasets + two downstream tasks + comprehensive ablation; cross-dataset generalization analysis is lacking.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous, architectural diagrams are clear, and motivation is well articulated.
- Value: ⭐⭐⭐⭐ First work to address missing-infrared fusion; has practical application prospects; the dictionary paradigm is generalizable.