VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion¶
Conference: CVPR 2026 arXiv: 2503.23359 Code: https://github.com/Linfeng-Tang/VideoFusion Area: Image Fusion / Video Processing Keywords: multi-modal video fusion, infrared-visible, temporal consistency, cross-modal, dataset
TL;DR¶
This paper introduces the M3SVD large-scale infrared-visible video dataset (220 videos / 150K frames) and proposes the VideoFusion framework, which achieves spatio-temporal collaborative multi-modal video fusion via a Cross-modal Differential Reinforcement Module (CmDRM), Complete Modal Guided Fusion (CMGF), Bidirectional Co-Attention Module (BiCAM), and a variational consistency loss. The method surpasses existing image fusion and video fusion approaches in both fusion quality and temporal consistency.
Background & Motivation¶
Multi-sensor fusion research has primarily focused on static image fusion, with a large body of methods employing AE/CNN/GAN/Transformer/Diffusion architectures for infrared-visible image fusion. However, sensors in real-world scenarios capture continuous video streams; applying image fusion methods frame-by-frame ignores inter-frame temporal dependencies, leading to flickering artifacts and temporal inconsistency. The main bottlenecks are the lack of large-scale temporally aligned multi-modal video datasets (existing benchmarks such as TNO contain only 3 videos/114 frames, INO suffers from low resolution, and HDO from poor quality), and the challenge of jointly modeling spatial and temporal dependencies within a unified framework.
Core Problem¶
How to simultaneously exploit cross-modal complementary information and inter-frame temporal dependencies in multi-modal video fusion to generate high-quality, temporally consistent fused videos?
Method¶
Overall Architecture¶
An encoder-decoder architecture is adopted. During encoding: 3D convolutions extract shallow temporal features → downsampling + ResBlocks + CmDRM extract multi-scale enhanced features → CMGF aggregates cross-modal context at multiple scales. During decoding: Transformer enhancement blocks → BiCAM establishes cross-frame temporal dependencies → a fusion decoder with a modal separation module reconstructs the fused video along with the restored infrared and visible modality videos.
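To make the encoder stage concrete, here is a minimal sketch (not the authors' code) of the "3D conv → downsampling → ResBlock" front end, assuming PyTorch; the module name, channel widths, and kernel choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ShallowEncoder3D(nn.Module):
    """Sketch of the encoder front end: 3D conv for shallow spatio-temporal
    features, spatial downsampling, then a residual refinement block."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.stem = nn.Conv3d(in_ch, ch, kernel_size=3, padding=1)   # shallow temporal features
        self.down = nn.Conv3d(ch, ch * 2, kernel_size=(1, 3, 3),
                              stride=(1, 2, 2), padding=(0, 1, 1))   # downsample space, keep time
        self.res = nn.Sequential(                                    # ResBlock stand-in
            nn.Conv3d(ch * 2, ch * 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch * 2, ch * 2, 3, padding=1))

    def forward(self, x):                      # x: (B, C, T, H, W) video clip
        f = torch.relu(self.stem(x))
        f = torch.relu(self.down(f))
        return f + self.res(f)                 # residual refinement

ir = torch.randn(1, 1, 7, 64, 64)              # T=7 matches the paper's training window
feats = ShallowEncoder3D()(ir)
print(feats.shape)                             # torch.Size([1, 32, 7, 32, 32])
```

The `(1, 3, 3)` kernel with stride `(1, 2, 2)` halves spatial resolution while leaving the temporal axis untouched, so inter-frame structure survives for the later temporal modules.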
Key Designs¶
- Cross-modal Differential Reinforcement Module (CmDRM): The core idea is to extract cross-modal differential information (complementary rather than redundant). A differential feature \(\mathcal{F}_d^t = \mathcal{F}_{ir}^t - \mathcal{F}_{vi}^t\) is computed and used as Key/Value, while the original modality features serve as Query for cross-attention. A learnable contribution metric then adaptively fuses the original and differential-enhanced features, followed by channel and spatial attention refinement.
- Complete Modal Guided Fusion (CMGF): The sum of dual-modality features \(\mathcal{F}_c^t = \hat{\mathcal{F}}_{ir}^t + \hat{\mathcal{F}}_{vi}^t\) serves as a shared Query to extract modality-specific information from infrared and visible features respectively, enabling guided cross-modal aggregation.
- Bidirectional Co-Attention Module (BiCAM): The current frame feature acts as a shared Query, while features from the previous and next frames generate K/V pairs. After computing forward and backward attention, bidirectional temporal dynamics are fused via \(\mathcal{A}_{co} = \text{softmax}(\mathcal{A}_{t-1} * \mathcal{A}_{t+1})\). Analogous to the shifted window mechanism in Swin Transformer, stacking \(N\) BiCAM modules allows the model to indirectly capture long-range temporal context.
- Variational Consistency Loss \(\mathcal{L}_{var}\): Based on the assumption that inter-frame variation in static backgrounds should approach zero, while inter-frame variation in dynamic objects should remain consistent with the source video. The loss constrains inter-frame differences in both the fused video and the restored modality videos to be consistent with the high-quality source video.
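The CmDRM and BiCAM ideas above can be sketched in a few lines. This is not the official implementation: projection layers, the learnable contribution metric, and the channel/spatial refinement are omitted or replaced by stand-ins, and all shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def attn(q, k, v):
    """Plain scaled dot-product attention over token features (B, N, C)."""
    w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

def cmdrm(f_ir, f_vi, alpha=0.5):
    """CmDRM idea: differential features serve as Key/Value, the modality's own
    features as Query; `alpha` stands in for the learnable contribution metric."""
    f_d = f_ir - f_vi                                    # cross-modal differential information
    return f_ir + alpha * attn(f_ir, f_d, f_d)           # differential-enhanced features

def bicam(f_prev, f_cur, f_next):
    """BiCAM idea: the current frame is the shared Query; adjacent frames give
    K/V. Forward/backward attention maps are fused, then used to aggregate."""
    scale = f_cur.shape[-1] ** 0.5
    a_prev = f_cur @ f_prev.transpose(-2, -1) / scale    # backward attention logits
    a_next = f_cur @ f_next.transpose(-2, -1) / scale    # forward attention logits
    a_co = F.softmax(a_prev * a_next, dim=-1)            # fused co-attention map
    return a_co @ (f_prev + f_next)                      # aggregated temporal context

x = torch.randn(1, 64, 32)                               # (B, tokens N, channels C)
print(cmdrm(x, torch.randn_like(x)).shape)               # torch.Size([1, 64, 32])
print(bicam(x, x, x).shape)                              # torch.Size([1, 64, 32])
```

Because each BiCAM layer only sees frames \(t\pm1\), stacking \(N\) such layers grows the temporal receptive field to roughly \(t\pm N\), which is the paper's analogy to shifted windows.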
Loss & Training¶
- \(\mathcal{L}_{int}\): Intensity loss, preserving salient targets from the source video
- \(\mathcal{L}_{grad}\): Gradient loss, preserving textural details
- \(\mathcal{L}_{color}\): CbCr color loss, maintaining color fidelity
- \(\mathcal{L}_{sf}\): Scene fidelity loss, constraining the quality of modal separation reconstruction
- \(\mathcal{L}_{var}\): Variational consistency loss, suppressing temporal flickering
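The variational consistency term can be illustrated with a toy version: the frame-to-frame differences of the fused video should track those of the high-quality source. The paper's exact formulation may differ; this L1 variant is only meant to convey the intuition.

```python
import torch

def variational_consistency_loss(fused, source):
    """fused, source: (B, T, C, H, W) videos. Penalizes mismatch between the
    inter-frame variation of the fusion and that of the source video."""
    d_fused = fused[:, 1:] - fused[:, :-1]     # inter-frame variation of the fused video
    d_src = source[:, 1:] - source[:, :-1]     # inter-frame variation of the source video
    return (d_fused - d_src).abs().mean()      # ~0 for static background, motion-consistent elsewhere

v = torch.randn(1, 7, 1, 32, 32)
print(variational_consistency_loss(v, v).item())  # 0.0 when variations match exactly
```

Note how this encodes the static/dynamic decomposition: in static regions both difference terms are near zero, while for moving objects the fused motion is pulled toward the source motion, which is what suppresses flicker.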
Key Experimental Results¶
| M3SVD (Degraded) | MI↑ | SSIM↑ | VIF↑ | flowD↓ |
|---|---|---|---|---|
| U2Fusion | 2.490 | 0.600 | 0.439 | 6.547 |
| TemCoCo | 3.548 | 0.597 | 0.490 | 4.378 |
| VideoFusion | 4.008 | 0.632 | 0.526 | 3.294 |

| M3SVD (Normal) | MI↑ | SSIM↑ | VIF↑ | flowD↓ |
|---|---|---|---|---|
| TC-MoA | 2.894 | 0.602 | 0.577 | 5.305 |
| TemCoCo | 3.548 | 0.597 | 0.490 | 4.379 |
| VideoFusion | 4.191 | 0.646 | 0.605 | 3.494 |
- Temporal consistency: VideoFusion achieves the lowest flowD (3.294/3.494); frame-by-frame methods such as DDFM/LRRNet exhibit flowD > 6.
- Efficiency: 6.743M parameters, 267.78G FLOPs, 0.067s/frame — comparable to image fusion methods.
- Downstream task: YOLO v11 detects more objects and produces smoother trajectories on VideoFusion outputs.
Ablation Study¶
- Removing BiCAM: flowD increases from 3.294 → 4.747, with significant degradation in temporal consistency.
- Removing CmDRM: information recovery capability declines; MI drops from 4.008 → 3.557.
- Removing CMGF (replaced by simple addition): SSIM drops from 0.632 → 0.366, causing severe structural distortion.
- Removing \(\mathcal{L}_{var}\): flowD increases from 3.294 → 6.056.
- Removing \(\mathcal{L}_{color}\): pronounced color distortion is observed.
Highlights & Insights¶
- The M3SVD dataset is a major contribution: 220 videos / 150K frames / 100 scenes / 4 types of challenging scenarios — an order of magnitude larger than the previously largest dataset HDO (24 videos / 7,500 frames).
- The bidirectional co-attention and stacking design of BiCAM is elegant and effective, requiring neither optical flow nor DCN for inter-frame alignment.
- The static/dynamic decomposition assumption underlying the variational consistency loss is well-motivated.
- The modal separation (unmixing) branch simultaneously yields restored modality outputs — a single model performs both fusion and degradation restoration.
- CmDRM uses differential information rather than raw cross-modal features for attention, effectively reducing redundancy.
Limitations & Future Work¶
- Training is limited to \(T=7\) frames due to GPU memory constraints; larger temporal windows may further improve performance.
- BiCAM only attends to adjacent frames (±1), which may be insufficient for fast-motion scenes — although stacking can indirectly enlarge the temporal receptive field.
- The resolution (640×480) and frame rate (30 fps) of M3SVD are relatively modest.
- Validation is limited to infrared-visible fusion; other modalities such as multispectral and SAR remain unexplored.
- Quantitative metrics (EN/MI/SD/SSIM/VIF) primarily assess pixel-level quality, lacking semantic-level evaluation.
Related Work & Insights¶
- vs. TemCoCo: TemCoCo employs DCN for inter-frame compensation, exhibiting poor generalization on multi-modal data; VideoFusion uses attention mechanisms for adaptive aggregation and outperforms TemCoCo across MI/VIF/flowD.
- vs. RCVS: RCVS relies on handcrafted features for temporal modeling; VideoFusion performs end-to-end learning, achieving superior fusion quality and temporal consistency.
- vs. image fusion methods (SwinFusion/DDFM, etc.): Frame-by-frame application causes temporal flickering (flowD > 5); VideoFusion significantly reduces this to 3.3.
The video-level fusion framework is generalizable to other multi-modal fusion scenarios (e.g., medical image sequences, remote sensing time series). The M3SVD dataset can serve as a unified benchmark for video fusion, registration, and degradation restoration.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic video fusion framework with a large-scale dataset, though individual module designs are relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers two datasets, normal and degraded conditions, comprehensive ablation, efficiency analysis, downstream tasks, and temporal visualization.
- Writing Quality: ⭐⭐⭐⭐ Clear framework description, detailed dataset construction, and rich figures and tables.
- Value: ⭐⭐⭐⭐⭐ M3SVD fills a critical gap; VideoFusion advances the field from image fusion to video fusion with strong practical significance.