PAN-Crafter: Learning Modality-Consistent Alignment for Pan-Sharpening

Conference: ICCV 2025
arXiv: 2505.23367
Code: https://kaist-viclab.github.io/PAN-Crafter_site
Area: Remote Sensing / Pan-Sharpening
Keywords: Pan-sharpening, cross-modal alignment, modality-adaptive reconstruction, attention mechanism, remote sensing image fusion

TL;DR

PAN-Crafter proposes a modality-consistent alignment framework that explicitly addresses cross-modal misregistration between PAN and MS images via Modality-Adaptive Reconstruction (MARs) and Cross-Modal Misalignment-aware Multi-scale Attention (CM3A), achieving state-of-the-art performance on multiple remote sensing benchmarks while running 1110× faster than diffusion-based methods.

Background & Motivation

Pan-sharpening in remote sensing aims to fuse high-resolution panchromatic (PAN, single-channel) images with low-resolution multispectral (MS, multi-channel) images to produce high-resolution multispectral (HRMS) output—a core requirement in practical satellite image processing.

Core Pain Points: Due to differences in sensor position, acquisition time offsets, and resolution mismatches, cross-modal spatial misregistration exists between PAN and MS images. However:

Existing methods assume perfect alignment: Most deep learning methods employ pixel-wise losses (\(\ell_1\)/\(\ell_2\)), which cause spectral distortion, ghosting artifacts, and blurring in the presence of misregistration.

Adaptive correction lacks flexibility: SIPSA relies on fixed-scale alignment; LAGConv/CANConv use self-similarity aggregation rather than explicit geometric alignment.

Diffusion models yield high quality but are too slow: PanDiff and TMDiff have inference times of 2.955s and 9.997s respectively, making deployment impractical.

Key Insight: Design a bidirectional alignment mechanism—aligning PAN structure to MS texture (for HRMS reconstruction) and, conversely, aligning MS texture to PAN structure (via PAN back-reconstruction as auxiliary self-supervision)—forming a "modality-consistent" constraint.

Method

Overall Architecture

PAN-Crafter adopts a U-Net encoder-decoder architecture built around two core modules:

  1. MARs (Modality-Adaptive Reconstruction): a single network jointly learns to reconstruct HRMS and to back-reconstruct PAN.
  2. CM3A (Cross-Modal Misalignment-aware Multi-scale Attention): performs multi-scale bidirectional alignment of PAN and MS features.

Key Designs

1. Modality-Adaptive Reconstruction (MARs)

MARs enables a single network to operate alternately in two modes:

  • MS mode: Input PAN + LRMS → Output HRMS: \(\hat{\mathbf{I}}_{\text{ms}}^{\text{hr}} = \mathcal{P}_\theta(\mathbf{I}_{\text{pan}}, \mathbf{I}_{\text{ms}}^{\text{lr}}; \text{mode}=\mathsf{MS}) + \mathbf{I}_{\text{ms}}^{\text{lr}}\)

  • PAN mode: Input PAN + LRMS → Back-reconstruct PAN (replicated to multi-channel): \(\hat{\mathbf{I}}_{\text{pan}}^{\text{rep}} = \mathcal{P}_\theta(\mathbf{I}_{\text{pan}}, \mathbf{I}_{\text{ms}}^{\text{lr}}; \text{mode}=\mathsf{PAN}) + \mathbf{I}_{\text{pan}}^{\text{lr,rep}}\)

Design Motivation: PAN back-reconstruction serves as an auxiliary self-supervised signal. Its key advantage is that the PAN image itself constitutes a readily available ground truth—no additional annotation is required. Compelled to reconstruct a sharp PAN image, the network must internalize fine spatial structures, which in turn improves the spatial detail quality of the HRMS output.

During training, each batch is duplicated: one copy follows MS mode and the other follows PAN mode. Only MS mode is used at inference.

MARs loss: \(\mathcal{L}_{\text{MARs}} = \|\hat{\mathbf{I}}_{\text{ms}}^{\text{hr}} - \mathbf{I}_{\text{ms}}^{\text{hr}}\|_1 + \lambda \|\hat{\mathbf{I}}_{\text{pan}}^{\text{rep}} - \mathbf{I}_{\text{pan}}^{\text{rep}}\|_1\)
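As a rough sketch of the dual-mode objective above (not the authors' code), the following uses a placeholder `net` standing in for the shared network \(\mathcal{P}_\theta\); all names and shapes are illustrative:

```python
import numpy as np

def net(pan, lrms_up, mode):
    # Stand-in for the shared U-Net, conditioned on mode in {"MS", "PAN"}.
    # A real model predicts a residual; the sketch returns zeros.
    return np.zeros_like(lrms_up)

def mars_loss(pan, pan_lr, lrms_up, hrms_gt, lam=1.0):
    """MARs objective: L1 on HRMS (MS mode) plus a weighted L1 on the
    PAN back-reconstruction (PAN mode), with PAN replicated to C channels."""
    C = lrms_up.shape[0]
    # MS mode: residual prediction added to the upsampled LRMS
    hrms_pred = net(pan, lrms_up, "MS") + lrms_up
    # PAN mode: residual added to the low-res PAN replica; target is the
    # full-resolution PAN replicated across the MS channels
    pan_rep = np.repeat(pan, C, axis=0)
    pan_lr_rep = np.repeat(pan_lr, C, axis=0)
    pan_pred = net(pan, lrms_up, "PAN") + pan_lr_rep
    return (np.abs(hrms_pred - hrms_gt).mean()
            + lam * np.abs(pan_pred - pan_rep).mean())
```

In training, the two modes run on duplicated copies of the same batch; only the MS-mode branch is executed at inference.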

2. Cross-Modal Misalignment-aware Multi-scale Attention (CM3A)

CM3A is the other core contribution of PAN-Crafter, performing bidirectional cross-modal alignment at multi-scale feature levels.

Key Design: Local attention (\(k \times k\) windows, \(k=3\)) replaces global attention, since PAN–MS pairs are typically already roughly pre-aligned and only local correction is needed. This reduces computational complexity from \(O(2(HW)^2 C)\) to \(O(2(HW)k^2 C)\).

In MS mode, queries are constructed from LRMS features, and two attention operations are performed simultaneously:

  • Self-attention: the query interacts with MS key-value pairs to maintain MS feature consistency.
  • Alignment attention: the query interacts with PAN key-value pairs to incorporate PAN structural information.

\[\mathbf{x}_{\text{ms}} = \text{LocalAttn}(\mathbf{Q}, \mathbf{K}_{\text{ms}}, \mathbf{V}_{\text{ms}}), \qquad \mathbf{x}_{\text{pan}} = \text{LocalAttn}(\mathbf{Q}, \mathbf{K}_{\text{pan}}, \mathbf{V}_{\text{pan}})\]

In PAN mode, the operations are mirrored—queries are constructed from PAN features to ensure structural consistency in PAN back-reconstruction.

Novelty: Instead of conventional fixed positional encodings, downsampled raw images are concatenated into the queries and keys, allowing the network to implicitly learn the relative displacement between the two modalities.
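A toy NumPy sketch of the local-window attention and the dual (self + alignment) attention in MS mode; the window size, additive fusion, and all names are simplifying assumptions, not the authors' implementation:

```python
import numpy as np

def local_attn(q, k, v, win=3):
    """Local window attention: each query position attends only to a
    win x win neighbourhood of the key/value map, i.e. O(HW * win^2 * C)
    instead of the O((HW)^2 * C) cost of global attention."""
    H, W, C = q.shape
    p = win // 2
    kp = np.pad(k, ((p, p), (p, p), (0, 0)))  # zero-pad borders
    vp = np.pad(v, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(q)
    for i in range(H):
        for j in range(W):
            keys = kp[i:i + win, j:j + win].reshape(-1, C)  # (win^2, C)
            vals = vp[i:i + win, j:j + win].reshape(-1, C)
            logits = keys @ q[i, j] / np.sqrt(C)
            w = np.exp(logits - logits.max())
            w /= w.sum()                                    # softmax weights
            out[i, j] = w @ vals
    return out

def cm3a_ms_mode(q_ms, f_ms, f_pan):
    """MS mode: one query attends to MS (self) and PAN (alignment) K/V."""
    x_ms = local_attn(q_ms, f_ms, f_ms)     # self-attention
    x_pan = local_attn(q_ms, f_pan, f_pan)  # alignment attention
    return x_ms + x_pan                     # additive fusion (illustrative)
```

PAN mode would mirror this with queries built from PAN features.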

3. Modality Modulation

Within each ResBlock, learnable parameters \(\gamma, \beta\) modulate features according to the MARs mode:

\[\mathsf{Modulate}(\mathbf{x}; \mathsf{MS}): \mathbf{x} \leftarrow (1 + \gamma_{\text{ms}}) \odot \mathbf{x} + \beta_{\text{ms}}\]

This ensures that the shared network can adapt to the feature distributions of different modalities.
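A minimal FiLM-style sketch of this modulation, assuming one learnable \((\gamma, \beta)\) pair per mode (class and attribute names are illustrative):

```python
import numpy as np

class ModalityModulation:
    """Per-mode channel-wise feature modulation: (1 + gamma) * x + beta."""

    def __init__(self, channels):
        # Zero-initialised (gamma, beta) per mode -> identity at init
        self.params = {m: (np.zeros(channels), np.zeros(channels))
                       for m in ("MS", "PAN")}

    def __call__(self, x, mode):
        gamma, beta = self.params[mode]
        # x: (C, H, W); broadcast gamma/beta over the spatial dimensions
        return (1 + gamma)[:, None, None] * x + beta[:, None, None]
```

Zero-initialising \(\gamma, \beta\) makes the modulation start as the identity, so each mode only diverges from the shared features as training demands.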

Loss & Training

  • AdamW optimizer, initial learning rate \(1 \times 10^{-4}\), cosine annealing
  • Batch size 48 (effective batch size 96 due to MARs duplication)
  • 50K training iterations with 100-step warmup
  • \(\lambda = 1.0\), feature dimension \(C = 128\)
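The learning-rate schedule above can be sketched as a simple function (a minimal sketch; the linear warmup shape and annealing to zero are assumptions, not stated in the paper):

```python
import math

def lr_at(step, base_lr=1e-4, warmup=100, total=50_000):
    """Cosine annealing from base_lr to 0 over `total` steps,
    with a linear warmup over the first `warmup` steps."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / (total - warmup)  # progress in [0, 1]
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))
```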

Key Experimental Results

Main Results

Performance comparison on the WV3 dataset:

| Method | HQNR↑ | ERGAS↓ | PSNR↑ | Inference Time↓ (s) | Memory↓ (GB) |
|---|---|---|---|---|---|
| CANConv (CVPR24) | 0.951 | 2.163 | 37.441 | 0.451 | 2.713 |
| PanDiff (TGRS23) | 0.952 | 2.276 | 37.029 | 2.955 | 2.328 |
| TMDiff (TGRS24) | 0.924 | 2.151 | 37.477 | 9.997 | 9.910 |
| PAN-Crafter | 0.958 | 2.040 | 37.956 | 0.009 | 1.711 |

PAN-Crafter is 1110× faster than TMDiff and 50× faster than CANConv, while achieving the best results on all core metrics.

GF2 dataset: PSNR reaches 45.076, nearly 2 dB higher than the second-best CANConv (43.166).

Ablation Study

| CM3A | MARs | HQNR↑ | ERGAS↓ | PSNR↑ | Time (s) |
|---|---|---|---|---|---|
| ✗ | ✗ | 0.948 | 2.232 | 37.245 | 0.006 |
| ✗ | ✓ | 0.956 | 2.122 | 37.602 | 0.009 |
| ✓ | ✗ | 0.949 | 2.212 | 37.285 | 0.007 |
| ✓ | ✓ | 0.958 | 2.040 | 37.956 | 0.009 |

Key findings:

  • MARs contributes more than CM3A in isolation (PSNR: +0.357 dB vs. +0.040 dB).
  • Combining both yields a synergistic effect: the combined gain (+0.711 dB) substantially exceeds the sum of the individual gains (+0.397 dB).

Key Findings

  1. Zero-shot generalization: On unseen WV2 satellite data, PAN-Crafter achieves an HQNR of 0.942, substantially outperforming competing methods.
  2. Self-supervised effect of MARs: PAN back-reconstruction as an auxiliary task significantly improves spatial sharpness on the primary task.
  3. CM3A + MARs synergy: MARs enables bidirectional interaction, upon which CM3A achieves more precise alignment.
  4. Efficiency: The entire framework has only 7.17M parameters and 79.03G FLOPs.

Highlights & Insights

  • PAN back-reconstruction as self-supervision: The method cleverly leverages the PAN image itself as a free supervisory signal, requiring no additional annotation.
  • Local attention replacing positional encoding: Using raw image features instead of conventional positional encodings is better suited to handling uncertain spatial misregistration.
  • High practical value: The method comprehensively outperforms diffusion-based models while being three orders of magnitude faster, making it well-suited for real-world remote sensing processing pipelines.

Limitations & Future Work

  • Inter-band misregistration among multispectral channels is not explicitly addressed; the current work only handles misregistration between PAN and MS.
  • Depthwise separable convolutions for inter-band alignment are identified as a potential improvement direction (noted in the paper).
  • The local attention window size \(k=3\) may be insufficient for large misregistrations.
Related Work Context

  • CANConv [CVPR 2024]: Clustering-based spatially adaptive convolution, but its k-means dependency results in slow inference.
  • SIPSA [2022]: First work to identify misregistration as a critical challenge in pan-sharpening.
  • PanDiff / TMDiff: Diffusion-based methods with high output quality but prohibitive inference cost.
  • Insight: The dual-task self-supervision paradigm (primary task + inverse task) is generalizable to other cross-modal fusion problems.

Rating

  • Novelty: ⭐⭐⭐⭐ — The MARs bidirectional reconstruction and CM3A alignment mechanism are novel in design.
  • Technical Depth: ⭐⭐⭐⭐ — Modality modulation, local attention as a PE replacement, and other design details are carefully considered.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four satellite datasets, zero-shot generalization, and efficiency comparisons.
  • Practicality: ⭐⭐⭐⭐⭐ — Fast inference, low memory footprint, and strong performance make it highly suitable for real-world deployment.