
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

Conference: CVPR 2026 · arXiv: 2603.00543 · Code: GitHub · Area: Remote Sensing Image Fusion · Keywords: Remote sensing image fusion, cross-scale generalization, Transformer, rotary position encoding, pansharpening

TL;DR

This paper proposes PanScale, the first cross-scale pansharpening dataset, along with the PanScale-Bench evaluation benchmark, and the ScaleFormer framework — which reinterprets resolution variation as sequence length variation, achieving cross-scale generalization via Scale-Aware Patchify bucketed sampling, decoupled spatial-sequence modeling, and RoPE.

Background & Motivation

Background: Pansharpening fuses high-resolution panchromatic (PAN) images with low-resolution multispectral (LRMS) images to produce high-resolution multispectral (HRMS) images, a core task in remote sensing image processing. CNN/Transformer-based methods (MSDCNN, HFIN, ARConv, etc.) have achieved substantial progress.

Limitations of Prior Work: (i) Computational and memory bottlenecks — scaling from training crop sizes (200–256 px) to inference at 800/1600/2000 px causes Transformer memory to explode, with standard GPUs frequently running out of memory at 800 px; (ii) Patch inference artifacts — forced patch-based inference introduces boundary discontinuities and visible block artifacts; (iii) Weak cross-scale generalization — training at a single low resolution induces scale-induced distribution shift, with brightness distributions shifting noticeably as resolution increases.

Key Challenge: Existing datasets (PanCollection, NBU, PAirMax) provide only limited scale diversity and resolution range, lacking a standardized multi-scale, high-resolution evaluation protocol.

Goal: Systematically address the challenges of cross-scale pansharpening across three dimensions: data, algorithm, and computation.

Key Insight: Reframe resolution variation as sequence length variation — the spatial patch size is fixed, so only the sequence length grows with image resolution (linearly in pixel count).

Core Idea: Introduce a sequence axis via Scale-Aware Patchify, decouple spatial modeling from scale modeling, and leverage RoPE to extrapolate to unseen scales.
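A quick back-of-envelope check of the reformulation (the 50 px patch size below is an illustrative assumption, not a value from the paper): with the patch size fixed, raising the test resolution from 200 px to 2000 px only lengthens the token sequence.

```python
# Sequence length T for a square image of side `res` with a fixed patch
# size. The 50 px patch is an illustrative assumption; the point is that
# the per-token spatial extent never changes, only T does.
patch = 50
for res in (200, 400, 800, 1600, 2000):
    T = (res // patch) ** 2
    print(f"{res:>4} px -> T = {T} tokens")
# 200 px -> T = 16 ... 2000 px -> T = 1600: a 10x side length gives a
# 100x sequence length, with per-token statistics unchanged.
```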

Method

Overall Architecture

ScaleFormer consists of three core components:

  1. Scale-Aware Patchify (SAP): a bucketed window sampling strategy.
  2. Single Transformer module: Spatial Transformer (spatial-domain modeling) + Sequence Transformer (sequence/scale-domain modeling).
  3. Cross Transformer module: Spatial-Cross + Sequence-Cross Transformers for cross-modal feature fusion.

Given input PAN image \(\mathbf{P} \in \mathbb{R}^{H \times W \times 1}\) and upsampled MS image \(\mathbf{L} \in \mathbb{R}^{H \times W \times C}\), SAP converts them into 5D tensors \(\mathbf{P}_{5d} \in \mathbb{R}^{B \times T \times C \times h \times w}\), where \(T\) is the sequence length.
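To make the tokenization concrete, here is a minimal PyTorch sketch of a patch-to-sequence step with bucketed window sampling, assuming non-overlapping square windows; the function name, the bucket set {25, 50, 100}, and the toy shapes are illustrative assumptions, not taken from the paper or its released code.

```python
import random
import torch

def patchify_5d(x: torch.Tensor, win: int) -> torch.Tensor:
    """Reshape (B, C, H, W) into a 5D token tensor (B, T, C, win, win).

    Assumes H and W are divisible by `win`. Each token keeps a fixed
    spatial extent, so higher resolution only lengthens the sequence T.
    """
    B, C, H, W = x.shape
    x = x.unfold(2, win, win).unfold(3, win, win)   # (B, C, H/win, W/win, win, win)
    x = x.permute(0, 2, 3, 1, 4, 5).contiguous()    # (B, H/win, W/win, C, win, win)
    return x.view(B, -1, C, win, win)               # (B, T, C, win, win)

# Bucketed sampling during training: a random bucket picks the window size,
# exposing the model to many effective sequence lengths (bucket set assumed).
window_sizes = [25, 50, 100]
pan = torch.randn(2, 1, 200, 200)
win = random.choice(window_sizes)
tokens = patchify_5d(pan, win)
print(tokens.shape)   # e.g. torch.Size([2, 16, 1, 50, 50]) for win=50
```

At inference the window size stays fixed, so only `T` grows with resolution.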

Key Designs

  1. Scale-Aware Patchify (SAP): During training, a bucket index \(t\) is randomly sampled to determine window size \(w(t)\); a Patch-to-Sequence Tokenizer partitions the input into token sequences of varying lengths, exposing the model to a range of effective sequence lengths. At inference, a fixed window size is used, and higher resolutions are handled solely by extending the sequence. The key effect is to prevent mean and variance drift, stabilizing per-token statistics.

  2. Decoupled Spatial-Sequence Modeling: The Spatial Transformer models intra-patch spatial relationships: \(\mathbf{f}_{i,1} = \mathbf{f}_i + SA_{spa}(LN(\mathbf{f}_i))\). The Sequence Transformer models cross-patch correlations along the sequence dimension: \(\mathbf{f}_{i+1,1} = \mathbf{f}_{i+1} + SA_{seq}(LN(\mathbf{f}_{i+1}))\), where \(SA_{seq}\) merges the batch and spatial dimensions during attention and injects RoPE to encode continuous relative position information, enhancing scale extrapolation (a shape-level sketch follows this list).

  3. Cross Transformer module: A similar architecture using cross-attention to enable PAN–MS cross-modal interaction: \(\mathbf{f}_{i,1}^{ms} = \mathbf{f}_i^{ms} + CA_{spa}(LN(\mathbf{f}_i^{ms}), LN(\mathbf{f}^{pan}))\)
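The decoupling amounts to two reshapes around otherwise standard attention. The sketch below uses PyTorch's stock multi-head attention to show the shape bookkeeping only; the module name and hyperparameters are assumptions, and RoPE (which would rotate queries/keys in the sequence branch) is omitted for brevity.

```python
import torch
import torch.nn as nn

class DecoupledBlock(nn.Module):
    """Spatial attention inside each patch, then sequence attention across
    patches. Names and hyperparameters are illustrative, not the paper's."""

    def __init__(self, dim: int = 32, heads: int = 4):
        super().__init__()
        self.norm_spa = nn.LayerNorm(dim)
        self.norm_seq = nn.LayerNorm(dim)
        self.sa_spa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sa_seq = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, T, N, C), with T patches and N = h*w positions per patch.
        B, T, N, C = f.shape

        # Spatial Transformer: fold T into the batch, attend over the N
        # intra-patch positions.
        x = f.reshape(B * T, N, C)
        y = self.norm_spa(x)
        x = x + self.sa_spa(y, y, y, need_weights=False)[0]

        # Sequence Transformer: merge batch and spatial dims, attend over T.
        # RoPE would rotate queries/keys here to encode continuous relative
        # positions along the sequence axis; omitted for brevity.
        x = x.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        y = self.norm_seq(x)
        x = x + self.sa_seq(y, y, y, need_weights=False)[0]

        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)   # (B, T, N, C)

blk = DecoupledBlock()
out = blk(torch.randn(2, 16, 25, 32))   # B=2, T=16 patches, N=5*5, C=32
```

The cross-modal variant swaps self-attention for cross-attention with the same reshapes, taking queries from MS features and keys/values from PAN features.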

Loss & Training

An L1 loss \(\mathcal{L} = \|\mathbf{H}_{out} - \mathbf{G}\|_1\) between the network output \(\mathbf{H}_{out}\) and the ground truth \(\mathbf{G}\) is used. The model is trained with the Adam optimizer (initial learning rate \(5 \times 10^{-4}\), cosine-annealed to \(5 \times 10^{-8}\)) for 500 epochs on an NVIDIA RTX 3090, with a base channel width of 32.
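The recipe maps directly onto stock PyTorch components. A minimal, runnable sketch with placeholder model and data (ScaleFormer itself and the PanScale loader are not reproduced here):

```python
import torch
import torch.nn.functional as F

# Placeholders so the sketch runs standalone: a 1x1 conv stands in for the
# fusion model, and one random batch stands in for the PanScale loader.
model = torch.nn.Conv2d(5, 4, kernel_size=1)          # PAN(1)+MS(4) in, 4 out
loader = [(torch.randn(2, 1, 64, 64),                 # PAN
           torch.randn(2, 4, 64, 64),                 # upsampled MS
           torch.randn(2, 4, 64, 64))]                # ground truth G

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500, eta_min=5e-8)   # cosine decay to 5e-8 over 500 epochs

for epoch in range(500):
    for pan, lrms, gt in loader:
        out = model(torch.cat([pan, lrms], dim=1))
        loss = F.l1_loss(out, gt)          # L = ||H_out - G||_1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```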

Key Experimental Results

Main Results: Averages Across the Three PanScale Subsets (PSNR in dB / SSIM)

Method        Jilin          Landsat        Skysat
HFIN          38.00/0.9698   40.21/0.9666   43.96/0.9658
ARConv        38.23/0.9697   39.66/0.9638   43.40/0.9797
Pan-Mamba     35.55/0.9480   36.73/0.9206   41.39/0.9493
ScaleFormer   39.29/0.9761   41.04/0.9711   44.65/0.9827

ScaleFormer outperforms all SOTA methods across all datasets, with stable performance as resolution increases.

Ablation Study: Landsat Dataset (PSNR, dB)

Configuration   200 px   400 px   800 px   1600 px
w/o RoPE        40.46    40.95    40.76    40.69
SeqT→SpaT       40.91    41.30    40.72    40.51
w/o SAP         40.53    40.93    40.62    40.39
Full Model      40.61    41.37    41.13    41.03

All ablated variants exhibit noticeable performance degradation at higher resolutions, confirming that each component is indispensable for cross-scale generalization.

Key Findings

  • ScaleFormer has only 0.52M parameters (1/4 of HFIN, 1/9 of ARConv), with a significant computational efficiency advantage.
  • GFLOPs and memory usage of ScaleFormer grow substantially more slowly than those of HFIN/ARConv as resolution increases.
  • ARConv exhibits severe block artifacts under patch-based inference (significant drop in DDC-IoU).
  • ScaleFormer remains competitive in full-resolution real-world scene evaluation (without ground truth).

Highlights & Insights

  • Elegant problem reformulation: Recasting resolution generalization as sequence length generalization draws on sequence modeling ideas from NLP and video models.
  • Outstanding computational efficiency: Substantially fewer parameters and GFLOPs than SOTA methods, with advantages widening at higher resolutions.
  • Dataset contribution: PanScale is the first cross-scale pansharpening dataset covering three satellite platforms (0.5–15 m resolution).
  • Novel application of RoPE: Adapts RoPE from text/video domains to remote sensing fusion tasks for scale extrapolation.

Limitations & Future Work

  • The approach focuses solely on pansharpening; generalization to other remote sensing fusion tasks (hyperspectral fusion, SAR–optical fusion) remains unvalidated.
  • The bucketing strategy in SAP uses a predefined fixed set of window sizes; an adaptive strategy may be more effective.
  • Only L1 loss is employed; perceptual losses or GAN losses may further improve visual quality.
  • The self-attention in the Sequence Transformer remains \(O(T^2)\), which may become a bottleneck for extremely large-scale inputs.

Compared Methods & Context

  • Traditional methods (GS, IHS, GFPCA) perform poorly in cross-scale settings (PSNR over 10 dB lower).
  • CNN-based methods (MSDCNN, SFINet, MSDDN) offer limited cross-scale generalization.
  • HFIN/ARConv represent the current SOTA but suffer from severe memory and computational bottlenecks.
  • Pan-Mamba adopts the Mamba architecture but underperforms Transformer-based approaches.
  • FlexiViT's multi-resolution training and the bucketed training strategies used in video generation inspired the design of SAP.

PanScale Dataset Details

  • Three sub-datasets: Jilin (Jilin-1 satellite, 0.5–1 m resolution), Landsat (Landsat-8, 15 m resolution), and Skysat (Planet SkySat, ~1 m resolution).
  • Test set design: Each sub-dataset includes both reduced-resolution (200×200 to 2000×2000) and full-resolution multi-scale test sets.
  • Data source: Acquired and preprocessed via Google Earth Engine (GEE).
  • Evaluation metrics: PanScale-Bench integrates reference metrics (PSNR/SSIM/ERGAS/Q) and no-reference metrics (\(D_\lambda\)/\(D_S\)/QNR).
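For reference, here are minimal NumPy versions of two of the listed reference metrics. They follow the textbook definitions rather than PanScale-Bench's exact implementation, and the scale ratio of 4 in ERGAS is an assumed typical pansharpening setting, not a value from the paper.

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ergas(pred: np.ndarray, gt: np.ndarray, ratio: int = 4) -> float:
    """ERGAS (Wald's relative dimensionless global error), bands last.

    `ratio` is the PAN/MS resolution ratio; 4 is a common choice and an
    assumption here. Lower is better.
    """
    rmse = np.sqrt(np.mean((pred - gt) ** 2, axis=(0, 1)))   # per-band RMSE
    means = np.mean(gt, axis=(0, 1))                         # per-band mean
    return 100.0 / ratio * np.sqrt(np.mean((rmse / means) ** 2))
```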

Efficiency Comparison

Method        Parameters (M)   GFLOPs
ARConv        4.4147           38.32
HFIN          1.9836           46.21
ScaleFormer   0.5151           20.57

Rating

  • Novelty: ⭐⭐⭐⭐ — The resolution-to-sequence-length reformulation is original; the SAP + RoPE combination is effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage across three datasets, multiple scales, full-resolution evaluation, ablation studies, efficiency analysis, and visualization.
  • Writing Quality: ⭐⭐⭐⭐ — Excellent figure and table design; Fig. 1/2 clearly illustrate the problem and compare solutions.
  • Value: ⭐⭐⭐⭐⭐ — A unified contribution of dataset, benchmark, and method, advancing the remote sensing fusion field.