
CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening

Conference: AAAI 2026 arXiv: 2511.10896 Code: Jiabo-Liu/CLIPPan Area: Model Compression Keywords: Pansharpening, CLIP, Unsupervised, Vision-Language Model, Remote Sensing

TL;DR

This paper proposes CLIPPan, which first fine-tunes CLIP in a parameter-efficient manner so that it recognizes multispectral (MS), panchromatic (PAN), and high-resolution multispectral (HRMS) image types and understands the pansharpening process, and then uses text prompts encoding Wald's protocol as semantic supervision to enable full-resolution unsupervised pansharpening without ground truth. CLIPPan operates as a plug-and-play module compatible with arbitrary pansharpening backbone networks.

Background & Motivation

Pansharpening fuses spectrally rich multispectral (MS) images with spatially high-resolution panchromatic (PAN) images to produce high-resolution multispectral (HRMS) images, which is critical for remote sensing applications such as urban planning and environmental monitoring.

Core Problem: Existing deep learning methods rely heavily on ground truth (GT) for supervised training, yet GT is unavailable in real full-resolution scenarios. The common practice of training on downsampled simulated data introduces a severe domain gap when applied to real full-resolution images. Unsupervised methods avoid GT dependency but only exploit low-level pixel relationships between the fusion output and inputs, lacking high-level semantic guidance for the fusion objective.

Core Insight: If the model can be informed of "what the fusion target rules are" (e.g., Wald's protocol), high-level semantic supervision can constrain the fusion output to reside in the HRMS domain. CLIP's image-text alignment capability can naturally translate textually described fusion rules into supervision signals.

Method

Overall Architecture

CLIPPan proceeds in two stages:

Stage I — Vision-Language Alignment: Fine-tunes CLIP to (i) recognize LRMS/PAN/HRMS image types, (ii) understand remote sensing image content, and (iii) understand the pansharpening process (the MS+PAN→HRMS mapping).

Stage II — Language-Guided Unsupervised Pansharpening: Uses the fine-tuned CLIP as a fixed semantic supervisor, combined with low-level visual constraints to train the pansharpening network.

Key Designs

1. Parameter-Efficient Fine-Tuning Strategy

To preserve CLIP's strong generalization capability, only 6 lightweight adapter modules are inserted (3 on the visual side + 3 on the text side). Since CLIP's visual encoder is incompatible with the multi-band input of multispectral images, the original RGB input layer is replaced with a learnable convolutional layer.
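
A minimal PyTorch sketch of these two changes is shown below; it assumes the OpenAI CLIP ViT layout (where `visual.conv1` is the RGB patch-embedding layer) and standard bottleneck adapters, and is illustrative rather than the authors' implementation.

```python
# Sketch (not the released code): a bottleneck adapter and a multi-band input conv,
# assuming the OpenAI CLIP ViT layout where `visual.conv1` embeds 3-channel patches.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight residual adapter: down-project -> GELU -> up-project."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)  # start as identity so frozen CLIP is unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def replace_input_conv(visual: nn.Module, num_bands: int, width: int, patch_size: int) -> nn.Module:
    """Hypothetical helper: swap CLIP's RGB patch embedding for a learnable conv
    that accepts `num_bands` spectral channels; `width` must match the ViT width."""
    visual.conv1 = nn.Conv2d(num_bands, width, kernel_size=patch_size,
                             stride=patch_size, bias=False)
    return visual
```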

2. Inter-Modal Contrastive Learning (InterMCL)

Binds each image type to its corresponding semantic space. Fixed text descriptions (rather than content-dependent ones) are used:

  • MS: "a multispectral image"
  • PAN: "a panchromatic image"
  • HRMS: "High-quality reference image adhering to Wald's protocol: spectrally consistent with original data and spatially sharp"
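
For illustration, the fixed prompts can be encoded once with a stock CLIP text encoder. The sketch below uses the OpenAI CLIP API and an illustrative backbone choice; CLIPPan additionally passes these features through its text-side adapters, which are omitted here.

```python
# Hedged illustration (not the released code) of encoding the three fixed prompts.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

prompts = {
    "MS":   "a multispectral image",
    "PAN":  "a panchromatic image",
    "HRMS": ("High-quality reference image adhering to Wald's protocol: "
             "spectrally consistent with original data and spatially sharp"),
}

model, _ = clip.load("ViT-B/32", device="cpu")        # backbone choice is illustrative
with torch.no_grad():
    tokens = clip.tokenize(list(prompts.values()))    # (3, 77) token ids
    text_feats = model.encode_text(tokens)            # (3, 512) prompt features
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
```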

A contrastive loss pulls image-text positive pairs together and pushes negative pairs apart:

\[\mathcal{L}_{\text{inter}} = \frac{1}{3}\sum_{M1,M2}\mathcal{L}_{\text{align}}(F^I_{M1}, F^T_{M2})\]

Since real HRMS references are unavailable at full resolution, approximate HRMS images are generated online with the traditional BDSD algorithm.
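
A hedged sketch of one plausible reading of \(\mathcal{L}_{\text{inter}}\): each image is contrasted against the three fixed prompt features, with its own type as the positive and the other two as negatives (the paper's exact \(\mathcal{L}_{\text{align}}\) may differ).

```python
# One plausible implementation of L_inter (an assumption, not the paper's exact L_align).
import torch
import torch.nn.functional as F

def inter_mcl(img_feats: dict, txt_proto: torch.Tensor, temperature: float = 0.07):
    """img_feats maps "MS"/"PAN"/"HRMS" to (B, D) image features; txt_proto is the
    (3, D) stack of the fixed prompt features in the same order."""
    types = ["MS", "PAN", "HRMS"]
    txt = F.normalize(txt_proto, dim=-1)
    loss = 0.0
    for label, m in enumerate(types):
        logits = F.normalize(img_feats[m], dim=-1) @ txt.t() / temperature  # (B, 3)
        targets = torch.full((img_feats[m].size(0),), label,
                             dtype=torch.long, device=logits.device)
        loss = loss + F.cross_entropy(logits, targets)  # own prompt = positive
    return loss / len(types)
```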

3. Intra-Modal Contrastive Learning (IntraMCL)

Prevents feature collapse caused by using fixed text descriptions. LRMS/PAN/HRMS images from the same scene are treated as positive samples, while those from different scenes serve as negatives:

\[\mathcal{L}_{\text{intra}} = -\frac{1}{3N}\sum_{i=1}^{3N}\log\frac{\exp(\langle F^{I(i)}_{M1}, F^{I(i)}_{M2}\rangle/\tau_i)}{\sum_{k=1}^{3N}\exp(\langle F^{I(i)}_{M1}, F^{I(k)}_{M2}\rangle/\tau_i)}\]

This ensures feature diversity while promoting domain transfer from natural images to remote sensing images.
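
A simplified PyTorch reading of \(\mathcal{L}_{\text{intra}}\), treating same-scene features across the three image types as positives and all other scenes in the batch as negatives (batch-local negatives and a single temperature are assumptions):

```python
# Simplified IntraMCL sketch: same-scene features across types are positives,
# other scenes in the batch are negatives (single temperature assumed).
import torch
import torch.nn.functional as F

def intra_mcl(f_ms: torch.Tensor, f_pan: torch.Tensor, f_hrms: torch.Tensor,
              temperature: float = 0.07):
    """Each input is (N, D); row i of every tensor comes from the same scene i."""
    feats = [F.normalize(f, dim=-1) for f in (f_ms, f_pan, f_hrms)]
    targets = torch.arange(feats[0].size(0), device=feats[0].device)
    loss, n_pairs = 0.0, 0
    for a in range(3):
        for b in range(3):
            if a == b:
                continue
            logits = feats[a] @ feats[b].t() / temperature  # (N, N); diagonal = positives
            loss = loss + F.cross_entropy(logits, targets)
            n_pairs += 1
    return loss / n_pairs
```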

4. Fusion-Aware Alignment

Image Fusion Adapters (IFA) and Text Fusion Adapters (TFA) are introduced to learn fusion feature generation from MS+PAN features, aligning with HRMS/Wald's protocol features:

\[\mathcal{L}_{\text{fusion}} = \|F^T_{\text{fuse}} - F^T_{\text{wald}}\|_1 + \|F^I_{\text{fuse}} - F^I_{\text{HRMS}}\|_1\]
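
Transcribed into PyTorch below; the mean-reduced \(\ell_1\) is an assumption for readability, and the actual reduction in the paper's code may differ.

```python
# L_fusion transcribed with a mean-reduced l1 (reduction is an assumption).
import torch

def fusion_alignment_loss(f_txt_fuse, f_txt_wald, f_img_fuse, f_img_hrms):
    """Pull the adapter-generated fusion features toward the Wald-protocol text
    feature (text side) and the HRMS image feature (image side)."""
    return (torch.abs(f_txt_fuse - f_txt_wald).mean()
            + torch.abs(f_img_fuse - f_img_hrms).mean())
```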

5. Direction Vector Semantic Supervision (Core of Stage II)

Directly applying element-wise loss with Wald's protocol text features is infeasible since they are fixed across all images. Instead, the method exploits consistency of feature displacement directions:

\[\mathcal{L}_d = 1 - \frac{1}{2}(\langle \Delta\mathbf{V}^I_{\text{MS}}, \Delta\mathbf{V}^T_{\text{MS}}\rangle + \langle \Delta\mathbf{V}^I_{\text{PAN}}, \Delta\mathbf{V}^T_{\text{PAN}}\rangle)\]

where \(\Delta\mathbf{V}^I_{\text{MS}} = F^I_{\text{out}} - F^I_{\text{MS}}\) is the fusion displacement direction in image space, and \(\Delta\mathbf{V}^T_{\text{MS}} = F^T_{\text{wald}} - F^T_{\text{MS}}\) is the fusion target direction in text space. By penalizing angular deviation between directions in both spaces, the output is semantically guided toward the HRMS domain.
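
A sketch of \(\mathcal{L}_d\), reading \(\langle\cdot,\cdot\rangle\) as cosine similarity over (B, D) CLIP features (an assumption consistent with "penalizing angular deviation"):

```python
# Direction-vector loss sketch; <.,.> is read as cosine similarity over (B, D) features.
# Fixed text features (the Wald / MS / PAN prompts) are assumed expanded to the batch.
import torch
import torch.nn.functional as F

def direction_loss(f_out, f_ms_img, f_pan_img, f_wald_txt, f_ms_txt, f_pan_txt):
    """The image-space displacement (output - input) should point the same way as the
    text-space target direction (Wald prompt - input-type prompt)."""
    def mean_cos(a, b):
        return F.cosine_similarity(a, b, dim=-1).mean()
    d_img_ms,  d_txt_ms  = f_out - f_ms_img,  f_wald_txt - f_ms_txt
    d_img_pan, d_txt_pan = f_out - f_pan_img, f_wald_txt - f_pan_txt
    return 1.0 - 0.5 * (mean_cos(d_img_ms, d_txt_ms) + mean_cos(d_img_pan, d_txt_pan))
```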

Loss & Training

Stage I: \(\mathcal{L}_{s1} = \mathcal{L}_{\text{inter}} + \mathcal{L}_{\text{intra}} + \mathcal{L}_{\text{fusion}}\)

Stage II Low-Level Visual Constraints (where \(\downarrow(\cdot)\) downsamples the output to the MS resolution and \(\phi(\cdot)\) projects it into the PAN domain, e.g., a single-band intensity):

  • Spectral fidelity: \(\mathcal{L}_{\text{spec}} = \|\downarrow(\mathbf{I}_{\text{out}}) - \mathbf{I}_{\text{MS}}\|_2^2 + (1 - \text{SSIM}(\downarrow(\mathbf{I}_{\text{out}}), \mathbf{I}_{\text{MS}}))\)
  • Spatial sharpness: \(\mathcal{L}_{\text{spat}} = \|\phi(\mathbf{I}_{\text{out}}) - \mathbf{I}_{\text{PAN}}\|_2^2 + (1 - \text{SSIM}(\phi(\mathbf{I}_{\text{out}}), \mathbf{I}_{\text{PAN}}))\)
  • QNR trade-off: \(\mathcal{L}_{\text{QNR}}\), which pushes the no-reference index \(\text{QNR} = (1-D_\lambda)(1-D_s)\) toward 1
  • Pseudo-supervision: \(\mathcal{L}_{\text{ship}}\), using the output of a SHIP network trained at reduced resolution as a reference

Total Loss: \(\mathcal{L}_{s2} = \mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}} + \mathcal{L}_{\text{QNR}} + \mathcal{L}_{\text{ship}} + \mathcal{L}_d\)
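
The sketch below assembles the Stage II objective for one training step. The downsampling operator, the PAN projection \(\phi\), the SSIM implementation, and the QNR term are simplified stand-ins (bicubic resizing, band-mean intensity, `pytorch_msssim`, and an externally computed value such as \(1-\text{QNR}\)), not the paper's exact components.

```python
# Stage II objective sketch with simplified stand-ins (see lead-in above).
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # pip install pytorch-msssim

def stage2_loss(out, ms, pan, ship_ref, qnr_term, l_direction, scale: int = 4):
    """out: (B, C, H, W) fused output; ms: (B, C, H/scale, W/scale); pan: (B, 1, H, W);
    ship_ref: SHIP pseudo-label at full resolution; l_direction: semantic loss L_d."""
    out_lr = F.interpolate(out, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
    intensity = out.mean(dim=1, keepdim=True)  # crude stand-in for phi(.)
    l_spec = F.mse_loss(out_lr, ms) + (1 - ssim(out_lr, ms, data_range=1.0))
    l_spat = F.mse_loss(intensity, pan) + (1 - ssim(intensity, pan, data_range=1.0))
    l_ship = F.l1_loss(out, ship_ref)          # pseudo-supervision from the SHIP output
    return l_spec + l_spat + qnr_term + l_ship + l_direction
```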

Training setup: an RTX 4090 GPU, Adam optimizer (lr = 0.003), batch size 32, 1000 iterations.

Key Experimental Results

Main Results

Table 1: Quantitative Results at Full and Reduced Resolution (QB and WV3 Datasets)

| Method | QB \(D_\lambda\)↓ | QB QNR↑ | WV3 \(D_\lambda\)↓ | WV3 QNR↑ |
|---|---|---|---|---|
| ArbRPN | 0.0140 | 0.9582 | 0.0271 | 0.9383 |
| ArbRPN-C | 0.0030 | 0.9691 | 0.0042 | 0.9582 |
| LFormer | 0.0124 | 0.9602 | 0.0253 | 0.9227 |
| LFormer-C | 0.0053 | 0.9676 | 0.0049 | 0.9572 |
| PanMamba | 0.0134 | 0.9592 | 0.0152 | 0.9426 |
| PanMamba-C | 0.0050 | 0.9672 | 0.0051 | 0.9578 |

CLIPPan (denoted by the -C suffix) consistently improves all five backbone networks evaluated in the paper (three representative ones are shown above). ArbRPN-C reduces spectral distortion \(D_\lambda\) by 79% on QB, and LFormer-C reduces spatial distortion \(D_s\) by roughly 30% on WV3.

Ablation Study

Table 2: Ablation of Unsupervised Fusion Losses (WV3 Reduced Resolution, ArbRPN Backbone)

| Loss Combination | MPSNR↑ | ERGAS↓ | SAM↓ | Q2n↑ |
|---|---|---|---|---|
| \(\mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}}\) | 29.27 | 8.96 | 9.17 | 0.61 |
| \(\mathcal{L}_{\text{unsup}}\) | 32.19 | 5.88 | 6.66 | 0.71 |
| \(\mathcal{L}_{\text{unsup}} + \mathcal{L}_d\) | 32.37 | 5.75 | 6.55 | 0.74 |
| \(\mathcal{L}_{\text{unsup}} + \mathcal{L}_{\text{ship}} + \mathcal{L}_d\) | 34.72 | 4.49 | 5.54 | 0.80 |

Jointly using the semantic loss \(\mathcal{L}_d\) and the pseudo-supervision \(\mathcal{L}_{\text{ship}}\) achieves the best results, improving MPSNR by about 5.4 dB over the \(\mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}}\) baseline.

Table 3: Ablation of CLIP Fine-Tuning Losses — incrementally adding IntraMCL, InterMCL, and the \(\ell_1\) fusion-alignment loss \(\mathcal{L}_{\text{fusion}}\) yields consistent improvements.

Table 5: Ablation of Text Descriptions — Wald's protocol text achieves the best balance across all metrics, confirming that precise protocol-based text is critical for semantic supervision.

Key Findings

  • Even without GT, CLIPPan achieves performance close to supervised methods.
  • Consistent improvements are also observed at reduced resolution, indicating the framework is effective under both supervised and unsupervised settings.
  • Learnable residual convolutions for multispectral input outperform manual strategies such as PCA, RGB, and GBNIR.

Highlights & Insights

  1. Language as Supervision: This work is the first to convert fusion rules such as Wald's protocol into CLIP-based semantic supervision via text prompts, representing an elegant paradigm shift.
  2. Direction Vector Loss: Rather than directly comparing fixed text features, the method compares the consistency of "feature displacement directions before and after fusion," elegantly resolving the invariance issue of text features.
  3. Plug-and-Play Generality: Compatible with 5 different backbone networks with consistent gains, demonstrating high practical value.
  4. Bidirectional Implication: Beyond guiding fusion via protocols, the framework can in principle be used to evaluate protocol validity and even discover new protocols.

Limitations & Future Work

  1. The CLIP fine-tuning stage still relies on the BDSD algorithm to generate approximate HRMS labels, introducing additional priors.
  2. Text descriptions are currently hand-crafted fixed templates; learnable prompt tuning warrants further exploration.
  3. Validation is limited to WorldView-3 and QuickBird sensors; generalization to additional sensors remains to be verified.
  4. Pseudo-supervision \(\mathcal{L}_{\text{ship}}\) depends on the quality of the pre-trained SHIP network.

Related Work

  • CLIP-Adapter / CoOp / LoRA-CLIP: parameter-efficient fine-tuning strategies; this work follows the adapter paradigm.
  • RS-CLIP / GeoCLIP: CLIP adaptation for remote sensing; this work extends the approach to the pansharpening task.
  • Insights: the proposed paradigm can be generalized to other remote sensing image fusion tasks (e.g., hyperspectral–multispectral fusion); the idea of "using protocol text as supervision" is also applicable to other tasks with well-defined rules but no available GT.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Introducing vision-language models as supervisors for pansharpening represents a strong paradigm innovation.
  • Technical Depth: ⭐⭐⭐⭐ — The two-stage design is complete and the direction vector loss is elegantly conceived.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 5 backbones × 2 datasets, with comprehensive ablations (5 ablation groups).
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with well-motivated design choices.