CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening¶
Conference: AAAI 2026 · arXiv: 2511.10896 · Code: Jiabo-Liu/CLIPPan · Area: Model Compression · Keywords: Pansharpening, CLIP, Unsupervised, Vision-Language Model, Remote Sensing
TL;DR¶
This paper proposes CLIPPan, which fine-tunes CLIP in a parameter-efficient manner so that it recognizes multispectral (MS), panchromatic (PAN), and high-resolution multispectral (HRMS) image types and understands the pansharpening process, and then uses text prompts encoding Wald's protocol as semantic supervision, enabling full-resolution unsupervised pansharpening without ground truth. CLIPPan operates as a plug-and-play module compatible with arbitrary pansharpening backbone networks.
Background & Motivation¶
Pansharpening fuses spectrally rich multispectral (MS) images with spatially high-resolution panchromatic (PAN) images to produce high-resolution multispectral (HRMS) images, which is critical for remote sensing applications such as urban planning and environmental monitoring.
Core Problem: Existing deep learning methods rely heavily on ground truth (GT) for supervised training, yet GT is unavailable in real full-resolution scenarios. The common practice of training on downsampled simulated data introduces a severe domain gap when applied to real full-resolution images. Unsupervised methods avoid GT dependency but only exploit low-level pixel relationships between the fusion output and inputs, lacking high-level semantic guidance for the fusion objective.
Core Insight: If the model can be informed of "what the fusion target rules are" (e.g., Wald's protocol), high-level semantic supervision can constrain the fusion output to reside in the HRMS domain. CLIP's image-text alignment capability can naturally translate textually described fusion rules into supervision signals.
Method¶
Overall Architecture¶
CLIPPan proceeds in two stages:
Stage I — Vision-Language Alignment: Fine-tunes CLIP to (i) recognize LRMS/PAN/HRMS image types, (ii) understand remote sensing image content, and (iii) understand the pansharpening process (the MS+PAN→HRMS mapping).
Stage II — Language-Guided Unsupervised Pansharpening: Uses the fine-tuned CLIP as a fixed semantic supervisor, combined with low-level visual constraints to train the pansharpening network.
Key Designs¶
1. Parameter-Efficient Fine-Tuning Strategy
To preserve CLIP's strong generalization capability, only 6 lightweight adapter modules are inserted (3 on the visual side + 3 on the text side). Since CLIP's visual encoder is incompatible with the multi-band input of multispectral images, the original RGB input layer is replaced with a learnable convolutional layer.
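A minimal PyTorch sketch of this idea (the bottleneck adapter layout and helper names are illustrative assumptions, not the paper's exact implementation; the input-layer swap assumes an OpenAI-CLIP-style ViT that stores its patch embedding in `visual.conv1`):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter added residually to a frozen CLIP block
    (hypothetical layout; the paper's adapter design may differ)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual bottleneck


def replace_input_conv(visual: nn.Module, in_bands: int) -> None:
    """Swap the 3-channel RGB patch-embedding conv for a learnable conv that
    accepts `in_bands` spectral bands (e.g. 4 for QuickBird, 8 for WorldView-3)."""
    old = visual.conv1  # OpenAI-CLIP ViT keeps its patch embedding here
    visual.conv1 = nn.Conv2d(in_bands, old.out_channels,
                             kernel_size=old.kernel_size,
                             stride=old.stride, bias=False)
```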
2. Inter-Modal Contrastive Learning (InterMCL)
Binds each image type to its corresponding semantic space. Fixed text descriptions (rather than content-dependent ones) are used:

- MS: "a multispectral image"
- PAN: "a panchromatic image"
- HRMS: "High-quality reference image adhering to Wald's protocol: spectrally consistent with original data and spatially sharp"
A contrastive loss pulls image-text positive pairs together and pushes negative pairs apart:
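A standard CLIP-style formulation consistent with this description (the paper's exact expression may differ) is, for a batch of \(N\) image-text pairs,

\[
\mathcal{L}_{\text{inter}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\operatorname{sim}(F^{I}_{i}, F^{T}_{i})/\tau\big)}{\sum_{j=1}^{N}\exp\big(\operatorname{sim}(F^{I}_{i}, F^{T}_{j})/\tau\big)},
\]

where \(F^{I}_{i}\) and \(F^{T}_{i}\) are the CLIP embeddings of an image and its type description, \(\operatorname{sim}(\cdot,\cdot)\) is cosine similarity, and \(\tau\) is a temperature.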
Since real HRMS images are unavailable at full resolution, approximate HRMS references are generated online with the traditional BDSD algorithm.
3. Intra-Modal Contrastive Learning (IntraMCL)
Prevents feature collapse caused by using fixed text descriptions. LRMS/PAN/HRMS images from the same scene are treated as positive samples, while those from different scenes serve as negatives:
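In the same InfoNCE style (again a sketch rather than the paper's exact form), with \(F^{I}_{a,i}\) and \(F^{I}_{b,i}\) denoting features of two image types \(a, b \in \{\text{MS}, \text{PAN}, \text{HRMS}\}\) from scene \(i\),

\[
\mathcal{L}_{\text{intra}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\operatorname{sim}(F^{I}_{a,i}, F^{I}_{b,i})/\tau\big)}{\sum_{j=1}^{N}\exp\big(\operatorname{sim}(F^{I}_{a,i}, F^{I}_{b,j})/\tau\big)}.
\]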
This ensures feature diversity while promoting domain transfer from natural images to remote sensing images.
4. Fusion-Aware Alignment
Image Fusion Adapters (IFA) and Text Fusion Adapters (TFA) are introduced to learn fusion feature generation from MS+PAN features, aligning with HRMS/Wald's protocol features:
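One plausible form of the resulting alignment objective (operator names follow the text above; the exact distance used in the paper is not specified here) is

\[
\mathcal{L}_{\text{fusion}} = d\big(\operatorname{IFA}(F^{I}_{\text{MS}}, F^{I}_{\text{PAN}}),\, F^{I}_{\text{HRMS}}\big) + d\big(\operatorname{TFA}(F^{T}_{\text{MS}}, F^{T}_{\text{PAN}}),\, F^{T}_{\text{wald}}\big),
\]

where \(d(\cdot,\cdot)\) is a feature distance such as cosine or \(\ell_1\) distance.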
5. Direction Vector Semantic Supervision (Core of Stage II)
Directly applying an element-wise loss against the Wald's protocol text features is infeasible, since these features are identical for all images. Instead, the method exploits the consistency of feature displacement directions:
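A formulation consistent with this angular penalty (the paper's exact expression may differ) is

\[
\mathcal{L}_{d} = 1 - \cos\big(\Delta\mathbf{V}^{I}_{\text{MS}},\, \Delta\mathbf{V}^{T}_{\text{MS}}\big) = 1 - \frac{\langle \Delta\mathbf{V}^{I}_{\text{MS}},\, \Delta\mathbf{V}^{T}_{\text{MS}}\rangle}{\|\Delta\mathbf{V}^{I}_{\text{MS}}\|\;\|\Delta\mathbf{V}^{T}_{\text{MS}}\|},
\]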
where \(\Delta\mathbf{V}^I_{\text{MS}} = F^I_{\text{out}} - F^I_{\text{MS}}\) is the fusion displacement direction in image space, and \(\Delta\mathbf{V}^T_{\text{MS}} = F^T_{\text{wald}} - F^T_{\text{MS}}\) is the fusion target direction in text space. By penalizing angular deviation between directions in both spaces, the output is semantically guided toward the HRMS domain.
Loss & Training¶
Stage I: \(\mathcal{L}_{s1} = \mathcal{L}_{\text{inter}} + \mathcal{L}_{\text{intra}} + \mathcal{L}_{\text{fusion}}\)
Stage II Low-Level Visual Constraints:

- Spectral fidelity: \(\mathcal{L}_{\text{spec}} = \|\downarrow(\mathbf{I}_{\text{out}}) - \mathbf{I}_{\text{MS}}\|_2^2 + 1 - \text{SSIM}(\downarrow(\mathbf{I}_{\text{out}}), \mathbf{I}_{\text{MS}})\)
- Spatial sharpness: \(\mathcal{L}_{\text{spat}} = \|\phi(\mathbf{I}_{\text{out}}) - \mathbf{I}_{\text{PAN}}\|_2^2 + 1 - \text{SSIM}(\phi(\mathbf{I}_{\text{out}}), \mathbf{I}_{\text{PAN}})\)
- QNR trade-off: \(\mathcal{L}_{\text{QNR}} = 1 - (1-D_\lambda)(1-D_s)\)
- Pseudo-supervision: \(\mathcal{L}_{\text{ship}}\), using the output of a SHIP network trained at reduced resolution as a reference
Total Loss: \(\mathcal{L}_{s2} = \mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}} + \mathcal{L}_{\text{QNR}} + \mathcal{L}_{\text{ship}} + \mathcal{L}_d\)
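A minimal PyTorch sketch of how these terms could be assembled (the SSIM terms and the QNR indices are omitted for brevity; the feature interface and the band-mean approximation of \(\phi\) are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def direction_loss(f_out_img, f_ms_img, f_wald_txt, f_ms_txt):
    """Semantic direction loss L_d: penalize the angle between the fusion
    displacement in image space and the fusion target direction in text space."""
    dv_img = f_out_img - f_ms_img          # ΔV^I_MS
    dv_txt = f_wald_txt - f_ms_txt         # ΔV^T_MS
    return (1.0 - F.cosine_similarity(dv_img, dv_txt, dim=-1)).mean()

def stage2_loss(out, ms, pan, clip_feats, ship_ref, scale=4):
    """Low-level constraints + pseudo-supervision + semantic direction loss.
    `clip_feats` holds CLIP embeddings of the fused output, the MS input, the
    MS text prompt and the Wald's-protocol text prompt; `ship_ref` is the
    pseudo label from a reduced-resolution SHIP network (interface assumed)."""
    l_spec = F.mse_loss(F.avg_pool2d(out, scale), ms)        # spectral fidelity
    l_spat = F.mse_loss(out.mean(dim=1, keepdim=True), pan)  # spatial sharpness, φ ≈ band mean
    l_ship = F.l1_loss(out, ship_ref)                        # pseudo-supervision
    l_d = direction_loss(clip_feats["out"], clip_feats["ms_img"],
                         clip_feats["wald_txt"], clip_feats["ms_txt"])
    return l_spec + l_spat + l_ship + l_d                    # L_QNR omitted in this sketch
```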
Training setup: a single NVIDIA RTX 4090 GPU, Adam optimizer (lr = 0.003), batch size 32, 1000 iterations.
Key Experimental Results¶
Main Results¶
Table 1: Full-Resolution Quantitative Results (QB and WV3 Datasets, excerpt)
| Method | QB \(D_\lambda\)↓ | QB QNR↑ | WV3 \(D_\lambda\)↓ | WV3 QNR↑ |
|---|---|---|---|---|
| ArbRPN | 0.0140 | 0.9582 | 0.0271 | 0.9383 |
| ArbRPN-C | 0.0030 | 0.9691 | 0.0042 | 0.9582 |
| LFormer | 0.0124 | 0.9602 | 0.0253 | 0.9227 |
| LFormer-C | 0.0053 | 0.9676 | 0.0049 | 0.9572 |
| PanMamba | 0.0134 | 0.9592 | 0.0152 | 0.9426 |
| PanMamba-C | 0.0050 | 0.9672 | 0.0051 | 0.9578 |
CLIPPan variants (denoted by the -C suffix) consistently improve all five backbone networks evaluated in the paper (three are shown above). ArbRPN-C reduces spectral distortion \(D_\lambda\) by 79% on QB, and LFormer-C reduces spatial distortion \(D_s\) by approximately 30% on WV3.
Ablation Study¶
Table 2: Ablation of Unsupervised Fusion Losses (WV3 Reduced Resolution, ArbRPN Backbone)
| Loss Combination | MPSNR↑ | ERGAS↓ | SAM↓ | Q2n↑ |
|---|---|---|---|---|
| \(\mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}}\) | 29.27 | 8.96 | 9.17 | 0.61 |
| \(\mathcal{L}_{\text{unsup}}\) | 32.19 | 5.88 | 6.66 | 0.71 |
| \(\mathcal{L}_{\text{unsup}} + \mathcal{L}_d\) | 32.37 | 5.75 | 6.55 | 0.74 |
| \(\mathcal{L}_{\text{unsup}} + \mathcal{L}_{\text{ship}} + \mathcal{L}_d\) | 34.72 | 4.49 | 5.54 | 0.80 |
The joint use of the semantic loss \(\mathcal{L}_d\) and pseudo-supervision \(\mathcal{L}_{\text{ship}}\) achieves the best results, improving MPSNR by roughly 5.4 dB over the \(\mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}}\) baseline.
Table 3: Ablation of CLIP Fine-Tuning Losses — Incrementally adding IntraMCL, InterMCL, and \(\mathcal{L}_1\) each yields consistent improvements.
Table 5: Ablation of Text Descriptions — Wald's protocol text achieves the best balance across all metrics, confirming that precise protocol-based text is critical for semantic supervision.
Key Findings¶
- Even without GT, CLIPPan achieves performance close to supervised methods.
- Consistent improvements are also observed at reduced resolution, indicating the framework is effective not only in the full-resolution (GT-free) setting but also in the reduced-resolution setting where GT is available.
- Learnable residual convolutions for multispectral input outperform manual strategies such as PCA, RGB, and GBNIR.
Highlights & Insights¶
- Language as Supervision: This work is the first to convert fusion rules such as Wald's protocol into CLIP-based semantic supervision via text prompts, representing an elegant paradigm shift.
- Direction Vector Loss: Rather than directly comparing fixed text features, the method compares the consistency of "feature displacement directions before and after fusion," elegantly resolving the invariance issue of text features.
- Plug-and-Play Generality: Compatible with 5 different backbone networks with consistent gains, demonstrating high practical value.
- Bidirectional Implication: Beyond guiding fusion via protocols, the framework can in principle be used to evaluate protocol validity and even discover new protocols.
Limitations & Future Work¶
- The CLIP fine-tuning stage still relies on the BDSD algorithm to generate approximate HRMS labels, introducing additional priors.
- Text descriptions are currently hand-crafted fixed templates; learnable prompt tuning warrants further exploration.
- Validation is limited to WorldView-3 and QuickBird sensors; generalization to additional sensors remains to be verified.
- Pseudo-supervision \(\mathcal{L}_{\text{ship}}\) depends on the quality of the pre-trained SHIP network.
Related Work & Insights¶
- CLIP-Adapter / CoOp / LoRA-CLIP: Parameter-efficient fine-tuning strategies; this work follows the adapter paradigm.
- RS-CLIP / GeoCLIP: CLIP adaptation for remote sensing; this work extends the approach to the pansharpening task.
- Insights: The proposed paradigm can be generalized to other remote sensing image fusion tasks (e.g., hyperspectral–multispectral fusion); the idea of "using protocol text as supervision" is also applicable to other tasks with well-defined rules but no available GT.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Introducing vision-language models as supervisors for pansharpening represents a strong paradigm innovation.
- Technical Depth: ⭐⭐⭐⭐ — The two-stage design is complete and the direction vector loss is elegantly conceived.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 5 backbones × 2 datasets, with comprehensive ablations (5 ablation groups).
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with well-motivated design choices.