Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift¶
- Conference: CVPR 2026 Workshop (CVPRW)
- arXiv: 2604.08956
- Code: https://github.com/uga-gaim/2026_CVPRW_CloudPrompts
- Area: Image Segmentation
- Keywords: Domain Shift, Cloud Segmentation, Prompt Engineering, Low-Data Fine-Tuning, Vision-Language Models
TL;DR¶
This paper systematically demonstrates that prompt engineering cannot bridge the domain gap that vision-language models face in satellite cloud segmentation, and that fine-tuning on as little as 0.1% of the labeled data (~8 images) is enough to surpass every zero-shot prompting strategy.
Background & Motivation¶
Background: Vision-language models (e.g., CLIP/CLIPSeg) achieve strong performance on natural images, and prompt engineering has become the dominant deployment paradigm; by one estimate, roughly 70% of production AI systems rely on prompting rather than weight-level adaptation.
Limitations of Prior Work: Satellite imagery differs fundamentally from natural images — nadir viewpoints, multispectral sensors, and amorphous atmospheric phenomena (e.g., clouds, haze) stand in sharp contrast to the object-centric natural images used in CLIP pretraining. A severe linguistic gap also exists, as meteorological terminology such as "optically thin cirrus" is virtually absent from training corpora.
Key Challenge: This dual distributional shift — both visual and linguistic — constitutes a compounding mismatch. Prompt engineering presupposes that pretrained representations are sufficiently close to the target domain and that language can bridge the residual gap; this assumption fundamentally does not hold for satellite imagery.
Goal: (1) Quantify the degree to which prompt engineering fails under severe domain shift; (2) identify the minimum annotation-cost crossover point for supervised fine-tuning; (3) compare LoRA versus full fine-tuning across different data budgets.
Key Insight: Controlled experiments are conducted using CLIPSeg on the CloudSEN12+ dataset, with 60 prompt variants designed alongside fine-tuning experiments spanning data budgets from 0.1% to 100%.
Core Idea: Labeled data is not an expensive alternative to prompt engineering but rather the correct investment — just 8 annotated images suffice to outperform any prompting strategy.
Method¶
Overall Architecture¶
This paper presents a systematic empirical study rather than a novel method. The experimental pipeline consists of: (1) evaluating 60 prompt variants on CLIPSeg; (2) performing LoRA and full fine-tuning across data budgets from 0.1% to 100%; (3) analyzing per-class performance, the supervision dip phenomenon, and decision factors for method selection.
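As a rough illustration of step (1), the sketch below scores a handful of prompt variants with an off-the-shelf CLIPSeg checkpoint. It is not the authors' released code (see the linked repository for that); the checkpoint name, the toy prompt list, the binary thresholding, and the IoU bookkeeping are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Toy prompt list spanning the four strategy categories described in Key Designs
# (simple labels, domain terminology, appearance descriptors, contextual cues).
PROMPTS = [
    "cloud",                                             # simple label
    "optically thin cirrus cloud",                       # domain terminology
    "bright white diffuse region",                       # appearance descriptor
    "clouds in a satellite image of the Earth surface",  # contextual cue
]

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").eval()

def predict_mask(image: Image.Image, prompt: str, threshold: float = 0.5) -> torch.Tensor:
    """Binary cloud mask from CLIPSeg for a single image/prompt pair."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # low-resolution segmentation logits
    return (logits.sigmoid() > threshold).long()

def binary_iou(pred: torch.Tensor, target: torch.Tensor) -> float:
    inter = ((pred == 1) & (target == 1)).sum().item()
    union = ((pred == 1) | (target == 1)).sum().item()
    return inter / union if union else 1.0

def score_prompt(prompt, dataset) -> float:
    """Mean IoU of one prompt over (image, mask) pairs; ground-truth masks are
    assumed to be resized to the logit resolution beforehand."""
    scores = [binary_iou(predict_mask(img, prompt), gt) for img, gt in dataset]
    return sum(scores) / len(scores)
```

In the paper's setting, a loop like `score_prompt` would be run over all 60 variants on the CloudSEN12+ test set to establish the prompting ceiling.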
Key Designs¶
- Prompt Sensitivity Analysis Framework:
    - Function: Systematically evaluates the effectiveness of prompt engineering under domain shift.
    - Mechanism: Designs 60 prompt variants spanning four strategy categories — simple labels, domain-specific terminology, appearance descriptors, and contextual cues. Each variant is evaluated on the CloudSEN12+ test set by mIoU.
    - Design Motivation: Establishes a performance ceiling for prompt engineering, demonstrating that linguistic refinement cannot compensate for the visual-linguistic domain gap.
- Composite Loss Function Training:
    - Function: Addresses class imbalance in cloud segmentation.
    - Mechanism: Combines Focal Loss (\(\alpha=0.75, \gamma=2.0\)), Tversky Loss (\(\alpha_T=0.3, \beta_T=0.7\)), and Boundary Loss with weights of 0.8, 1.0, and 0.1, respectively (see the sketch after this list).
    - Design Motivation: Tversky Loss penalizes false negatives more heavily, which is critical since thin clouds and cloud shadows occupy a small fraction of image area; Focal Loss handles foreground-background imbalance for easy-to-classify pixels; Boundary Loss improves cloud edge delineation.
- Supervision Dip Phenomenon Analysis:
    - Function: Reveals hidden risks under extremely low data budgets.
    - Mechanism: At 0.5–1% data volume, fine-tuning temporarily degrades performance on spectrally ambiguous categories (thin clouds, cloud shadows), recovering at 2.5–5% data.
    - Design Motivation: Alerts practitioners that aggregate mIoU metrics may obscure class-level performance degradation, providing more granular guidance for data budget decisions.
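A minimal PyTorch sketch of the composite objective described in the second item above, written for the binary case. The multi-class formulation, reductions, and construction of the distance map for the boundary term are assumptions; the boundary term here follows the signed-distance-map form of Kervadec et al.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.75, gamma=2.0):
    """Alpha-balanced binary focal loss on per-pixel logits."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    weight = alpha * target + (1 - alpha) * (1 - target)
    return (weight * (1 - p_t) ** gamma * bce).mean()

def tversky_loss(logits, target, alpha=0.3, beta=0.7, eps=1e-6):
    """Tversky loss; beta > alpha penalizes false negatives more heavily."""
    probs = logits.sigmoid()
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def boundary_loss(logits, dist_map):
    """Boundary loss using a precomputed signed distance map of the
    ground-truth boundary (distance-map construction not shown)."""
    return (logits.sigmoid() * dist_map).mean()

def composite_loss(logits, target, dist_map,
                   w_focal=0.8, w_tversky=1.0, w_boundary=0.1):
    """Weighted sum with the weights reported in the paper (0.8 / 1.0 / 0.1)."""
    return (w_focal * focal_loss(logits, target)
            + w_tversky * tversky_loss(logits, target)
            + w_boundary * boundary_loss(logits, dist_map))
```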
Loss & Training¶
Full fine-tuning uses a learning rate of \(5 \times 10^{-5}\) for 20 epochs; LoRA uses a learning rate of \(2 \times 10^{-4}\) with rank=32, \(\alpha=64\), for 15 epochs. Low-data experiments are averaged over 10 independent runs per data percentage.
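For reference, a LoRA setup matching the reported rank and alpha might look like the sketch below, using the peft library. The target modules, dropout, and optimizer choice are assumptions not stated in the summary.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import CLIPSegForImageSegmentation

model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Reported hyperparameters: rank 32, alpha 64, lr 2e-4, 15 epochs.
# Which attention projections are adapted is an assumption here.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable

# Optimizer choice is an assumption; the summary reports only the learning rates
# (2e-4 for LoRA, 5e-5 for full fine-tuning).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)
```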
Key Experimental Results¶
Main Results¶
| Dataset / Method | Metric | Result | Zero-shot Baseline | Gain |
|---|---|---|---|---|
| CloudSEN12+ (Zero-shot) | mIoU | 0.255 | - | Baseline |
| CloudSEN12+ (Best Prompt) | mIoU | 0.222 | 0.255 | −12.9% |
| CloudSEN12+ (Worst Prompt) | mIoU | 0.068 | 0.255 | −73.3% |
| CloudSEN12+ FFT 0.1% (~8 images) | mIoU | >0.255 | 0.255 | Exceeds zero-shot |
| CloudSEN12+ FFT 10% | mIoU | 0.57 | 0.255 | +123.5% |
| CloudSEN12+ FFT 100% | mIoU | 0.66 | 0.255 | +158.8% |
| CloudSEN12+ LoRA 100% | mIoU | 0.60 | 0.255 | +135.3% |
Ablation Study¶
| Configuration | mIoU | Notes |
|---|---|---|
| Zero-shot baseline | 0.255 | Simple label prompt |
| 60 prompt variants | 0.07–0.222 | All below zero-shot |
| FFT vs. LoRA gap | 0.04–0.07 | FFT consistently leads |
| 0.1% data FFT | >0.255 | ~8 images surpass zero-shot |
| 5–10% data FFT | ~85% of peak mIoU | Rapid convergence zone |
Key Findings¶
- All 60 prompt variants fall below the zero-shot baseline; exclusive prompts (e.g., "not cloud") perform worst, as CLIP's contrastive training never endows negation with visual semantics.
- Full fine-tuning consistently outperforms LoRA by 0.03–0.09 mIoU, with the largest gaps on spectrally ambiguous categories.
- Performance growth follows a logarithmic curve, with diminishing returns beyond 30% data; peak marginal efficiency occurs at 2.5% data.
Highlights & Insights¶
- Extremely Low Annotation Crossover Point: Just 8 annotated images suffice to outperform any prompting strategy, challenging the view of labeled data as an expensive last resort. For any application facing severe domain shift, a small amount of annotation is the most cost-effective investment.
- Discovery of the Supervision Dip: At 0.5–1% data, the model temporarily degrades on difficult categories, a phenomenon masked by aggregate metrics (a per-class IoU check like the sketch below makes it visible). This finding carries important cautionary implications for all low-data fine-tuning scenarios.
- FFT vs. LoRA Gap Stems from Representational Capacity, Not Data Efficiency: The performance gap remains stable across data budgets, indicating that full fine-tuning's advantage derives from a larger representational adaptation space.
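Because the supervision dip is invisible in aggregate mIoU, a per-class IoU breakdown is the natural diagnostic. A minimal sketch follows; the CloudSEN12+ class names and integer label encoding used here are assumptions based on the categories mentioned above.

```python
import numpy as np

# Assumed CloudSEN12+ label set and integer encoding.
CLASSES = ["clear", "thick cloud", "thin cloud", "cloud shadow"]

def per_class_iou(pred: np.ndarray, target: np.ndarray) -> dict:
    """IoU per class; mIoU is their mean, which can mask a drop on
    individual classes such as thin cloud or cloud shadow."""
    ious = {}
    for c, name in enumerate(CLASSES):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious[name] = inter / union if union else float("nan")
    return ious
```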
Limitations & Future Work¶
- Only RGB three-channel inputs are used, leaving Sentinel-2's 13 multispectral bands unexploited; multispectral inputs may yield substantially larger gains.
- Only the CLIPSeg model family is evaluated; while its architectural patterns are shared with LSeg/OpenSeg and similar models, direct validation on those architectures remains necessary.
- Domain-adaptive pretraining approaches (e.g., RemoteCLIP) are not explored, which may alter the prompting-versus-fine-tuning trade-off.
- As a CVPRW workshop paper, it has a relatively limited experimental scope.
Related Work & Insights¶
- vs. RemoteCLIP/RS-CLIP: These domain-adapted models require large-scale remote sensing corpora and substantial compute for pretraining, whereas this paper demonstrates that straightforward fine-tuning is superior in data efficiency.
- vs. CoOp/CoCoOp: Learnable prompt methods optimize within the same embedding space and face the same representational ceiling — the bottleneck lies in the misalignment between the visual encoder and satellite spectral imagery.
Rating¶
- Novelty: ⭐⭐⭐ Experimental findings are valuable, though the methodology itself is not novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ 60 prompt variants + systematic data budget sweep + 10 repeated runs.
- Writing Quality: ⭐⭐⭐⭐ Conclusions are clear and arguments are rigorous.
- Value: ⭐⭐⭐⭐ Offers direct practical guidance for remote sensing practitioners.