
Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

Conference: CVPR 2026 Workshops · arXiv: 2604.08956 · Code: https://github.com/uga-gaim/2026_CVPRW_CloudPrompts · Area: Image Segmentation · Keywords: Domain Shift, Cloud Segmentation, Prompt Engineering, Low-Data Fine-Tuning, Vision-Language Models

TL;DR

This paper systematically demonstrates that prompt engineering fails to bridge the domain gap vision-language models face in satellite cloud segmentation (every one of 60 prompt variants scores below the zero-shot baseline), and that fine-tuning on as little as 0.1% of the labeled data (~8 images) suffices to surpass all zero-shot prompting strategies.

Background & Motivation

Background: Vision-language models (e.g., CLIP/CLIPSeg) achieve strong performance on natural images, and prompt engineering has become the dominant deployment paradigm; reportedly, approximately 70% of production AI systems rely on prompting rather than weight-level adaptation.

Limitations of Prior Work: Satellite imagery differs fundamentally from natural imagery: nadir viewpoints, multispectral sensors, and amorphous atmospheric phenomena (e.g., clouds, haze) contrast sharply with the object-centric photographs used in CLIP pretraining. A severe linguistic gap compounds this, as meteorological terminology such as "optically thin cirrus" is virtually absent from web-scale training corpora.

Key Challenge: This dual distributional shift — both visual and linguistic — constitutes a compounding mismatch. Prompt engineering presupposes that pretrained representations are sufficiently close to the target domain and that language can bridge the residual gap; this assumption fundamentally does not hold for satellite imagery.

Goal: (1) Quantify the degree to which prompt engineering fails under severe domain shift; (2) identify the minimum annotation-cost crossover point for supervised fine-tuning; (3) compare LoRA versus full fine-tuning across different data budgets.

Approach: Controlled experiments are conducted with CLIPSeg on the CloudSEN12+ dataset: 60 prompt variants are evaluated alongside fine-tuning experiments spanning data budgets from 0.1% to 100%.

Core Idea: Labeled data is not an expensive alternative to prompt engineering but rather the correct investment — just 8 annotated images suffice to outperform any prompting strategy.

Method

Overall Architecture

This paper presents a systematic empirical study rather than a novel method. The experimental pipeline consists of: (1) evaluating 60 prompt variants on CLIPSeg; (2) performing LoRA and full fine-tuning across data budgets from 0.1% to 100%; (3) analyzing per-class performance, the supervision dip phenomenon, and decision factors for method selection.

Key Designs

  1. Prompt Sensitivity Analysis Framework:

    • Function: Systematically evaluates the effectiveness of prompt engineering under domain shift.
    • Mechanism: Designs 60 prompt variants spanning four strategy categories (simple labels, domain-specific terminology, appearance descriptors, and contextual cues); each variant is scored by mIoU on the CloudSEN12+ test set (a minimal evaluation sketch follows this list).
    • Design Motivation: Establishes a performance ceiling for prompt engineering, demonstrating that linguistic refinement cannot compensate for the visual-linguistic domain gap.
  2. Composite Loss Function Training:

    • Function: Addresses class imbalance in cloud segmentation.
    • Mechanism: Combines Focal Loss (\(\alpha=0.75, \gamma=2.0\)), Tversky Loss (\(\alpha_T=0.3, \beta_T=0.7\)), and Boundary Loss with weights of 0.8, 1.0, and 0.1, respectively (a loss sketch appears under Loss & Training below).
    • Design Motivation: Tversky Loss penalizes false negatives more heavily, which is critical because thin clouds and cloud shadows occupy only a small fraction of the image area; Focal Loss down-weights easy pixels so training focuses on hard foreground-background cases; Boundary Loss sharpens cloud-edge delineation.
  3. Supervision Dip Phenomenon Analysis:

    • Function: Reveals hidden risks under extremely low data budgets.
    • Mechanism: At 0.5–1% data volume, fine-tuning temporarily degrades performance on spectrally ambiguous categories (thin clouds, cloud shadows), recovering at 2.5–5% data.
    • Design Motivation: Alerts practitioners that aggregate mIoU metrics may obscure class-level performance degradation, providing more granular guidance for data budget decisions.
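
To make the prompt-evaluation loop concrete, here is a minimal sketch using the public CLIPSeg checkpoint on Hugging Face. The prompt strings are illustrative stand-ins for the paper's 60 variants, and the 0.5 binarization threshold is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# One illustrative prompt per strategy category (not the paper's actual texts).
PROMPTS = [
    "cloud",                            # simple label
    "optically thin cirrus",            # domain-specific terminology
    "bright white amorphous region",    # appearance descriptor
    "clouds in a satellite image",      # contextual cue
]

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").eval()

@torch.no_grad()
def prompt_iou(image: Image.Image, gt_mask: torch.Tensor, prompt: str) -> float:
    """Binary IoU of one prompt's prediction against a ground-truth cloud mask."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    logits = model(**inputs).logits.squeeze()       # (352, 352) decoder output
    pred = torch.sigmoid(logits) > 0.5              # threshold is an assumption
    gt = torch.nn.functional.interpolate(
        gt_mask[None, None].float(), size=pred.shape, mode="nearest"
    ).squeeze().bool()
    union = (pred | gt).sum().item()
    return ((pred & gt).sum().item() / union) if union else 1.0

# Rank variants by mean IoU over a labeled test split:
# scores = {p: sum(prompt_iou(im, m, p) for im, m in test_set) / len(test_set)
#           for p in PROMPTS}
```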

Loss & Training

Full fine-tuning uses a learning rate of \(5 \times 10^{-5}\) for 20 epochs; LoRA uses a learning rate of \(2 \times 10^{-4}\) with rank=32, \(\alpha=64\), for 15 epochs. Low-data experiments are averaged over 10 independent runs per data percentage.
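
The composite loss maps directly to a short PyTorch sketch. The focal and Tversky terms follow their standard binary formulations with the hyperparameters above; the boundary term is assumed to take the Kervadec-style signed-distance form, since the exact variant is not pinned down here.

```python
# Minimal sketch of the composite loss (weights 0.8 / 1.0 / 0.1).
# The boundary term assumes Kervadec-style signed distance maps, precomputed
# from ground-truth masks (e.g. scipy.ndimage.distance_transform_edt);
# the paper's exact boundary-loss variant may differ.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss: down-weights easy pixels via the (1 - p_t)^gamma factor."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def tversky_loss(logits, targets, alpha_t=0.3, beta_t=0.7, eps=1e-6):
    """Tversky loss: beta_t > alpha_t penalizes false negatives more heavily."""
    p = torch.sigmoid(logits)
    tp = (p * targets).sum()
    fp = (p * (1 - targets)).sum()
    fn = ((1 - p) * targets).sum()
    return 1.0 - (tp + eps) / (tp + alpha_t * fp + beta_t * fn + eps)

def boundary_loss(logits, dist_maps):
    """Signed-distance boundary loss (distance is negative inside the GT region)."""
    return (torch.sigmoid(logits) * dist_maps).mean()

def composite_loss(logits, targets, dist_maps):
    return (0.8 * focal_loss(logits, targets)
            + 1.0 * tversky_loss(logits, targets)
            + 0.1 * boundary_loss(logits, dist_maps))
```

For the LoRA runs, the reported rank and scaling translate to a one-line peft config; the target modules and the choice of AdamW are assumptions, as they are not specified here.

```python
# Hypothetical LoRA setup via peft; target_modules and AdamW are assumptions.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"])
lora_model = get_peft_model(model, lora_cfg)   # `model` = the CLIPSeg network
optimizer = torch.optim.AdamW(lora_model.parameters(), lr=2e-4)  # 15 epochs
```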

Key Experimental Results

Main Results

| Method (CloudSEN12+) | Metric | Score | Zero-shot Ref. | Gain vs. Zero-shot |
|---|---|---|---|---|
| Zero-shot baseline | mIoU | 0.255 | – | baseline |
| Best prompt variant | mIoU | 0.222 | 0.255 | −12.9% |
| Worst prompt variant | mIoU | 0.068 | 0.255 | −73.3% |
| FFT, 0.1% data (~8 images) | mIoU | >0.255 | 0.255 | exceeds zero-shot |
| FFT, 10% data | mIoU | 0.57 | 0.255 | +123.5% |
| FFT, 100% data | mIoU | 0.66 | 0.255 | +158.8% |
| LoRA, 100% data | mIoU | 0.60 | 0.255 | +135.3% |

Ablation Study

| Configuration | mIoU | Notes |
|---|---|---|
| Zero-shot baseline | 0.255 | Simple label prompt |
| 60 prompt variants | 0.068–0.222 | All below the zero-shot baseline |
| FFT vs. LoRA gap | 0.04–0.07 | FFT consistently leads |
| FFT, 0.1% data | >0.255 | ~8 images surpass zero-shot |
| FFT, 5–10% data | ~85% of peak mIoU | Rapid-convergence zone |

Key Findings

  • All 60 prompt variants fall below the zero-shot baseline; exclusive prompts (e.g., "not cloud") perform worst, as CLIP's contrastive training never endows negation with visual semantics.
  • Full fine-tuning consistently outperforms LoRA by 0.04–0.07 mIoU, with the largest gaps on spectrally ambiguous categories.
  • Performance growth follows a logarithmic curve, with diminishing returns beyond 30% data; peak marginal efficiency occurs at 2.5% data.

Highlights & Insights

  • Extremely Low Annotation Crossover Point: Just 8 annotated images suffice to outperform any prompting strategy, challenging the assumption that labeled data is an expensive substitute. For any application with severe domain shift, a small amount of annotation represents the most cost-effective investment.
  • Discovery of the Supervision Dip: At 0.5–1% data, the model temporarily degrades on difficult categories, a phenomenon masked by aggregate metrics (see the per-class IoU sketch after this list). This finding is an important caution for any low-data fine-tuning scenario.
  • FFT vs. LoRA Gap Stems from Representational Capacity, Not Data Efficiency: The performance gap remains stable across data budgets, indicating that full fine-tuning's advantage derives from a larger representational adaptation space.
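
Because the dip is invisible in aggregate mIoU, surfacing it requires tracking IoU per class across data budgets. A minimal sketch, assuming CloudSEN12's four-class labeling (the class ids are an assumption of this sketch):

```python
import torch

def per_class_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int = 4):
    """IoU per class for integer label maps; NaN where a class is absent.

    Assumed class ids: 0 clear, 1 thick cloud, 2 thin cloud, 3 cloud shadow.
    """
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = (p | t).sum().item()
        ious.append(((p & t).sum().item() / union) if union else float("nan"))
    return ious

# Plotting ious[2] (thin cloud) and ious[3] (shadow) against the data budget
# exposes the 0.5–1% dip that the class-averaged mIoU smooths over.
```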

Limitations & Future Work

  • Only three RGB channels are used, leaving Sentinel-2's full 13-band multispectral signal unexploited; multispectral inputs may yield substantially larger gains.
  • Only the CLIPSeg model family is evaluated; while its architectural patterns are shared with LSeg/OpenSeg and similar models, direct validation on those architectures remains necessary.
  • Domain-adaptive pretraining approaches (e.g., RemoteCLIP) are not explored, which may alter the prompting-versus-fine-tuning trade-off.
  • As a CVPRW paper, it has a relatively limited experimental scope.
  • vs. RemoteCLIP/RS-CLIP: These domain-adapted models require large-scale remote sensing corpora and substantial compute for pretraining, whereas this paper demonstrates that straightforward fine-tuning is superior in data efficiency.
  • vs. CoOp/CoCoOp: Learnable prompt methods optimize within the same embedding space and face the same representational ceiling — the bottleneck lies in the misalignment between the visual encoder and satellite spectral imagery.

Rating

  • Novelty: ⭐⭐⭐ Experimental findings are valuable, though the methodology itself is not novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 60 prompt variants + systematic data budget sweep + 10 repeated runs.
  • Writing Quality: ⭐⭐⭐⭐ Conclusions are clear and arguments are rigorous.
  • Value: ⭐⭐⭐⭐ Offers direct practical guidance for remote sensing practitioners.