Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift¶
- Conference: CVPR 2026 Workshop (CVPRW)
- arXiv: 2604.08956
- Code: https://github.com/uga-gaim/2026_CVPRW_CloudPrompts
- Area: Image Segmentation
- Keywords: Domain Shift, Cloud Segmentation, Prompt Engineering, Low-Data Fine-Tuning, Vision-Language Models
TL;DR¶
This paper systematically demonstrates that prompt engineering cannot bridge the domain gap that vision-language models face in satellite cloud segmentation, and that fine-tuning on as little as 0.1% of the labeled data (~8 images) is enough to surpass every zero-shot prompting strategy.
Background & Motivation¶
Background: Vision-language models (e.g., CLIP/CLIPSeg) achieve strong performance on natural images, and prompt engineering has become the dominant deployment paradigm; by one estimate, roughly 70% of production AI systems rely on prompting rather than weight-level adaptation.
Limitations of Prior Work: Satellite imagery differs fundamentally from natural images — nadir viewpoints, multispectral sensors, and amorphous atmospheric phenomena (e.g., clouds, haze) stand in sharp contrast to the object-centric natural images used in CLIP pretraining. A severe linguistic gap also exists, as meteorological terminology such as "optically thin cirrus" is virtually absent from training corpora.
Key Challenge: This dual distributional shift — both visual and linguistic — constitutes a compounding mismatch. Prompt engineering presupposes that pretrained representations are sufficiently close to the target domain and that language can bridge the residual gap; this assumption fundamentally does not hold for satellite imagery.
Goal: (1) Quantify the degree to which prompt engineering fails under severe domain shift; (2) identify the minimum annotation-cost crossover point for supervised fine-tuning; (3) compare LoRA versus full fine-tuning across different data budgets.
Key Insight: Controlled experiments are conducted using CLIPSeg on the CloudSEN12+ dataset, with 60 prompt variants designed alongside fine-tuning experiments spanning data budgets from 0.1% to 100%.
Core Idea: Labeled data is not an expensive alternative to prompt engineering but rather the correct investment — just 8 annotated images suffice to outperform any prompting strategy.
Method¶
Overall Architecture¶
This paper presents a systematic empirical study rather than a novel method. The experimental pipeline consists of: (1) evaluating 60 prompt variants on CLIPSeg; (2) performing LoRA and full fine-tuning across data budgets from 0.1% to 100%; (3) analyzing per-class performance, the supervision dip phenomenon, and decision factors for method selection.
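As a rough illustration of step (1), the sketch below scores a handful of prompt variants with an off-the-shelf CLIPSeg checkpoint. It is not the authors' released code (see the linked repository for that); the checkpoint name, the toy prompt list, the binary thresholding, and the IoU bookkeeping are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Toy prompt list spanning the four strategy categories described in Key Designs
# (simple labels, domain terminology, appearance descriptors, contextual cues).
PROMPTS = [
    "cloud",                                             # simple label
    "optically thin cirrus cloud",                       # domain terminology
    "bright white diffuse region",                       # appearance descriptor
    "clouds in a satellite image of the Earth surface",  # contextual cue
]

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").eval()

def predict_mask(image: Image.Image, prompt: str, threshold: float = 0.5) -> torch.Tensor:
    """Binary cloud mask from CLIPSeg for a single image/prompt pair."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # low-resolution segmentation logits
    return (logits.sigmoid() > threshold).long()

def binary_iou(pred: torch.Tensor, target: torch.Tensor) -> float:
    inter = ((pred == 1) & (target == 1)).sum().item()
    union = ((pred == 1) | (target == 1)).sum().item()
    return inter / union if union else 1.0

def score_prompt(prompt, dataset) -> float:
    """Mean IoU of one prompt over (image, mask) pairs; ground-truth masks are
    assumed to be resized to the logit resolution beforehand."""
    scores = [binary_iou(predict_mask(img, prompt), gt) for img, gt in dataset]
    return sum(scores) / len(scores)
```

In the paper's setting, a loop like `score_prompt` would be run over all 60 variants on the CloudSEN12+ test set to establish the prompting ceiling.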
Key Designs¶
- Prompt Sensitivity Analysis Framework:
    - Function: Systematically evaluates the effectiveness of prompt engineering under domain shift.
    - Mechanism: Designs 60 prompt variants spanning four strategy categories — simple labels, domain-specific terminology, appearance descriptors, and contextual cues. Each variant is evaluated on the CloudSEN12+ test set by mIoU.
    - Design Motivation: Establishes a performance ceiling for prompt engineering, demonstrating that linguistic refinement cannot compensate for the visual-linguistic domain gap.
- Composite Loss Function Training:
    - Function: Addresses class imbalance in cloud segmentation.
    - Mechanism: Combines Focal Loss (\(\alpha=0.75, \gamma=2.0\)), Tversky Loss (\(\alpha_T=0.3, \beta_T=0.7\)), and Boundary Loss with weights of 0.8, 1.0, and 0.1, respectively (see the sketch after this list).
    - Design Motivation: Tversky Loss penalizes false negatives more heavily, which is critical since thin clouds and cloud shadows occupy a small fraction of image area; Focal Loss handles foreground-background imbalance for easy-to-classify pixels; Boundary Loss improves cloud edge delineation.
- Supervision Dip Phenomenon Analysis:
    - Function: Reveals hidden risks under extremely low data budgets.
    - Mechanism: At 0.5–1% data volume, fine-tuning temporarily degrades performance on spectrally ambiguous categories (thin clouds, cloud shadows), recovering at 2.5–5% data.
    - Design Motivation: Alerts practitioners that aggregate mIoU metrics may obscure class-level performance degradation, providing more granular guidance for data budget decisions.
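A minimal PyTorch sketch of the composite objective described in the second item above, written for the binary case. The multi-class formulation, reductions, and construction of the distance map for the boundary term are assumptions; the boundary term here follows the signed-distance-map form of Kervadec et al.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.75, gamma=2.0):
    """Alpha-balanced binary focal loss on per-pixel logits."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    weight = alpha * target + (1 - alpha) * (1 - target)
    return (weight * (1 - p_t) ** gamma * bce).mean()

def tversky_loss(logits, target, alpha=0.3, beta=0.7, eps=1e-6):
    """Tversky loss; beta > alpha penalizes false negatives more heavily."""
    probs = logits.sigmoid()
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def boundary_loss(logits, dist_map):
    """Boundary loss using a precomputed signed distance map of the
    ground-truth boundary (distance-map construction not shown)."""
    return (logits.sigmoid() * dist_map).mean()

def composite_loss(logits, target, dist_map,
                   w_focal=0.8, w_tversky=1.0, w_boundary=0.1):
    """Weighted sum with the weights reported in the paper (0.8 / 1.0 / 0.1)."""
    return (w_focal * focal_loss(logits, target)
            + w_tversky * tversky_loss(logits, target)
            + w_boundary * boundary_loss(logits, dist_map))
```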
Loss & Training¶
Full fine-tuning uses a learning rate of \(5 \times 10^{-5}\) for 20 epochs; LoRA uses a learning rate of \(2 \times 10^{-4}\) with rank=32, \(\alpha=64\), for 15 epochs. Low-data experiments are averaged over 10 independent runs per data percentage.
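For reference, a LoRA setup matching the reported rank and alpha might look like the sketch below, using the peft library. The target modules, dropout, and optimizer choice are assumptions not stated in the summary.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import CLIPSegForImageSegmentation

model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

# Reported hyperparameters: rank 32, alpha 64, lr 2e-4, 15 epochs.
# Which attention projections are adapted is an assumption here.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable

# Optimizer choice is an assumption; the summary reports only the learning rates
# (2e-4 for LoRA, 5e-5 for full fine-tuning).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)
```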
Key Experimental Results¶
Main Results¶
| Dataset / Method | Metric | Result | Zero-shot Baseline | Gain |
|---|---|---|---|---|
| CloudSEN12+ (Zero-shot) | mIoU | 0.255 | - | Baseline |
| CloudSEN12+ (Best Prompt) | mIoU | 0.222 | 0.255 | −12.9% |
| CloudSEN12+ (Worst Prompt) | mIoU | 0.068 | 0.255 | −73.3% |
| CloudSEN12+ FFT 0.1% (~8 images) | mIoU | >0.255 | 0.255 | Exceeds zero-shot |
| CloudSEN12+ FFT 10% | mIoU | 0.57 | 0.255 | +123.5% |
| CloudSEN12+ FFT 100% | mIoU | 0.66 | 0.255 | +158.8% |
| CloudSEN12+ LoRA 100% | mIoU | 0.60 | 0.255 | +135.3% |
Ablation Study¶
| Configuration | mIoU | Notes |
|---|---|---|
| Zero-shot baseline | 0.255 | Simple label prompt |
| 60 prompt variants | 0.07–0.222 | All below zero-shot |
| FFT vs. LoRA gap | 0.04–0.07 | FFT consistently leads |
| 0.1% data FFT | >0.255 | ~8 images surpass zero-shot |
| 5–10% data FFT | ~85% of peak mIoU | Rapid convergence zone |
Key Findings¶
- All 60 prompt variants fall below the zero-shot baseline; exclusive prompts (e.g., "not cloud") perform worst, as CLIP's contrastive training never endows negation with visual semantics.
- Full fine-tuning consistently outperforms LoRA by 0.03–0.09 mIoU, with the largest gaps on spectrally ambiguous categories.
- Performance growth follows a logarithmic curve, with diminishing returns beyond 30% data; peak marginal efficiency occurs at 2.5% data.
Highlights & Insights¶
- Extremely Low Annotation Crossover Point: Just 8 annotated images suffice to outperform any prompting strategy, challenging the view of labeled data as an expensive last resort. For any application facing severe domain shift, a small amount of annotation is the most cost-effective investment.
- Discovery of the Supervision Dip: At 0.5–1% data, the model temporarily degrades on difficult categories, a phenomenon masked by aggregate metrics (a per-class IoU check like the sketch below makes it visible). This finding carries important cautionary implications for all low-data fine-tuning scenarios.
- FFT vs. LoRA Gap Stems from Representational Capacity, Not Data Efficiency: The performance gap remains stable across data budgets, indicating that full fine-tuning's advantage derives from a larger representational adaptation space.
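Because the supervision dip is invisible in aggregate mIoU, a per-class IoU breakdown is the natural diagnostic. A minimal sketch follows; the CloudSEN12+ class names and integer label encoding used here are assumptions based on the categories mentioned above.

```python
import numpy as np

# Assumed CloudSEN12+ label set and integer encoding.
CLASSES = ["clear", "thick cloud", "thin cloud", "cloud shadow"]

def per_class_iou(pred: np.ndarray, target: np.ndarray) -> dict:
    """IoU per class; mIoU is their mean, which can mask a drop on
    individual classes such as thin cloud or cloud shadow."""
    ious = {}
    for c, name in enumerate(CLASSES):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious[name] = inter / union if union else float("nan")
    return ious
```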
Limitations & Future Work¶
- Only RGB three-channel inputs are used, leaving Sentinel-2's 13 multispectral bands unexploited; multispectral inputs may yield substantially larger gains.
- Only the CLIPSeg model family is evaluated; while its architectural patterns are shared with LSeg/OpenSeg and similar models, direct validation on those architectures remains necessary.
- Domain-adaptive pretraining approaches (e.g., RemoteCLIP) are not explored, which may alter the prompting-versus-fine-tuning trade-off.
- As a CVPRW workshop paper, it has a relatively limited experimental scope.
Related Work & Insights¶
- vs. RemoteCLIP/RS-CLIP: These domain-adapted models require large-scale remote sensing corpora and substantial compute for pretraining, whereas this paper demonstrates that straightforward fine-tuning is superior in data efficiency.
- vs. CoOp/CoCoOp: Learnable prompt methods optimize within the same embedding space and face the same representational ceiling — the bottleneck lies in the misalignment between the visual encoder and satellite spectral imagery.
Rating¶
- Novelty: ⭐⭐⭐ Experimental findings are valuable, though the methodology itself is not novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ 60 prompt variants + systematic data budget sweep + 10 repeated runs.
- Writing Quality: ⭐⭐⭐⭐ Conclusions are clear and arguments are rigorous.
- Value: ⭐⭐⭐⭐ Offers direct practical guidance for remote sensing practitioners.